Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Christoph Hellwig @ 2016-04-07 14:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: lsf, linux-mm, netdev@vger.kernel.org, Brenden Blanco,
	James Bottomley, Tom Herbert, lsf-pc, Alexei Starovoitov
In-Reply-To: <20160407161715.52635cac@redhat.com>

This is also very interesting for storage targets, which face the same
issue.  SCST has a mode where it caches some fully constructed SGLs,
which is probably very similar to what NICs want to do.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: Boot failure when using NFS on OMAP based evms
From: Franklin S Cooper Jr. @ 2016-04-07 14:36 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Sam Kumar, Willem de Bruijn, Eric Dumazet, tony, mugunthanvnm,
	Network Development, LKML, David Miller, kuznet, jmorris,
	yoshfuji, kaber, nsekhar, linux-omap
In-Reply-To: <CAF=yD-L=HZtD+LaKt1y4OcY0CQKOk_jmXZcQcyXjdfrPLpAhEQ@mail.gmail.com>



On 04/06/2016 09:22 PM, Willem de Bruijn wrote:
> On Wed, Apr 6, 2016 at 7:12 PM, Franklin S Cooper Jr. <fcooper@ti.com> wrote:
>> Hi All,
>>
>> Currently linux-next is failing to boot via NFS on my AM335x GP evm,
>> AM437x GP evm and Beagle X15. I bisected the problem down to the commit
>> "udp: remove headers from UDP packets before queueing".
>>
>> I had to revert the following three commits to get things working again:
>>
>> e6afc8ace6dd5cef5e812f26c72579da8806f5ac
>> udp: remove headers from UDP packets before queueing
>>
>> 627d2d6b550094d88f9e518e15967e7bf906ebbf
>> udp: enable MSG_PEEK at non-zero offset
>>
>> b9bb53f3836f4eb2bdeb3447be11042bd29c2408
>> sock: convert sk_peek_offset functions to WRITE_ONCE
>>
> 
> Thanks for the report, and apologies for breaking your configuration.
> I had missed that sunrpc can dequeue skbs from a udp receive
> queue and makes assumptions about the layout of those packets. rxrpc
> does the same. From what I can tell so far, those are the only two
> protocols that do this. I have verified that the following fixes rxrpc for me
> 
> --- a/net/rxrpc/ar-input.c
> +++ b/net/rxrpc/ar-input.c
> @@ -612,9 +612,9 @@ int rxrpc_extract_header(struct rxrpc_skb_priv
> *sp, struct sk_buff *skb)
>         struct rxrpc_wire_header whdr;
> 
>         /* dig out the RxRPC connection details */
> -       if (skb_copy_bits(skb, sizeof(struct udphdr), &whdr, sizeof(whdr)) < 0)
> +       if (skb_copy_bits(skb, 0, &whdr, sizeof(whdr)) < 0)
>                 return -EBADMSG;
> -       if (!pskb_pull(skb, sizeof(struct udphdr) + sizeof(whdr)))
> +       if (!pskb_pull(skb, sizeof(whdr)))
>                 BUG();
> 
> I have not yet been able to reproduce the sunrpc/nfs issue, but I
> suspect that the following might fix it. I will try to create an NFS
> setup.
> 
> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
> index 2df87f7..8ab40ba 100644
> --- a/net/sunrpc/socklib.c
> +++ b/net/sunrpc/socklib.c
> @@ -155,7 +155,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr,
> struct sk_buff *skb)
>         struct xdr_skb_reader   desc;
> 
>         desc.skb = skb;
> -       desc.offset = sizeof(struct udphdr);
> +       desc.offset = 0;
>         desc.count = skb->len - desc.offset;
> 
>         if (skb_csum_unnecessary(skb))
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 1413cdc..71d6072 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -617,7 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
>         svsk->sk_sk->sk_stamp = skb->tstamp;
>         set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); /* there may be
> more data... */
> 
> -       len  = skb->len - sizeof(struct udphdr);
> +       len  = skb->len;
>         rqstp->rq_arg.len = len;
> 
>         rqstp->rq_prot = IPPROTO_UDP;
> @@ -641,8 +641,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
>                 skb_free_datagram_locked(svsk->sk_sk, skb);
>         } else {
>                 /* we can use it in-place */
> -               rqstp->rq_arg.head[0].iov_base = skb->data +
> -                       sizeof(struct udphdr);
> +               rqstp->rq_arg.head[0].iov_base = skb->data;
>                 rqstp->rq_arg.head[0].iov_len = len;
>                 if (skb_checksum_complete(skb))
>                         goto out_free;
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 65e7595..c1fc7b2 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -995,15 +995,14 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt,
>         u32 _xid;
>         __be32 *xp;
> 
> -       repsize = skb->len - sizeof(struct udphdr);
> +       repsize = skb->len;
>         if (repsize < 4) {
>                 dprintk("RPC:       impossible RPC reply size %d!\n", repsize);
>                 return;
>         }
> 
> 
> 
>         /* Copy the XID from the skb... */
> -       xp = skb_header_pointer(skb, sizeof(struct udphdr),
> -                               sizeof(_xid), &_xid);
> +       xp = skb_header_pointer(skb, 0, sizeof(_xid), &_xid);
>         if (xp == NULL)
>                 return;
> 


Thank you for your quick response. I verified with all of the above
suggested changes that NFS works again on my 3 evms.

^ permalink raw reply

* Re: [PATCH] wlcore: spi: add wl18xx support
From: Kalle Valo @ 2016-04-07 14:32 UTC (permalink / raw)
  To: Reizer, Eyal
  Cc: Eyal Reizer, linux-wireless@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	devicetree@vger.kernel.org
In-Reply-To: <8665E2433BC68541A24DFFCA87B70F5B360B5430-1tpBd5JUCm6IQmiDNMet8wC/G2K4zDHf@public.gmane.org>

"Reizer, Eyal" <eyalr-l0cyMroinI0@public.gmane.org> writes:

>> >  static const struct of_device_id wlcore_spi_of_match_table[] = {
>> > -	{ .compatible = "ti,wl1271" },
>> > +	{ .compatible = "ti,wl1271", .data = &wl12xx_data},
>> > +	{ .compatible = "ti,wl1273", .data = &wl12xx_data},
>> > +	{ .compatible = "ti,wl1281", .data = &wl12xx_data},
>> > +	{ .compatible = "ti,wl1283", .data = &wl12xx_data},
>> > +	{ .compatible = "ti,wl1801", .data = &wl18xx_data},
>> > +	{ .compatible = "ti,wl1805", .data = &wl18xx_data},
>> > +	{ .compatible = "ti,wl1807", .data = &wl18xx_data},
>> > +	{ .compatible = "ti,wl1831", .data = &wl18xx_data},
>> > +	{ .compatible = "ti,wl1835", .data = &wl18xx_data},
>> > +	{ .compatible = "ti,wl1837", .data = &wl18xx_data},
>> >  	{ }
>> 
>> Shouldn't you also update bindings/net/wireless/ti,wlcore,spi.txt? Now it only
>> mentions about ti,wl1271 and not anything about the rest.
>
> You are right! Will be fixed in v2

Thanks. Also remember to CC devicetree list.

-- 
Kalle Valo
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v6 net-next] net: ipv4: Consider failed nexthops in multipath routes
From: David Ahern @ 2016-04-07 14:21 UTC (permalink / raw)
  To: netdev, ja; +Cc: David Ahern

Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

                     [h2]                   [h3]   15.0.0.5
                      |                      |
                     3|                     3|
                    [SP1]                  [SP2]--+
                     1  2                   1     2
                     |  |     /-------------+     |
                     |   \   /                    |
                     |     X                      |
                     |    / \                     |
                     |   /   \---------------\    |
                     1  2                     1   2
         12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                     4                         4
                      \                       /
                        \                    /
                         \                  /
                          -------|   |-----/
                                 1   2
                                [TOR3]
                                  3|
                                   |
                                  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
    15.0.0.0/16
            nexthop via 12.0.0.2  dev swp1 weight 1
            nexthop via 12.0.0.3  dev swp1 weight 1
    ...

If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1  FAILED
    ...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
---
v6
- changed __neigh_lookup_noref to __ipv4_neigh_lookup_noref per Dave's
  comment

v5
- returned comma that got lost in the ether and removed resetting of
  nhsel at end of loop - again comments from Julian

v4
- remove NULL initializer and logic for fallback per Julian's comment

v3
- Julian comments: changed use of dead in documentation to failed,
  init state to NUD_REACHABLE which simplifies fib_good_nh, use of
  nh_dev for neighbor lookup, fallback to first entry which is what
  current logic does

v2
- use rcu locking to avoid refcnts per Eric's suggestion
- only consider neighbor info for nh_scope == RT_SCOPE_LINK per Julian's
  comment
- drop the 'state == NUD_REACHABLE' from the state check since it is
  part of NUD_VALID (comment from Julian)
- wrapped the use of the neigh in a sysctl

 Documentation/networking/ip-sysctl.txt | 10 ++++++++++
 include/net/netns/ipv4.h               |  3 +++
 net/ipv4/fib_semantics.c               | 34 +++++++++++++++++++++++++++++-----
 net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
 4 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index b183e2b606c8..6c7f365b1515 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -63,6 +63,16 @@ fwmark_reflect - BOOLEAN
 	fwmark of the packet they are replying to.
 	Default: 0
 
+fib_multipath_use_neigh - BOOLEAN
+	Use status of existing neighbor entry when determining nexthop for
+	multipath routes. If disabled, neighbor information is not used and
+	packets could be directed to a failed nexthop. Only valid for kernels
+	built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+	Default: 0 (disabled)
+	Possible values:
+	0 - disabled
+	1 - enabled
+
 route/max_size - INTEGER
 	Maximum number of routes allowed in the kernel.  Increase
 	this when using large numbers of interfaces and/or routes.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a69cde3ce460..d061ffeb1e71 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -133,6 +133,9 @@ struct netns_ipv4 {
 	struct fib_rules_ops	*mr_rules_ops;
 #endif
 #endif
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+	int sysctl_fib_multipath_use_neigh;
+#endif
 	atomic_t	rt_genid;
 };
 #endif
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e8ff10..ab64d9f2eef9 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1559,21 +1559,45 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags)
 }
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
+static bool fib_good_nh(const struct fib_nh *nh)
+{
+	int state = NUD_REACHABLE;
+
+	if (nh->nh_scope == RT_SCOPE_LINK) {
+		struct neighbour *n;
+
+		rcu_read_lock_bh();
+
+		n = __ipv4_neigh_lookup_noref(nh->nh_dev, nh->nh_gw);
+		if (n)
+			state = n->nud_state;
+
+		rcu_read_unlock_bh();
+	}
+
+	return !!(state & NUD_VALID);
+}
 
 void fib_select_multipath(struct fib_result *res, int hash)
 {
 	struct fib_info *fi = res->fi;
+	struct net *net = fi->fib_net;
+	bool first = false;
 
 	for_nexthops(fi) {
 		if (hash > atomic_read(&nh->nh_upper_bound))
 			continue;
 
-		res->nh_sel = nhsel;
-		return;
+		if (!net->ipv4.sysctl_fib_multipath_use_neigh ||
+		    fib_good_nh(nh)) {
+			res->nh_sel = nhsel;
+			return;
+		}
+		if (!first) {
+			res->nh_sel = nhsel;
+			first = true;
+		}
 	} endfor_nexthops(fi);
-
-	/* Race condition: route has just become dead. */
-	res->nh_sel = 0;
 }
 #endif
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1e1fe6086dd9..bb0419582b8d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -960,6 +960,17 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+	{
+		.procname	= "fib_multipath_use_neigh",
+		.data		= &init_net.ipv4.sysctl_fib_multipath_use_neigh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{ }
 };
 
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH net-next] tcp/dccp: fix inet_reuseport_add_sock()
From: David Ahern @ 2016-04-07 14:18 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: netdev@vger.kernel.org, edumazet
In-Reply-To: <1460005654.6473.377.camel@edumazet-glaptop3.roam.corp.google.com>

On 4/6/16 11:07 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> David Ahern reported panics in __inet_hash() caused by my recent commit.
>
> The reason is inet_reuseport_add_sock() was still using
> sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
> SO_REUSEPORT enabled listeners were causing an instant crash.
>
> While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
> flag, as it is inherited from the parent at clone time.
>
> Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: David Ahern <dsa@cumulusnetworks.com>
> ---

Fixed the problem I as hiting.

Tested-by: David Ahern <dsa@cumulusnetworks.com>

^ permalink raw reply

* [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Jesper Dangaard Brouer @ 2016-04-07 14:17 UTC (permalink / raw)
  To: lsf, linux-mm
  Cc: brouer, James Bottomley, netdev@vger.kernel.org, Tom Herbert,
	Alexei Starovoitov, Brenden Blanco, lsf-pc
In-Reply-To: <1460034425.20949.7.camel@HansenPartnership.com>

(Topic proposal for MM-summit)

Network Interface Cards (NIC) drivers, and increasing speeds stress
the page-allocator (and DMA APIs).  A number of driver specific
open-coded approaches exists that work-around these bottlenecks in the
page allocator and DMA APIs. E.g. open-coded recycle mechanisms, and
allocating larger pages and handing-out page "fragments".

I'm proposing a generic page-pool recycle facility, that can cover the
driver use-cases, increase performance and open up for zero-copy RX.

The basic performance problem is that pages (containing packets at RX)
are cycled through the page allocator (freed at TX DMA completion
time).  While a system in a steady state, could avoid calling the page
allocator, when having a pool of pages equal to the size of the RX
ring plus the number of outstanding frames in the TX ring (waiting for
DMA completion).

The motivation for quick page recycling came primarily for performance
reasons.  But returning pages to the same pool also benefit other
use-cases.  If a NIC HW RX ring is strictly bound (e.g. to a process
or guest/KVM) then pages can be shared/mmap'ed (RX zero-copy) as
information leaking does not occur.  (Obviously for this use-case,
when adding pages into the pool these need to zero'ed out).

The motivation behind implemeting this (extremely fast page-pool) is
because we need it as a building block in the network stack, but
hopefully other areas could also benefit from this.

[Resources/Links]: It is specifically related to:

What Facebook calls XDP (eXpress Data Path)
 * https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
 * RFC patchset thread: http://thread.gmane.org/gmane.linux.network/406288

And what I call the "packet-page" level:
 * BoF on kernel network performance: http://lwn.net/Articles/676806/
 * http://people.netfilter.org/hawk/presentations/NetDev1.1_2016/links.html

See you soon at LFS/MM-summit :-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: optimizations to sk_buff handling in rds_tcp_data_ready
From: Eric Dumazet @ 2016-04-07 14:16 UTC (permalink / raw)
  To: Sowmini Varadhan; +Cc: alexei.starovoitov, tom, netdev
In-Reply-To: <20160407130859.GA2818@oracle.com>

On Thu, 2016-04-07 at 09:11 -0400, Sowmini Varadhan wrote:
> I was going back to Alexei's comment here:
>   http://permalink.gmane.org/gmane.linux.network/387806
> In rds-stress profiling,  we are indeed seeing the pksb_pull 
> (from rds_tcp_data_recv) show up in the perf profile.
> 
> At least for rds-tcp, we cannot re-use the skb even if
> it is not shared, because what we need to do is to carve out
> the interesting bits (start at offset, and trim it to to_copy)
> and queue up those interesting bits on the PF_RDS socket,
> while allowing tcp_data_read to go back and read the next 
> tcp segment (which may be part of the next rds dgram).
> 
> But, when  pskb_expand_head is invoked in the call-stack
>   pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta) 
> it will memcpy the offset bytes to the start of data. At least 
> for the rds_tcp_data_recv, we are not interested in being able 
> to do a *skb_push after the *skb_pull, so we dont really care 
> about the intactness of these bytes in offset.
> Thus what I am finding is that when delta > 0, if we skip the 
> following in pskb_expand_head (for the rds-tcp recv case only!)
> 
>         /* Copy only real data... and, alas, header. This should be
>          * optimized for the cases when header is void.
>          */
>         memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
> 
> and also (only for this case!) this one in __pskb_pull_tail
> 
>  if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
>                 BUG();
> 
> I am able to get a 40% improvement in the measured IOPS (req/response 
> transactions per second, using 8K byte requests, 256 byte responses,
> 16 concurrent threads), so this optimization seems worth doing.
> 
> Does my analysis above make sense? If yes, the question is, how to
> achieve this bypass in a neat way.  Clearly there are many callers of
> pskb_expand_head who will expect to find the skb_header_len bytes at
> the start of data, but we also dont want to duplicate code in these
> functions. One thought I had is to pass a flag arouund saying "caller
> doesnt care about retaining offset bytes", and use this flag
> - in __pskb_pull_tail, to avoid skb_copy_bits() above,  and to
>   pass delta to pskb_expand_head,
> - in pskb_expand_head, only do the memcpy listed above 
>   if delta <= 0
> Any other ideas for how to achieve this?

Use skb split like TCP in output path ?

Really, pskb_expand_head() is not supposed to copy payload ;)

^ permalink raw reply

* [PATCH net-next v3 2/2] tipc: stricter filtering of packets in bearer layer
From: Jon Maloy @ 2016-04-07 14:09 UTC (permalink / raw)
  To: davem; +Cc: Jon Maloy, netdev, Paul Gortmaker, tipc-discussion, ying.xue
In-Reply-To: <1460038154-26856-1-git-send-email-jon.maloy@ericsson.com>

Resetting a bearer/interface, with the consequence of resetting all its
pertaining links, is not an atomic action. This becomes particularly
evident in very large clusters, where a lot of traffic may happen on the
remaining links while we are busy shutting them down. In extreme cases,
we may even see links being re-created and re-established before we are
finished with the job.

To solve this, we now introduce a solution where we temporarily detach
the bearer from the interface when the bearer is reset. This inhibits
all packet reception, while sending still is possible. For the latter,
we use the fact that the device's user pointer now is zero to filter out
which packets can be sent during this situation; i.e., outgoing RESET
messages only.  This filtering serves to speed up the neighbors'
detection of the loss event, and saves us from unnecessary probing.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/bearer.c | 50 +++++++++++++++++++++++++++++++++-----------------
 net/tipc/msg.h    |  5 +++++
 2 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index 20566e9..6f11c62 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -337,23 +337,16 @@ static int tipc_reset_bearer(struct net *net, struct tipc_bearer *b)
  */
 static void bearer_disable(struct net *net, struct tipc_bearer *b)
 {
-	struct tipc_net *tn = net_generic(net, tipc_net_id);
-	u32 i;
+	struct tipc_net *tn = tipc_net(net);
+	int bearer_id = b->identity;
 
 	pr_info("Disabling bearer <%s>\n", b->name);
 	b->media->disable_media(b);
-
-	tipc_node_delete_links(net, b->identity);
+	tipc_node_delete_links(net, bearer_id);
 	RCU_INIT_POINTER(b->media_ptr, NULL);
 	if (b->link_req)
 		tipc_disc_delete(b->link_req);
-
-	for (i = 0; i < MAX_BEARERS; i++) {
-		if (b == rtnl_dereference(tn->bearer_list[i])) {
-			RCU_INIT_POINTER(tn->bearer_list[i], NULL);
-			break;
-		}
-	}
+	RCU_INIT_POINTER(tn->bearer_list[bearer_id], NULL);
 	kfree_rcu(b, rcu);
 }
 
@@ -396,7 +389,7 @@ void tipc_disable_l2_media(struct tipc_bearer *b)
 
 /**
  * tipc_l2_send_msg - send a TIPC packet out over an L2 interface
- * @buf: the packet to be sent
+ * @skb: the packet to be sent
  * @b: the bearer through which the packet is to be sent
  * @dest: peer destination address
  */
@@ -405,17 +398,21 @@ int tipc_l2_send_msg(struct net *net, struct sk_buff *skb,
 {
 	struct net_device *dev;
 	int delta;
+	void *tipc_ptr;
 
 	dev = (struct net_device *)rcu_dereference_rtnl(b->media_ptr);
 	if (!dev)
 		return 0;
 
+	/* Send RESET message even if bearer is detached from device */
+	tipc_ptr = rtnl_dereference(dev->tipc_ptr);
+	if (unlikely(!tipc_ptr && !msg_is_reset(buf_msg(skb))))
+		goto drop;
+
 	delta = dev->hard_header_len - skb_headroom(skb);
 	if ((delta > 0) &&
-	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) {
-		kfree_skb(skb);
-		return 0;
-	}
+	    pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC))
+		goto drop;
 
 	skb_reset_network_header(skb);
 	skb->dev = dev;
@@ -424,6 +421,9 @@ int tipc_l2_send_msg(struct net *net, struct sk_buff *skb,
 			dev->dev_addr, skb->len);
 	dev_queue_xmit(skb);
 	return 0;
+drop:
+	kfree_skb(skb);
+	return 0;
 }
 
 int tipc_bearer_mtu(struct net *net, u32 bearer_id)
@@ -549,9 +549,18 @@ static int tipc_l2_device_event(struct notifier_block *nb, unsigned long evt,
 {
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct net *net = dev_net(dev);
+	struct tipc_net *tn = tipc_net(net);
 	struct tipc_bearer *b;
+	int i;
 
 	b = rtnl_dereference(dev->tipc_ptr);
+	if (!b) {
+		for (i = 0; i < MAX_BEARERS; b = NULL, i++) {
+			b = rtnl_dereference(tn->bearer_list[i]);
+			if (b && (b->media_ptr == dev))
+				break;
+		}
+	}
 	if (!b)
 		return NOTIFY_DONE;
 
@@ -561,13 +570,20 @@ static int tipc_l2_device_event(struct notifier_block *nb, unsigned long evt,
 	case NETDEV_CHANGE:
 		if (netif_carrier_ok(dev))
 			break;
+	case NETDEV_UP:
+		rcu_assign_pointer(dev->tipc_ptr, b);
+		break;
 	case NETDEV_GOING_DOWN:
+		RCU_INIT_POINTER(dev->tipc_ptr, NULL);
+		synchronize_net();
+		tipc_reset_bearer(net, b);
+		break;
 	case NETDEV_CHANGEMTU:
 		tipc_reset_bearer(net, b);
 		break;
 	case NETDEV_CHANGEADDR:
 		b->media->raw2addr(b, &b->addr,
-				       (char *)dev->dev_addr);
+				   (char *)dev->dev_addr);
 		tipc_reset_bearer(net, b);
 		break;
 	case NETDEV_UNREGISTER:
diff --git a/net/tipc/msg.h b/net/tipc/msg.h
index 55778a0..f34f639 100644
--- a/net/tipc/msg.h
+++ b/net/tipc/msg.h
@@ -779,6 +779,11 @@ static inline bool msg_peer_node_is_up(struct tipc_msg *m)
 	return msg_redundant_link(m);
 }
 
+static inline bool msg_is_reset(struct tipc_msg *hdr)
+{
+	return (msg_user(hdr) == LINK_PROTOCOL) && (msg_type(hdr) == RESET_MSG);
+}
+
 struct sk_buff *tipc_buf_acquire(u32 size);
 bool tipc_msg_validate(struct sk_buff *skb);
 bool tipc_msg_reverse(u32 own_addr, struct sk_buff **skb, int err);
-- 
1.9.1


------------------------------------------------------------------------------

^ permalink raw reply related

* [PATCH net-next v3 1/2] tipc: eliminate buffer leak in bearer layer
From: Jon Maloy @ 2016-04-07 14:09 UTC (permalink / raw)
  To: davem; +Cc: Jon Maloy, netdev, Paul Gortmaker, tipc-discussion, ying.xue
In-Reply-To: <1460038154-26856-1-git-send-email-jon.maloy@ericsson.com>

When enabling a bearer we create a 'neigbor discoverer' instance by
calling the function tipc_disc_create() before the bearer is actually
registered in the list of enabled bearers. Because of this, the very
first discovery broadcast message, created by the mentioned function,
is lost, since it cannot find any valid bearer to use. Furthermore,
the used send function, tipc_bearer_xmit_skb() does not free the given
buffer when it cannot find a  bearer, resulting in the leak of exactly
one send buffer each time a bearer is enabled.

This commit fixes this problem by introducing two changes:

1) Instead of attemting to send the discovery message directly, we let
   tipc_disc_create() return the discovery buffer to the calling
   function, tipc_enable_bearer(), so that the latter can send it
   when the enabling sequence is finished.

2) In tipc_bearer_xmit_skb(), as well as in the two other transmit
   functions at the bearer layer, we now free the indicated buffer or
   buffer chain when a valid bearer cannot be found.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/bearer.c   | 51 ++++++++++++++++++++++++++-------------------------
 net/tipc/discover.c |  7 ++-----
 net/tipc/discover.h |  2 +-
 3 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index 27a5406..20566e9 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -205,6 +205,7 @@ static int tipc_enable_bearer(struct net *net, const char *name,
 	struct tipc_bearer *b;
 	struct tipc_media *m;
 	struct tipc_bearer_names b_names;
+	struct sk_buff *skb;
 	char addr_string[16];
 	u32 bearer_id;
 	u32 with_this_prio;
@@ -301,7 +302,7 @@ restart:
 	b->net_plane = bearer_id + 'A';
 	b->priority = priority;
 
-	res = tipc_disc_create(net, b, &b->bcast_addr);
+	res = tipc_disc_create(net, b, &b->bcast_addr, &skb);
 	if (res) {
 		bearer_disable(net, b);
 		pr_warn("Bearer <%s> rejected, discovery object creation failed\n",
@@ -310,7 +311,8 @@ restart:
 	}
 
 	rcu_assign_pointer(tn->bearer_list[bearer_id], b);
-
+	if (skb)
+		tipc_bearer_xmit_skb(net, bearer_id, skb, &b->bcast_addr);
 	pr_info("Enabled bearer <%s>, discovery domain %s, priority %u\n",
 		name,
 		tipc_addr_string_fill(addr_string, disc_domain), priority);
@@ -450,6 +452,8 @@ void tipc_bearer_xmit_skb(struct net *net, u32 bearer_id,
 	b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
 	if (likely(b))
 		b->media->send_msg(net, skb, b, dest);
+	else
+		kfree_skb(skb);
 	rcu_read_unlock();
 }
 
@@ -468,11 +472,11 @@ void tipc_bearer_xmit(struct net *net, u32 bearer_id,
 
 	rcu_read_lock();
 	b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
-	if (likely(b)) {
-		skb_queue_walk_safe(xmitq, skb, tmp) {
-			__skb_dequeue(xmitq);
-			b->media->send_msg(net, skb, b, dst);
-		}
+	if (unlikely(!b))
+		__skb_queue_purge(xmitq);
+	skb_queue_walk_safe(xmitq, skb, tmp) {
+		__skb_dequeue(xmitq);
+		b->media->send_msg(net, skb, b, dst);
 	}
 	rcu_read_unlock();
 }
@@ -490,14 +494,14 @@ void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id,
 
 	rcu_read_lock();
 	b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
-	if (likely(b)) {
-		skb_queue_walk_safe(xmitq, skb, tmp) {
-			hdr = buf_msg(skb);
-			msg_set_non_seq(hdr, 1);
-			msg_set_mc_netid(hdr, net_id);
-			__skb_dequeue(xmitq);
-			b->media->send_msg(net, skb, b, &b->bcast_addr);
-		}
+	if (unlikely(!b))
+		__skb_queue_purge(xmitq);
+	skb_queue_walk_safe(xmitq, skb, tmp) {
+		hdr = buf_msg(skb);
+		msg_set_non_seq(hdr, 1);
+		msg_set_mc_netid(hdr, net_id);
+		__skb_dequeue(xmitq);
+		b->media->send_msg(net, skb, b, &b->bcast_addr);
 	}
 	rcu_read_unlock();
 }
@@ -513,24 +517,21 @@ void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id,
  * ignores packets sent using interface multicast, and traffic sent to other
  * nodes (which can happen if interface is running in promiscuous mode).
  */
-static int tipc_l2_rcv_msg(struct sk_buff *buf, struct net_device *dev,
+static int tipc_l2_rcv_msg(struct sk_buff *skb, struct net_device *dev,
 			   struct packet_type *pt, struct net_device *orig_dev)
 {
 	struct tipc_bearer *b;
 
 	rcu_read_lock();
 	b = rcu_dereference_rtnl(dev->tipc_ptr);
-	if (likely(b)) {
-		if (likely(buf->pkt_type <= PACKET_BROADCAST)) {
-			buf->next = NULL;
-			tipc_rcv(dev_net(dev), buf, b);
-			rcu_read_unlock();
-			return NET_RX_SUCCESS;
-		}
+	if (likely(b && (skb->pkt_type <= PACKET_BROADCAST))) {
+		skb->next = NULL;
+		tipc_rcv(dev_net(dev), skb, b);
+		rcu_read_unlock();
+		return NET_RX_SUCCESS;
 	}
 	rcu_read_unlock();
-
-	kfree_skb(buf);
+	kfree_skb(skb);
 	return NET_RX_DROP;
 }
 
diff --git a/net/tipc/discover.c b/net/tipc/discover.c
index f1e738e..ad9d477 100644
--- a/net/tipc/discover.c
+++ b/net/tipc/discover.c
@@ -268,10 +268,9 @@ exit:
  * Returns 0 if successful, otherwise -errno.
  */
 int tipc_disc_create(struct net *net, struct tipc_bearer *b,
-		     struct tipc_media_addr *dest)
+		     struct tipc_media_addr *dest, struct sk_buff **skb)
 {
 	struct tipc_link_req *req;
-	struct sk_buff *skb;
 
 	req = kmalloc(sizeof(*req), GFP_ATOMIC);
 	if (!req)
@@ -293,9 +292,7 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b,
 	setup_timer(&req->timer, disc_timeout, (unsigned long)req);
 	mod_timer(&req->timer, jiffies + req->timer_intv);
 	b->link_req = req;
-	skb = skb_clone(req->buf, GFP_ATOMIC);
-	if (skb)
-		tipc_bearer_xmit_skb(net, req->bearer_id, skb, &req->dest);
+	*skb = skb_clone(req->buf, GFP_ATOMIC);
 	return 0;
 }
 
diff --git a/net/tipc/discover.h b/net/tipc/discover.h
index c9b1277..b80a335 100644
--- a/net/tipc/discover.h
+++ b/net/tipc/discover.h
@@ -40,7 +40,7 @@
 struct tipc_link_req;
 
 int tipc_disc_create(struct net *net, struct tipc_bearer *b_ptr,
-		     struct tipc_media_addr *dest);
+		     struct tipc_media_addr *dest, struct sk_buff **skb);
 void tipc_disc_delete(struct tipc_link_req *req);
 void tipc_disc_reset(struct net *net, struct tipc_bearer *b_ptr);
 void tipc_disc_add_dest(struct tipc_link_req *req);
-- 
1.9.1


------------------------------------------------------------------------------

^ permalink raw reply related

* [PATCH net-next v3 0/2]  tipc: some small fixes
From: Jon Maloy @ 2016-04-07 14:09 UTC (permalink / raw)
  To: davem; +Cc: Jon Maloy, netdev, Paul Gortmaker, tipc-discussion, ying.xue

When fix a minor buffer leak, and ensure that bearers filter packets
correctly while they are being shut down.

v2: Corrected typos in commit #3, as per feedback from S. Shtylyov
v3: Removed commit #3 from the series. Improved version will be 
    re-submitted later.

Jon Maloy (2):
  tipc: eliminate buffer leak in bearer layer
  tipc: stricter filtering of packets in bearer layer

 net/tipc/bearer.c   | 101 ++++++++++++++++++++++++++++++----------------------
 net/tipc/discover.c |   7 ++--
 net/tipc/discover.h |   2 +-
 net/tipc/msg.h      |   5 +++
 4 files changed, 67 insertions(+), 48 deletions(-)

-- 
1.9.1


------------------------------------------------------------------------------

^ permalink raw reply

* Re: [PATCH] net: mark DECnet as broken
From: One Thousand Gnomes @ 2016-04-07 14:01 UTC (permalink / raw)
  To: Vegard Nossum
  Cc: David Miller, netdev, linux-kernel, Eric Dumazet, Sasha Levin
In-Reply-To: <1460013763-22985-1-git-send-email-vegard.nossum@oracle.com>

On Thu,  7 Apr 2016 09:22:43 +0200
Vegard Nossum <vegard.nossum@oracle.com> wrote:

> There are NULL pointer dereference bugs in DECnet which can be triggered
> by unprivileged users and have been reported multiple times to LKML,
> however nobody seems confident enough in the proposed fixes to merge them
> and the consensus seems to be that nobody cares enough about DECnet to
> see it fixed anyway.
> 
> To shield unsuspecting users from the possible DOS, we should mark this
> BROKEN until somebody who actually uses this code can fix it.

How about consigning it to staging at this point ?

Alan

^ permalink raw reply

* Re: [PATCH v2 2/2] sctp: delay calls to sk_data_ready() as much as possible
From: Marcelo Ricardo Leitner @ 2016-04-07 13:35 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: netdev, Neil Horman, Vlad Yasevich, linux-sctp
In-Reply-To: <87egah3ktv.fsf@beetle.home>

On Thu, Apr 07, 2016 at 10:05:32AM +0200, Jakub Sitnicki wrote:
> On Wed, Apr 06, 2016 at 07:53 PM CEST, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> > Currently, the processing of multiple chunks in a single SCTP packet
> > leads to multiple calls to sk_data_ready, causing multiple wake up
> > signals which are costly and doesn't make it wake up any faster.
> >
> > With this patch it will notice that the wake up is pending and will do it
> > before leaving the state machine interpreter, latest place possible to
> > do it realiably and cleanly.
> >
> > Note that sk_data_ready events are not dependent on asocs, unlike waking
> > up writers.
> >
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> 
> [...]
> 
> > diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> > index 7fe56d0acabf66cfd8fe29dfdb45f7620b470ac7..e7042f9ce63b0cfca50cae252f51b60b68cb5731 100644
> > --- a/net/sctp/sm_sideeffect.c
> > +++ b/net/sctp/sm_sideeffect.c
> > @@ -1742,6 +1742,11 @@ out:
> >  			error = sctp_outq_uncork(&asoc->outqueue, gfp);
> >  	} else if (local_cork)
> >  		error = sctp_outq_uncork(&asoc->outqueue, gfp);
> > +
> > +	if (sctp_sk(ep->base.sk)->pending_data_ready) {
> > +		ep->base.sk->sk_data_ready(ep->base.sk);
> > +		sctp_sk(ep->base.sk)->pending_data_ready = 0;
> > +	}
> >  	return error;
> >  nomem:
> >  	error = -ENOMEM;
> 
> Would it make sense to introduce a local variable for ep->base.sk (and
> make this function 535+1 lines long ;-)
> 
>       struct sock *sk = ep->base.sk;
> 
> ... like sctp_ulpq_tail_event() does?

I guess so, yes. Same for sctp_sk() cast then. I´ll post a new version
later, thanks.

  Marcelo

^ permalink raw reply

* optimizations to sk_buff handling in rds_tcp_data_ready
From: Sowmini Varadhan @ 2016-04-07 13:11 UTC (permalink / raw)
  To: alexei.starovoitov, tom, netdev, sowmini.varadhan

I was going back to Alexei's comment here:
  http://permalink.gmane.org/gmane.linux.network/387806
In rds-stress profiling,  we are indeed seeing the pksb_pull 
(from rds_tcp_data_recv) show up in the perf profile.

At least for rds-tcp, we cannot re-use the skb even if
it is not shared, because what we need to do is to carve out
the interesting bits (start at offset, and trim it to to_copy)
and queue up those interesting bits on the PF_RDS socket,
while allowing tcp_data_read to go back and read the next 
tcp segment (which may be part of the next rds dgram).

But, when  pskb_expand_head is invoked in the call-stack
  pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta) 
it will memcpy the offset bytes to the start of data. At least 
for the rds_tcp_data_recv, we are not interested in being able 
to do a *skb_push after the *skb_pull, so we dont really care 
about the intactness of these bytes in offset.
Thus what I am finding is that when delta > 0, if we skip the 
following in pskb_expand_head (for the rds-tcp recv case only!)

        /* Copy only real data... and, alas, header. This should be
         * optimized for the cases when header is void.
         */
        memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);

and also (only for this case!) this one in __pskb_pull_tail

 if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
                BUG();

I am able to get a 40% improvement in the measured IOPS (req/response 
transactions per second, using 8K byte requests, 256 byte responses,
16 concurrent threads), so this optimization seems worth doing.

Does my analysis above make sense? If yes, the question is, how to
achieve this bypass in a neat way.  Clearly there are many callers of
pskb_expand_head who will expect to find the skb_header_len bytes at
the start of data, but we also dont want to duplicate code in these
functions. One thought I had is to pass a flag arouund saying "caller
doesnt care about retaining offset bytes", and use this flag
- in __pskb_pull_tail, to avoid skb_copy_bits() above,  and to
  pass delta to pskb_expand_head,
- in pskb_expand_head, only do the memcpy listed above 
  if delta <= 0
Any other ideas for how to achieve this?

--Sowmini

^ permalink raw reply

* Re: [PATCH net-next 5/7] sctp: reuse the some transport traversal functions in proc
From: Neil Horman @ 2016-04-07 13:09 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Vlad Yasevich,
	daniel, davem
In-Reply-To: <cdc77cc7b894aab62dad372f0018c571e33cf16e.1459829123.git.lucien.xin@gmail.com>

On Tue, Apr 05, 2016 at 12:06:30PM +0800, Xin Long wrote:
> There are some transport traversal functions for sctp_diag, we can also
> use it for sctp_proc. cause they have the similar situation to traversal
> transport.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/sctp/proc.c | 80 +++++++++++++--------------------------------------------
>  1 file changed, 18 insertions(+), 62 deletions(-)
> 
> diff --git a/net/sctp/proc.c b/net/sctp/proc.c
> index 5cfac8d..dd8492f 100644
> --- a/net/sctp/proc.c
> +++ b/net/sctp/proc.c
> @@ -282,80 +282,31 @@ struct sctp_ht_iter {
>  	struct rhashtable_iter hti;
>  };
>  
> -static struct sctp_transport *sctp_transport_get_next(struct seq_file *seq)
> -{
> -	struct sctp_ht_iter *iter = seq->private;
> -	struct sctp_transport *t;
> -
> -	t = rhashtable_walk_next(&iter->hti);
> -	for (; t; t = rhashtable_walk_next(&iter->hti)) {
> -		if (IS_ERR(t)) {
> -			if (PTR_ERR(t) == -EAGAIN)
> -				continue;
> -			break;
> -		}
> -
> -		if (net_eq(sock_net(t->asoc->base.sk), seq_file_net(seq)) &&
> -		    t->asoc->peer.primary_path == t)
> -			break;
> -	}
> -
> -	return t;
> -}
> -

this may just be a nit, but you defined the new sctp_transport_get_next in patch
2 of this series, and didn't remove this private version until here.  Is that
going to cause some behavioral issue, if someone builds a kernel between patch 2
and 7?  Seems like perhaps those two patches should be merged.

Neil

^ permalink raw reply

* [PATCH v7 net-next 1/1] hv_sock: introduce Hyper-V Sockets
From: Dexuan Cui @ 2016-04-07 12:50 UTC (permalink / raw)
  To: gregkh, davem, netdev, linux-kernel, devel, olaf, apw, jasowang,
	kys, haiyangz

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by introducing
a new socket address family AF_HYPERV.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 MAINTAINERS                 |    2 +
 include/linux/hyperv.h      |   16 +
 include/linux/socket.h      |    5 +-
 include/net/af_hvsock.h     |   51 ++
 include/uapi/linux/hyperv.h |   16 +
 net/Kconfig                 |    1 +
 net/Makefile                |    1 +
 net/hv_sock/Kconfig         |   10 +
 net/hv_sock/Makefile        |    3 +
 net/hv_sock/af_hvsock.c     | 1481 +++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 1584 insertions(+), 2 deletions(-)
 create mode 100644 include/net/af_hvsock.h
 create mode 100644 net/hv_sock/Kconfig
 create mode 100644 net/hv_sock/Makefile
 create mode 100644 net/hv_sock/af_hvsock.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 67d99dd..7b6f203 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5267,7 +5267,9 @@ F:	drivers/pci/host/pci-hyperv.c
 F:	drivers/net/hyperv/
 F:	drivers/scsi/storvsc_drv.c
 F:	drivers/video/fbdev/hyperv_fb.c
+F:	net/hv_sock/
 F:	include/linux/hyperv.h
+F:	include/net/af_hvsock.h
 F:	tools/hv/
 F:	Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index aa0fadc..b92439d 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1338,4 +1338,20 @@ extern __u32 vmbus_proto_version;
 
 int vmbus_send_tl_connect_request(const uuid_le *shv_guest_servie_id,
 				  const uuid_le *shv_host_servie_id);
+struct vmpipe_proto_header {
+	u32 pkt_type;
+	u32 data_size;
+} __packed;
+
+#define HVSOCK_HEADER_LEN	(sizeof(struct vmpacket_descriptor) + \
+				 sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN	(sizeof(u64))
+
+#define HVSOCK_PKT_LEN(payload_len)	(HVSOCK_HEADER_LEN + \
+					ALIGN((payload_len), 8) + \
+					PREV_INDICES_LEN)
+#define HVSOCK_MIN_PKT_LEN	HVSOCK_PKT_LEN(1)
+
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 73bf6c6..88b1ccd 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -201,8 +201,8 @@ struct ucred {
 #define AF_NFC		39	/* NFC sockets			*/
 #define AF_VSOCK	40	/* vSockets			*/
 #define AF_KCM		41	/* Kernel Connection Multiplexor*/
-
-#define AF_MAX		42	/* For now.. */
+#define AF_HYPERV	42	/* Hyper-V Sockets		*/
+#define AF_MAX		43	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -249,6 +249,7 @@ struct ucred {
 #define PF_NFC		AF_NFC
 #define PF_VSOCK	AF_VSOCK
 #define PF_KCM		AF_KCM
+#define PF_HYPERV	AF_HYPERV
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
diff --git a/include/net/af_hvsock.h b/include/net/af_hvsock.h
new file mode 100644
index 0000000..a5aa28d
--- /dev/null
+++ b/include/net/af_hvsock.h
@@ -0,0 +1,51 @@
+#ifndef __AF_HVSOCK_H__
+#define __AF_HVSOCK_H__
+
+#include <linux/kernel.h>
+#include <linux/hyperv.h>
+#include <net/sock.h>
+
+#define VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV (5 * PAGE_SIZE)
+#define VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND (5 * PAGE_SIZE)
+
+#define HVSOCK_RCV_BUF_SZ	VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV
+#define HVSOCK_SND_BUF_SZ	PAGE_SIZE
+
+#define sk_to_hvsock(__sk)    ((struct hvsock_sock *)(__sk))
+#define hvsock_to_sk(__hvsk)   ((struct sock *)(__hvsk))
+
+struct hvsock_sock {
+	/* sk must be the first member. */
+	struct sock sk;
+
+	struct sockaddr_hv local_addr;
+	struct sockaddr_hv remote_addr;
+
+	/* protected by the global hvsock_mutex */
+	struct list_head bound_list;
+	struct list_head connected_list;
+
+	struct list_head accept_queue;
+	/* used by enqueue and dequeue */
+	struct mutex accept_queue_mutex;
+
+	struct delayed_work dwork;
+
+	u32 peer_shutdown;
+
+	struct vmbus_channel *channel;
+
+	struct {
+		struct vmpipe_proto_header hdr;
+		char buf[HVSOCK_SND_BUF_SZ];
+	} __packed send;
+
+	struct {
+		struct vmpipe_proto_header hdr;
+		char buf[HVSOCK_RCV_BUF_SZ];
+		unsigned int data_len;
+		unsigned int data_offset;
+	} __packed recv;
+};
+
+#endif /* __AF_HVSOCK_H__ */
diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
index e347b24..160b614 100644
--- a/include/uapi/linux/hyperv.h
+++ b/include/uapi/linux/hyperv.h
@@ -26,6 +26,7 @@
 #define _UAPI_HYPERV_H
 
 #include <linux/uuid.h>
+#include <linux/socket.h>
 
 /*
  * Framework version for util services.
@@ -396,4 +397,19 @@ struct hv_kvp_ip_msg {
 	struct hv_kvp_ipaddr_value      kvp_ip_val;
 } __attribute__((packed));
 
+/* This is the address fromat of Hyper-V Sockets. */
+struct sockaddr_hv {
+	__kernel_sa_family_t	shv_family;  /* Address family		*/
+	__le16		reserved;	     /* Must be Zero		*/
+	uuid_le		shv_vm_id;	     /* Not used. Must be Zero. */
+	uuid_le		shv_service_id;	     /* Service ID		*/
+};
+
+#define SHV_VMID_GUEST	NULL_UUID_LE
+#define SHV_VMID_HOST	NULL_UUID_LE
+
+#define SHV_SERVICE_ID_ANY	NULL_UUID_LE
+
+#define SHV_PROTO_RAW		1
+
 #endif /* _UAPI_HYPERV_H */
diff --git a/net/Kconfig b/net/Kconfig
index a8934d8..68d13d7 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -231,6 +231,7 @@ source "net/dns_resolver/Kconfig"
 source "net/batman-adv/Kconfig"
 source "net/openvswitch/Kconfig"
 source "net/vmw_vsock/Kconfig"
+source "net/hv_sock/Kconfig"
 source "net/netlink/Kconfig"
 source "net/mpls/Kconfig"
 source "net/hsr/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 81d1411..d115c31 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_BATMAN_ADV)	+= batman-adv/
 obj-$(CONFIG_NFC)		+= nfc/
 obj-$(CONFIG_OPENVSWITCH)	+= openvswitch/
 obj-$(CONFIG_VSOCKETS)	+= vmw_vsock/
+obj-$(CONFIG_HYPERV_SOCK)	+= hv_sock/
 obj-$(CONFIG_MPLS)		+= mpls/
 obj-$(CONFIG_HSR)		+= hsr/
 ifneq ($(CONFIG_NET_SWITCHDEV),)
diff --git a/net/hv_sock/Kconfig b/net/hv_sock/Kconfig
new file mode 100644
index 0000000..1f41848
--- /dev/null
+++ b/net/hv_sock/Kconfig
@@ -0,0 +1,10 @@
+config HYPERV_SOCK
+	tristate "Hyper-V Sockets"
+	depends on HYPERV
+	default m if HYPERV
+	help
+	  Hyper-V Sockets is somewhat like TCP over VMBus, allowing
+	  communication between Linux guest and Hyper-V host without TCP/IP.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called hv_sock.
diff --git a/net/hv_sock/Makefile b/net/hv_sock/Makefile
new file mode 100644
index 0000000..716c012
--- /dev/null
+++ b/net/hv_sock/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_HYPERV_SOCK) += hv_sock.o
+
+hv_sock-y += af_hvsock.o
diff --git a/net/hv_sock/af_hvsock.c b/net/hv_sock/af_hvsock.c
new file mode 100644
index 0000000..55ee98c
--- /dev/null
+++ b/net/hv_sock/af_hvsock.c
@@ -0,0 +1,1481 @@
+/*
+ * Hyper-V Sockets -- a socket-based communication channel between the
+ * Hyper-V host and the virtual machines running on it.
+ *
+ * Copyright(c) 2016, Microsoft Corporation. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. The name of the author may not be used to endorse or promote
+ *    products derived from this software without specific prior written
+ *    permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+ * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <net/af_hvsock.h>
+
+static struct proto hvsock_proto = {
+	.name = "HV_SOCK",
+	.owner = THIS_MODULE,
+	.obj_size = sizeof(struct hvsock_sock),
+};
+
+#define SS_LISTEN 255
+
+static LIST_HEAD(hvsock_bound_list);
+static LIST_HEAD(hvsock_connected_list);
+static DEFINE_MUTEX(hvsock_mutex);
+
+static bool uuid_equals(uuid_le u1, uuid_le u2)
+{
+	return !uuid_le_cmp(u1, u2);
+}
+
+/* NOTE: hvsock_mutex must be held when the below helper functions, whose
+ * names begin with __ hvsock, are invoked.
+ */
+static void __hvsock_insert_bound(struct list_head *list,
+				  struct hvsock_sock *hvsk)
+{
+	sock_hold(&hvsk->sk);
+	list_add(&hvsk->bound_list, list);
+}
+
+static void __hvsock_insert_connected(struct list_head *list,
+				      struct hvsock_sock *hvsk)
+{
+	sock_hold(&hvsk->sk);
+	list_add(&hvsk->connected_list, list);
+}
+
+static void __hvsock_remove_bound(struct hvsock_sock *hvsk)
+{
+	list_del_init(&hvsk->bound_list);
+	sock_put(&hvsk->sk);
+}
+
+static void __hvsock_remove_connected(struct hvsock_sock *hvsk)
+{
+	list_del_init(&hvsk->connected_list);
+	sock_put(&hvsk->sk);
+}
+
+static struct sock *__hvsock_find_bound_socket(const struct sockaddr_hv *addr)
+{
+	struct hvsock_sock *hvsk;
+
+	list_for_each_entry(hvsk, &hvsock_bound_list, bound_list)
+		if (uuid_equals(addr->shv_service_id,
+				hvsk->local_addr.shv_service_id))
+			return hvsock_to_sk(hvsk);
+	return NULL;
+}
+
+static struct sock *__hvsock_find_connected_socket_by_channel(
+	const struct vmbus_channel *channel)
+{
+	struct hvsock_sock *hvsk;
+
+	list_for_each_entry(hvsk, &hvsock_connected_list, connected_list)
+		if (hvsk->channel == channel)
+			return hvsock_to_sk(hvsk);
+	return NULL;
+}
+
+static bool __hvsock_in_bound_list(struct hvsock_sock *hvsk)
+{
+	return !list_empty(&hvsk->bound_list);
+}
+
+static bool __hvsock_in_connected_list(struct hvsock_sock *hvsk)
+{
+	return !list_empty(&hvsk->connected_list);
+}
+
+static void hvsock_insert_connected(struct hvsock_sock *hvsk)
+{
+	__hvsock_insert_connected(&hvsock_connected_list, hvsk);
+}
+
+static
+void hvsock_enqueue_accept(struct sock *listener, struct sock *connected)
+{
+	struct hvsock_sock *hvlistener;
+	struct hvsock_sock *hvconnected;
+
+	hvlistener = sk_to_hvsock(listener);
+	hvconnected = sk_to_hvsock(connected);
+
+	sock_hold(connected);
+	sock_hold(listener);
+
+	mutex_lock(&hvlistener->accept_queue_mutex);
+	list_add_tail(&hvconnected->accept_queue, &hvlistener->accept_queue);
+	listener->sk_ack_backlog++;
+	mutex_unlock(&hvlistener->accept_queue_mutex);
+}
+
+static struct sock *hvsock_dequeue_accept(struct sock *listener)
+{
+	struct hvsock_sock *hvlistener;
+	struct hvsock_sock *hvconnected;
+
+	hvlistener = sk_to_hvsock(listener);
+
+	mutex_lock(&hvlistener->accept_queue_mutex);
+
+	if (list_empty(&hvlistener->accept_queue)) {
+		mutex_unlock(&hvlistener->accept_queue_mutex);
+		return NULL;
+	}
+
+	hvconnected = list_entry(hvlistener->accept_queue.next,
+				 struct hvsock_sock, accept_queue);
+
+	list_del_init(&hvconnected->accept_queue);
+	listener->sk_ack_backlog--;
+
+	mutex_unlock(&hvlistener->accept_queue_mutex);
+
+	sock_put(listener);
+	/* The caller will need a reference on the connected socket so we let
+	 * it call sock_put().
+	 */
+
+	return hvsock_to_sk(hvconnected);
+}
+
+static bool hvsock_is_accept_queue_empty(struct sock *sk)
+{
+	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
+	int ret;
+
+	mutex_lock(&hvsk->accept_queue_mutex);
+	ret = list_empty(&hvsk->accept_queue);
+	mutex_unlock(&hvsk->accept_queue_mutex);
+
+	return ret;
+}
+
+static void hvsock_addr_init(struct sockaddr_hv *addr, uuid_le service_id)
+{
+	memset(addr, 0, sizeof(*addr));
+	addr->shv_family = AF_HYPERV;
+	addr->shv_service_id = service_id;
+}
+
+static int hvsock_addr_validate(const struct sockaddr_hv *addr)
+{
+	if (!addr)
+		return -EFAULT;
+
+	if (addr->shv_family != AF_HYPERV)
+		return -EAFNOSUPPORT;
+
+	if (addr->reserved != 0)
+		return -EINVAL;
+
+	if (!uuid_equals(addr->shv_vm_id, NULL_UUID_LE))
+		return -EINVAL;
+
+	return 0;
+}
+
+static bool hvsock_addr_bound(const struct sockaddr_hv *addr)
+{
+	return !uuid_equals(addr->shv_service_id, SHV_SERVICE_ID_ANY);
+}
+
+static int hvsock_addr_cast(const struct sockaddr *addr, size_t len,
+			    struct sockaddr_hv **out_addr)
+{
+	if (len < sizeof(**out_addr))
+		return -EFAULT;
+
+	*out_addr = (struct sockaddr_hv *)addr;
+	return hvsock_addr_validate(*out_addr);
+}
+
+static int __hvsock_do_bind(struct hvsock_sock *hvsk,
+			    struct sockaddr_hv *addr)
+{
+	struct sockaddr_hv hv_addr;
+	int ret = 0;
+
+	hvsock_addr_init(&hv_addr, addr->shv_service_id);
+
+	mutex_lock(&hvsock_mutex);
+
+	if (uuid_equals(addr->shv_service_id, SHV_SERVICE_ID_ANY)) {
+		do {
+			uuid_le_gen(&hv_addr.shv_service_id);
+		} while (__hvsock_find_bound_socket(&hv_addr));
+	} else {
+		if (__hvsock_find_bound_socket(&hv_addr)) {
+			ret = -EADDRINUSE;
+			goto out;
+		}
+	}
+
+	hvsock_addr_init(&hvsk->local_addr, hv_addr.shv_service_id);
+	__hvsock_insert_bound(&hvsock_bound_list, hvsk);
+
+out:
+	mutex_unlock(&hvsock_mutex);
+
+	return ret;
+}
+
+static int __hvsock_bind(struct sock *sk, struct sockaddr_hv *addr)
+{
+	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
+	int ret;
+
+	if (hvsock_addr_bound(&hvsk->local_addr))
+		return -EINVAL;
+
+	switch (sk->sk_socket->type) {
+	case SOCK_STREAM:
+		ret = __hvsock_do_bind(hvsk, addr);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+/* Autobind this socket to the local address if necessary. */
+static int hvsock_auto_bind(struct hvsock_sock *hvsk)
+{
+	struct sock *sk = hvsock_to_sk(hvsk);
+	struct sockaddr_hv local_addr;
+
+	if (hvsock_addr_bound(&hvsk->local_addr))
+		return 0;
+	hvsock_addr_init(&local_addr, SHV_SERVICE_ID_ANY);
+	return __hvsock_bind(sk, &local_addr);
+}
+
+static void hvsock_sk_destruct(struct sock *sk)
+{
+	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
+	struct vmbus_channel *channel = hvsk->channel;
+
+	if (!channel)
+		return;
+
+	vmbus_hvsock_device_unregister(channel);
+}
+
+static void __hvsock_release(struct sock *sk)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *pending;
+
+	hvsk = sk_to_hvsock(sk);
+
+	mutex_lock(&hvsock_mutex);
+	if (__hvsock_in_bound_list(hvsk))
+		__hvsock_remove_bound(hvsk);
+
+	if (__hvsock_in_connected_list(hvsk))
+		__hvsock_remove_connected(hvsk);
+	mutex_unlock(&hvsock_mutex);
+
+	lock_sock(sk);
+	sock_orphan(sk);
+	sk->sk_shutdown = SHUTDOWN_MASK;
+
+	/* Clean up any sockets that never were accepted. */
+	while ((pending = hvsock_dequeue_accept(sk)) != NULL) {
+		__hvsock_release(pending);
+		sock_put(pending);
+	}
+
+	release_sock(sk);
+	sock_put(sk);
+}
+
+static int hvsock_release(struct socket *sock)
+{
+	/* If accept() is interrupted by a signal, the temporary socket
+	 * struct's sock->sk is NULL.
+	 */
+	if (sock->sk) {
+		__hvsock_release(sock->sk);
+		sock->sk = NULL;
+	}
+
+	sock->state = SS_FREE;
+	return 0;
+}
+
+static struct sock *__hvsock_create(struct net *net, struct socket *sock,
+				    gfp_t priority, unsigned short type)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	sk = sk_alloc(net, AF_HYPERV, priority, &hvsock_proto, 0);
+	if (!sk)
+		return NULL;
+
+	sock_init_data(sock, sk);
+
+	/* sk->sk_type is normally set in sock_init_data, but only if sock is
+	 * non-NULL. We make sure that our sockets always have a type by
+	 * setting it here if needed.
+	 */
+	if (!sock)
+		sk->sk_type = type;
+
+	hvsk = sk_to_hvsock(sk);
+	hvsock_addr_init(&hvsk->local_addr, SHV_SERVICE_ID_ANY);
+	hvsock_addr_init(&hvsk->remote_addr, SHV_SERVICE_ID_ANY);
+
+	sk->sk_destruct = hvsock_sk_destruct;
+
+	/* Looks stream-based socket doesn't need this. */
+	sk->sk_backlog_rcv = NULL;
+
+	sk->sk_state = 0;
+	sock_reset_flag(sk, SOCK_DONE);
+
+	INIT_LIST_HEAD(&hvsk->bound_list);
+	INIT_LIST_HEAD(&hvsk->connected_list);
+
+	INIT_LIST_HEAD(&hvsk->accept_queue);
+	mutex_init(&hvsk->accept_queue_mutex);
+
+	hvsk->peer_shutdown = 0;
+
+	hvsk->recv.data_len = 0;
+	hvsk->recv.data_offset = 0;
+
+	return sk;
+}
+
+static int hvsock_bind(struct socket *sock, struct sockaddr *addr,
+		       int addr_len)
+{
+	struct sockaddr_hv *hv_addr;
+	struct sock *sk;
+	int ret;
+
+	sk = sock->sk;
+
+	if (hvsock_addr_cast(addr, addr_len, &hv_addr) != 0)
+		return -EINVAL;
+
+	lock_sock(sk);
+	ret = __hvsock_bind(sk, hv_addr);
+	release_sock(sk);
+
+	return ret;
+}
+
+static int hvsock_getname(struct socket *sock,
+			  struct sockaddr *addr, int *addr_len, int peer)
+{
+	struct sockaddr_hv *hv_addr;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+	int ret;
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+	ret = 0;
+
+	lock_sock(sk);
+
+	if (peer) {
+		if (sock->state != SS_CONNECTED) {
+			ret = -ENOTCONN;
+			goto out;
+		}
+		hv_addr = &hvsk->remote_addr;
+	} else {
+		hv_addr = &hvsk->local_addr;
+	}
+
+	__sockaddr_check_size(sizeof(*hv_addr));
+
+	memcpy(addr, hv_addr, sizeof(*hv_addr));
+	*addr_len = sizeof(*hv_addr);
+
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static int hvsock_shutdown(struct socket *sock, int mode)
+{
+	struct sock *sk;
+
+	if (mode < SHUT_RD || mode > SHUT_RDWR)
+		return -EINVAL;
+	/* This maps:
+	 * SHUT_RD   (0) -> RCV_SHUTDOWN  (1)
+	 * SHUT_WR   (1) -> SEND_SHUTDOWN (2)
+	 * SHUT_RDWR (2) -> SHUTDOWN_MASK (3)
+	 */
+	++mode;
+
+	if (sock->state == SS_UNCONNECTED)
+		return -ENOTCONN;
+
+	sock->state = SS_DISCONNECTING;
+
+	sk = sock->sk;
+
+	lock_sock(sk);
+
+	sk->sk_shutdown |= mode;
+	sk->sk_state_change(sk);
+
+	/* TODO: how to send a FIN if we haven't done that? */
+	if (mode & SEND_SHUTDOWN)
+		;
+
+	release_sock(sk);
+
+	return 0;
+}
+
+static void get_ringbuffer_rw_status(struct vmbus_channel *channel,
+				     bool *can_read, bool *can_write)
+{
+	u32 avl_read_bytes, avl_write_bytes, dummy;
+
+	if (can_read) {
+		hv_get_ringbuffer_availbytes(&channel->inbound,
+					     &avl_read_bytes,
+					     &dummy);
+		*can_read = avl_read_bytes >= HVSOCK_MIN_PKT_LEN;
+	}
+
+	/* We write into the ringbuffer only when we're able to write a
+	 * a payload of 4096 bytes (the actual written payload's length may be
+	 * less than 4096).
+	 */
+	if (can_write) {
+		hv_get_ringbuffer_availbytes(&channel->outbound,
+					     &dummy,
+					     &avl_write_bytes);
+		*can_write = avl_write_bytes > HVSOCK_PKT_LEN(PAGE_SIZE);
+	}
+}
+
+static unsigned int hvsock_poll(struct file *file, struct socket *sock,
+				poll_table *wait)
+{
+	struct vmbus_channel *channel;
+	bool can_read, can_write;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+	unsigned int mask;
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+
+	poll_wait(file, sk_sleep(sk), wait);
+	mask = 0;
+
+	if (sk->sk_err)
+		/* Signify that there has been an error on this socket. */
+		mask |= POLLERR;
+
+	/* INET sockets treat local write shutdown and peer write shutdown as a
+	 * case of POLLHUP set.
+	 */
+	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
+	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
+	     (hvsk->peer_shutdown & SEND_SHUTDOWN))) {
+		mask |= POLLHUP;
+	}
+
+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
+	    hvsk->peer_shutdown & SEND_SHUTDOWN) {
+		mask |= POLLRDHUP;
+	}
+
+	lock_sock(sk);
+
+	/* Listening sockets that have connections in their accept
+	 * queue can be read.
+	 */
+	if (sk->sk_state == SS_LISTEN && !hvsock_is_accept_queue_empty(sk))
+		mask |= POLLIN | POLLRDNORM;
+
+	/* The mutex is to against hvsock_open_connection() */
+	mutex_lock(&hvsock_mutex);
+
+	channel = hvsk->channel;
+	if (channel) {
+		/* If there is something in the queue then we can read */
+		get_ringbuffer_rw_status(channel, &can_read, &can_write);
+
+		if (!can_read && hvsk->recv.data_len > 0)
+			can_read = true;
+
+		if (!(sk->sk_shutdown & RCV_SHUTDOWN) && can_read)
+			mask |= POLLIN | POLLRDNORM;
+	} else {
+		can_read = false;
+		can_write = false;
+	}
+
+	mutex_unlock(&hvsock_mutex);
+
+	/* Sockets whose connections have been closed terminated should
+	 * also be considered read, and we check the shutdown flag for that.
+	 */
+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
+	    hvsk->peer_shutdown & SEND_SHUTDOWN) {
+		mask |= POLLIN | POLLRDNORM;
+	}
+
+	/* Connected sockets that can produce data can be written. */
+	if (sk->sk_state == SS_CONNECTED && can_write &&
+	    !(sk->sk_shutdown & SEND_SHUTDOWN)) {
+		/* Remove POLLWRBAND since INET sockets are not setting it.
+		 */
+		mask |= POLLOUT | POLLWRNORM;
+	}
+
+	/* Simulate INET socket poll behaviors, which sets
+	 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
+	 * but local send is not shutdown.
+	 */
+	if (sk->sk_state == SS_UNCONNECTED &&
+	    !(sk->sk_shutdown & SEND_SHUTDOWN))
+		mask |= POLLOUT | POLLWRNORM;
+
+	release_sock(sk);
+
+	return mask;
+}
+
+/* This function runs in the tasklet context of process_chn_event() */
+static void hvsock_on_channel_cb(void *ctx)
+{
+	struct sock *sk = (struct sock *)ctx;
+	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
+	struct vmbus_channel *channel = hvsk->channel;
+	bool can_read, can_write;
+
+	if (!channel) {
+		WARN_ONCE(1, "NULL channel! There is a programming bug.\n");
+		return;
+	}
+
+	get_ringbuffer_rw_status(channel, &can_read, &can_write);
+
+	if (can_read)
+		sk->sk_data_ready(sk);
+
+	if (can_write)
+		sk->sk_write_space(sk);
+}
+
+static void hvsock_close_connection(struct vmbus_channel *channel)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	mutex_lock(&hvsock_mutex);
+
+	sk = __hvsock_find_connected_socket_by_channel(channel);
+
+	/* The guest has already closed the connection? */
+	if (!sk)
+		goto out;
+
+	sk->sk_socket->state = SS_UNCONNECTED;
+	sk->sk_state = SS_UNCONNECTED;
+	sock_set_flag(sk, SOCK_DONE);
+
+	hvsk = sk_to_hvsock(sk);
+	hvsk->peer_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
+
+	sk->sk_state_change(sk);
+out:
+	mutex_unlock(&hvsock_mutex);
+}
+
+static int hvsock_open_connection(struct vmbus_channel *channel)
+{
+	struct hvsock_sock *hvsk, *new_hvsk;
+	struct sockaddr_hv hv_addr;
+	struct sock *sk, *new_sk;
+
+	uuid_le *instance, *service_id;
+	int ret;
+
+	instance = &channel->offermsg.offer.if_instance;
+	service_id = &channel->offermsg.offer.if_type;
+
+	hvsock_addr_init(&hv_addr, *instance);
+
+	mutex_lock(&hvsock_mutex);
+
+	sk = __hvsock_find_bound_socket(&hv_addr);
+
+	if (sk) {
+		/* It is from the guest client's connect() */
+		if (sk->sk_state != SS_CONNECTING) {
+			ret = -ENXIO;
+			goto out;
+		}
+
+		hvsk = sk_to_hvsock(sk);
+		hvsk->channel = channel;
+		set_channel_read_state(channel, false);
+		vmbus_set_chn_rescind_callback(channel,
+					       hvsock_close_connection);
+		ret = vmbus_open(channel, VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND,
+				 VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV, NULL, 0,
+				 hvsock_on_channel_cb, sk);
+		if (ret != 0) {
+			hvsk->channel = NULL;
+			goto out;
+		}
+
+		set_channel_pending_send_size(channel,
+					      HVSOCK_PKT_LEN(PAGE_SIZE));
+		sk->sk_state = SS_CONNECTED;
+		sk->sk_socket->state = SS_CONNECTED;
+		hvsock_insert_connected(hvsk);
+		sk->sk_state_change(sk);
+		goto out;
+	}
+
+	/* Now we suppose it is from a host client's connect() */
+	hvsock_addr_init(&hv_addr, *service_id);
+	sk = __hvsock_find_bound_socket(&hv_addr);
+
+	/* No guest server listening? Well, let's ignore the offer */
+	if (!sk || sk->sk_state != SS_LISTEN) {
+		ret = -ENXIO;
+		goto out;
+	}
+
+	if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
+		ret = -EMFILE;
+		goto out;
+	}
+
+	new_sk = __hvsock_create(sock_net(sk), NULL, GFP_KERNEL, sk->sk_type);
+	if (!new_sk) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	new_hvsk = sk_to_hvsock(new_sk);
+	new_sk->sk_state = SS_CONNECTING;
+	hvsock_addr_init(&new_hvsk->local_addr, *service_id);
+	hvsock_addr_init(&new_hvsk->remote_addr, *instance);
+
+	set_channel_read_state(channel, false);
+	new_hvsk->channel = channel;
+	vmbus_set_chn_rescind_callback(channel, hvsock_close_connection);
+	ret = vmbus_open(channel, VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND,
+			 VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV, NULL, 0,
+			 hvsock_on_channel_cb, new_sk);
+	if (ret != 0) {
+		new_hvsk->channel = NULL;
+		sock_put(new_sk);
+		goto out;
+	}
+	set_channel_pending_send_size(channel, HVSOCK_PKT_LEN(PAGE_SIZE));
+
+	new_sk->sk_state = SS_CONNECTED;
+	hvsock_insert_connected(new_hvsk);
+	hvsock_enqueue_accept(sk, new_sk);
+	sk->sk_state_change(sk);
+out:
+	mutex_unlock(&hvsock_mutex);
+	return ret;
+}
+
+static void hvsock_connect_timeout(struct work_struct *work)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	hvsk = container_of(work, struct hvsock_sock, dwork.work);
+	sk = hvsock_to_sk(hvsk);
+
+	lock_sock(sk);
+	if ((sk->sk_state == SS_CONNECTING) &&
+	    (sk->sk_shutdown != SHUTDOWN_MASK)) {
+		sk->sk_state = SS_UNCONNECTED;
+		sk->sk_err = ETIMEDOUT;
+		sk->sk_error_report(sk);
+	}
+	release_sock(sk);
+
+	sock_put(sk);
+}
+
+static int hvsock_connect(struct socket *sock, struct sockaddr *addr,
+			  int addr_len, int flags)
+{
+	struct sockaddr_hv *remote_addr;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	DEFINE_WAIT(wait);
+	long timeout;
+
+	int ret = 0;
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+
+	lock_sock(sk);
+
+	switch (sock->state) {
+	case SS_CONNECTED:
+		ret = -EISCONN;
+		goto out;
+	case SS_DISCONNECTING:
+		ret = -EINVAL;
+		goto out;
+	case SS_CONNECTING:
+		/* This continues on so we can move sock into the SS_CONNECTED
+		 * state once the connection has completed (at which point err
+		 * will be set to zero also).  Otherwise, we will either wait
+		 * for the connection or return -EALREADY should this be a
+		 * non-blocking call.
+		 */
+		ret = -EALREADY;
+		break;
+	default:
+		if ((sk->sk_state == SS_LISTEN) ||
+		    hvsock_addr_cast(addr, addr_len, &remote_addr) != 0) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* Set the remote address that we are connecting to. */
+		memcpy(&hvsk->remote_addr, remote_addr,
+		       sizeof(hvsk->remote_addr));
+
+		ret = hvsock_auto_bind(hvsk);
+		if (ret)
+			goto out;
+
+		sk->sk_state = SS_CONNECTING;
+
+		ret = vmbus_send_tl_connect_request(
+					&hvsk->local_addr.shv_service_id,
+					&hvsk->remote_addr.shv_service_id);
+		if (ret < 0)
+			goto out;
+
+		/* Mark sock as connecting and set the error code to in
+		 * progress in case this is a non-blocking connect.
+		 */
+		sock->state = SS_CONNECTING;
+		ret = -EINPROGRESS;
+	}
+
+	/* The receive path will handle all communication until we are able to
+	 * enter the connected state.  Here we wait for the connection to be
+	 * completed or a notification of an error.
+	 */
+	timeout = 30 * HZ;
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (sk->sk_state != SS_CONNECTED && sk->sk_err == 0) {
+		if (flags & O_NONBLOCK) {
+			/* If we're not going to block, we schedule a timeout
+			 * function to generate a timeout on the connection
+			 * attempt, in case the peer doesn't respond in a
+			 * timely manner. We hold on to the socket until the
+			 * timeout fires.
+			 */
+			sock_hold(sk);
+			INIT_DELAYED_WORK(&hvsk->dwork,
+					  hvsock_connect_timeout);
+			schedule_delayed_work(&hvsk->dwork, timeout);
+
+			/* Skip ahead to preserve error code set above. */
+			goto out_wait;
+		}
+
+		release_sock(sk);
+		timeout = schedule_timeout(timeout);
+		lock_sock(sk);
+
+		if (signal_pending(current)) {
+			ret = sock_intr_errno(timeout);
+			goto out_wait_error;
+		} else if (timeout == 0) {
+			ret = -ETIMEDOUT;
+			goto out_wait_error;
+		}
+
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+	}
+
+	ret = sk->sk_err ? -sk->sk_err : 0;
+
+out_wait_error:
+	if (ret < 0) {
+		sk->sk_state = SS_UNCONNECTED;
+		sock->state = SS_UNCONNECTED;
+	}
+out_wait:
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static
+int hvsock_accept(struct socket *sock, struct socket *newsock, int flags)
+{
+	struct hvsock_sock *hvconnected;
+	struct sock *connected;
+	struct sock *listener;
+
+	DEFINE_WAIT(wait);
+	long timeout;
+
+	int ret = 0;
+
+	listener = sock->sk;
+
+	lock_sock(listener);
+
+	if (sock->type != SOCK_STREAM) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (listener->sk_state != SS_LISTEN) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Wait for children sockets to appear; these are the new sockets
+	 * created upon connection establishment.
+	 */
+	timeout = sock_sndtimeo(listener, flags & O_NONBLOCK);
+	prepare_to_wait(sk_sleep(listener), &wait, TASK_INTERRUPTIBLE);
+
+	while ((connected = hvsock_dequeue_accept(listener)) == NULL &&
+	       listener->sk_err == 0) {
+		release_sock(listener);
+		timeout = schedule_timeout(timeout);
+		lock_sock(listener);
+
+		if (signal_pending(current)) {
+			ret = sock_intr_errno(timeout);
+			goto out_wait;
+		} else if (timeout == 0) {
+			ret = -EAGAIN;
+			goto out_wait;
+		}
+
+		prepare_to_wait(sk_sleep(listener), &wait, TASK_INTERRUPTIBLE);
+	}
+
+	if (listener->sk_err)
+		ret = -listener->sk_err;
+
+	if (connected) {
+		lock_sock(connected);
+		hvconnected = sk_to_hvsock(connected);
+
+		/* If the listener socket has received an error, then we should
+		 * reject this socket and return.  Note that we simply mark the
+		 * socket rejected, drop our reference, and let the cleanup
+		 * function handle the cleanup; the fact that we found it in
+		 * the listener's accept queue guarantees that the cleanup
+		 * function hasn't run yet.
+		 */
+		if (ret) {
+			release_sock(connected);
+			sock_put(connected);
+			goto out_wait;
+		}
+
+		newsock->state = SS_CONNECTED;
+		sock_graft(connected, newsock);
+		release_sock(connected);
+		sock_put(connected);
+	}
+
+out_wait:
+	finish_wait(sk_sleep(listener), &wait);
+out:
+	release_sock(listener);
+	return ret;
+}
+
+static int hvsock_listen(struct socket *sock, int backlog)
+{
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+	int ret = 0;
+
+	sk = sock->sk;
+	lock_sock(sk);
+
+	if (sock->type != SOCK_STREAM) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	if (sock->state != SS_UNCONNECTED) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (backlog <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+	/* This is an artificial limit */
+	if (backlog > 128)
+		backlog = 128;
+
+	hvsk = sk_to_hvsock(sk);
+	if (!hvsock_addr_bound(&hvsk->local_addr)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	sk->sk_ack_backlog = 0;
+	sk->sk_max_ack_backlog = backlog;
+	sk->sk_state = SS_LISTEN;
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static int hvsock_setsockopt(struct socket *sock,
+			     int level,
+			     int optname,
+			     char __user *optval, unsigned int optlen)
+{
+	return -ENOPROTOOPT;
+}
+
+static int hvsock_getsockopt(struct socket *sock,
+			     int level,
+			     int optname,
+			     char __user *optval, int __user *optlen)
+{
+	return -ENOPROTOOPT;
+}
+
+static int hvsock_send_data(struct vmbus_channel *channel,
+			    struct hvsock_sock *hvsk,
+			    size_t to_write)
+{
+	hvsk->send.hdr.pkt_type = 1;
+	hvsk->send.hdr.data_size = to_write;
+	return vmbus_sendpacket(channel, &hvsk->send.hdr,
+				sizeof(hvsk->send.hdr) + to_write,
+				0, VM_PKT_DATA_INBAND, 0);
+}
+
+static int hvsock_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
+{
+	struct vmbus_channel *channel;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	size_t total_to_write = len;
+	size_t total_written = 0;
+
+	bool can_write;
+	long timeout;
+	int ret = 0;
+
+	DEFINE_WAIT(wait);
+
+	if (len == 0)
+		return -EINVAL;
+
+	if (msg->msg_flags & ~MSG_DONTWAIT) {
+		pr_err("hvsock_sendmsg: unsupported flags=0x%x\n",
+		       msg->msg_flags);
+		return -EOPNOTSUPP;
+	}
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+	channel = hvsk->channel;
+
+	lock_sock(sk);
+
+	/* Callers should not provide a destination with stream sockets. */
+	if (msg->msg_namelen) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	/* Send data only if both sides are not shutdown in the direction. */
+	if (sk->sk_shutdown & SEND_SHUTDOWN ||
+	    hvsk->peer_shutdown & RCV_SHUTDOWN) {
+		ret = -EPIPE;
+		goto out;
+	}
+
+	if (sk->sk_state != SS_CONNECTED ||
+	    !hvsock_addr_bound(&hvsk->local_addr)) {
+		ret = -ENOTCONN;
+		goto out;
+	}
+
+	if (!hvsock_addr_bound(&hvsk->remote_addr)) {
+		ret = -EDESTADDRREQ;
+		goto out;
+	}
+
+	timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (total_to_write > 0) {
+		size_t to_write;
+
+		while (1) {
+			get_ringbuffer_rw_status(channel, NULL, &can_write);
+
+			if (can_write || sk->sk_err != 0 ||
+			    (sk->sk_shutdown & SEND_SHUTDOWN) ||
+			    (hvsk->peer_shutdown & RCV_SHUTDOWN))
+				break;
+
+			/* Don't wait for non-blocking sockets. */
+			if (timeout == 0) {
+				ret = -EAGAIN;
+				goto out_wait;
+			}
+
+			release_sock(sk);
+
+			timeout = schedule_timeout(timeout);
+
+			lock_sock(sk);
+			if (signal_pending(current)) {
+				ret = sock_intr_errno(timeout);
+				goto out_wait;
+			} else if (timeout == 0) {
+				ret = -EAGAIN;
+				goto out_wait;
+			}
+
+			prepare_to_wait(sk_sleep(sk), &wait,
+					TASK_INTERRUPTIBLE);
+		}
+
+		/* These checks occur both as part of and after the loop
+		 * conditional since we need to check before and after
+		 * sleeping.
+		 */
+		if (sk->sk_err) {
+			ret = -sk->sk_err;
+			goto out_wait;
+		} else if ((sk->sk_shutdown & SEND_SHUTDOWN) ||
+			   (hvsk->peer_shutdown & RCV_SHUTDOWN)) {
+			ret = -EPIPE;
+			goto out_wait;
+		}
+
+		/* Note: that write will only write as many bytes as possible
+		 * in the ringbuffer. It is the caller's responsibility to
+		 * check how many bytes we actually wrote.
+		 */
+		do {
+			to_write = min_t(size_t, HVSOCK_SND_BUF_SZ,
+					 total_to_write);
+			ret = memcpy_from_msg(hvsk->send.buf, msg, to_write);
+			if (ret != 0)
+				goto out_wait;
+
+			ret = hvsock_send_data(channel, hvsk, to_write);
+			if (ret != 0)
+				goto out_wait;
+
+			total_written += to_write;
+			total_to_write -= to_write;
+		} while (total_to_write > 0);
+	}
+out_wait:
+	if (total_written > 0)
+		ret = total_written;
+
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+
+	/* ret is a bigger-than-0 total_written or a negative err code. */
+	if (ret == 0) {
+		WARN(1, "unexpected return value of 0\n");
+		ret = -EIO;
+	}
+
+	return ret;
+}
+
+static int hvsock_recv_data(struct vmbus_channel *channel,
+			    struct hvsock_sock *hvsk,
+			    size_t *payload_len)
+{
+	u32 buffer_actual_len;
+	u64 dummy_req_id;
+	int ret;
+
+	ret = vmbus_recvpacket(channel, &hvsk->recv.hdr,
+			       sizeof(hvsk->recv.hdr) + sizeof(hvsk->recv.buf),
+			       &buffer_actual_len, &dummy_req_id);
+	if (ret != 0 || buffer_actual_len <= sizeof(hvsk->recv.hdr))
+		*payload_len = 0;
+	else
+		*payload_len = hvsk->recv.hdr.data_size;
+
+	return ret;
+}
+
+static int hvsock_recvmsg(struct socket *sock, struct msghdr *msg,
+			  size_t len, int flags)
+{
+	struct vmbus_channel *channel;
+	struct hvsock_sock *hvsk;
+	struct sock *sk;
+
+	size_t total_to_read = len;
+	size_t copied;
+
+	bool can_read;
+	long timeout;
+
+	int ret = 0;
+
+	DEFINE_WAIT(wait);
+
+	sk = sock->sk;
+	hvsk = sk_to_hvsock(sk);
+	channel = hvsk->channel;
+
+	lock_sock(sk);
+
+	if (sk->sk_state != SS_CONNECTED) {
+		/* Recvmsg is supposed to return 0 if a peer performs an
+		 * orderly shutdown. Differentiate between that case and when a
+		 * peer has not connected or a local shutdown occurred with the
+		 * SOCK_DONE flag.
+		 */
+		if (sock_flag(sk, SOCK_DONE))
+			ret = 0;
+		else
+			ret = -ENOTCONN;
+
+		goto out;
+	}
+
+	/* We ignore msg->addr_name/len. */
+	if (flags & ~MSG_DONTWAIT) {
+		pr_err("hvsock_recvmsg: unsupported flags=0x%x\n", flags);
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	/* We don't check peer_shutdown flag here since peer may actually shut
+	 * down, but there can be data in the queue that a local socket can
+	 * receive.
+	 */
+	if (sk->sk_shutdown & RCV_SHUTDOWN) {
+		ret = 0;
+		goto out;
+	}
+
+	/* It is valid on Linux to pass in a zero-length receive buffer.  This
+	 * is not an error.  We may as well bail out now.
+	 */
+	if (!len) {
+		ret = 0;
+		goto out;
+	}
+
+	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+	copied = 0;
+
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+
+	while (1) {
+		bool need_refill = hvsk->recv.data_len == 0;
+
+		if (need_refill)
+			get_ringbuffer_rw_status(channel, &can_read, NULL);
+		else
+			can_read = true;
+
+		if (can_read) {
+			size_t payload_len;
+
+			if (need_refill) {
+				ret = hvsock_recv_data(channel, hvsk,
+						       &payload_len);
+				if (ret != 0 || payload_len == 0 ||
+				    payload_len > HVSOCK_RCV_BUF_SZ) {
+					ret = -EIO;
+					goto out_wait;
+				}
+
+				hvsk->recv.data_len = payload_len;
+				hvsk->recv.data_offset = 0;
+			}
+
+			if (hvsk->recv.data_len <= total_to_read) {
+				ret = memcpy_to_msg(msg, hvsk->recv.buf +
+						    hvsk->recv.data_offset,
+						    hvsk->recv.data_len);
+				if (ret != 0)
+					break;
+
+				copied += hvsk->recv.data_len;
+				total_to_read -= hvsk->recv.data_len;
+				hvsk->recv.data_len = 0;
+				hvsk->recv.data_offset = 0;
+
+				if (total_to_read == 0)
+					break;
+			} else {
+				ret = memcpy_to_msg(msg, hvsk->recv.buf +
+						    hvsk->recv.data_offset,
+						    total_to_read);
+				if (ret != 0)
+					break;
+
+				copied += total_to_read;
+				hvsk->recv.data_len -= total_to_read;
+				hvsk->recv.data_offset += total_to_read;
+				total_to_read = 0;
+				break;
+			}
+		} else {
+			if (sk->sk_err || (sk->sk_shutdown & RCV_SHUTDOWN) ||
+			    (hvsk->peer_shutdown & SEND_SHUTDOWN))
+				break;
+
+			/* Don't wait for non-blocking sockets. */
+			if (timeout == 0) {
+				ret = -EAGAIN;
+				break;
+			}
+
+			if (copied > 0)
+				break;
+
+			release_sock(sk);
+			timeout = schedule_timeout(timeout);
+			lock_sock(sk);
+
+			if (signal_pending(current)) {
+				ret = sock_intr_errno(timeout);
+				break;
+			} else if (timeout == 0) {
+				ret = -EAGAIN;
+				break;
+			}
+
+			prepare_to_wait(sk_sleep(sk), &wait,
+					TASK_INTERRUPTIBLE);
+		}
+	}
+
+	if (sk->sk_err)
+		ret = -sk->sk_err;
+	else if (sk->sk_shutdown & RCV_SHUTDOWN)
+		ret = 0;
+
+	if (copied > 0) {
+		ret = copied;
+
+		/* If the other side has shutdown for sending and there
+		 * is nothing more to read, then we modify the socket
+		 * state.
+		 */
+		if ((hvsk->peer_shutdown & SEND_SHUTDOWN) &&
+		    hvsk->recv.data_len == 0) {
+			get_ringbuffer_rw_status(channel, &can_read, NULL);
+			if (!can_read) {
+				sk->sk_state = SS_UNCONNECTED;
+				sock_set_flag(sk, SOCK_DONE);
+				sk->sk_state_change(sk);
+			}
+		}
+	}
+out_wait:
+	finish_wait(sk_sleep(sk), &wait);
+out:
+	release_sock(sk);
+	return ret;
+}
+
+static const struct proto_ops hvsock_ops = {
+	.family = PF_HYPERV,
+	.owner = THIS_MODULE,
+	.release = hvsock_release,
+	.bind = hvsock_bind,
+	.connect = hvsock_connect,
+	.socketpair = sock_no_socketpair,
+	.accept = hvsock_accept,
+	.getname = hvsock_getname,
+	.poll = hvsock_poll,
+	.ioctl = sock_no_ioctl,
+	.listen = hvsock_listen,
+	.shutdown = hvsock_shutdown,
+	.setsockopt = hvsock_setsockopt,
+	.getsockopt = hvsock_getsockopt,
+	.sendmsg = hvsock_sendmsg,
+	.recvmsg = hvsock_recvmsg,
+	.mmap = sock_no_mmap,
+	.sendpage = sock_no_sendpage,
+};
+
+static int hvsock_create(struct net *net, struct socket *sock,
+			 int protocol, int kern)
+{
+	if (!capable(CAP_SYS_ADMIN) && !capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (protocol != 0 && protocol != SHV_PROTO_RAW)
+		return -EPROTONOSUPPORT;
+
+	switch (sock->type) {
+	case SOCK_STREAM:
+		sock->ops = &hvsock_ops;
+		break;
+	default:
+		return -ESOCKTNOSUPPORT;
+	}
+
+	sock->state = SS_UNCONNECTED;
+
+	return __hvsock_create(net, sock, GFP_KERNEL, 0) ? 0 : -ENOMEM;
+}
+
+static const struct net_proto_family hvsock_family_ops = {
+	.family = AF_HYPERV,
+	.create = hvsock_create,
+	.owner = THIS_MODULE,
+};
+
+static int hvsock_probe(struct hv_device *hdev,
+			const struct hv_vmbus_device_id *dev_id)
+{
+	struct vmbus_channel *channel = hdev->channel;
+
+	/* We ignore the error return code to suppress the unnecessary
+	 * error message in vmbus_probe(): on error the host will rescind
+	 * the offer in 30 seconds and we can do cleanup at that time.
+	 */
+	(void)hvsock_open_connection(channel);
+
+	return 0;
+}
+
+static int hvsock_remove(struct hv_device *hdev)
+{
+	struct vmbus_channel *channel = hdev->channel;
+
+	vmbus_close(channel);
+
+	return 0;
+}
+
+/* It's not really used. See vmbus_match() and vmbus_probe(). */
+static const struct hv_vmbus_device_id id_table[] = {
+	{},
+};
+
+static struct hv_driver hvsock_drv = {
+	.name		= "hv_sock",
+	.hvsock		= true,
+	.id_table	= id_table,
+	.probe		= hvsock_probe,
+	.remove		= hvsock_remove,
+};
+
+static int __init hvsock_init(void)
+{
+	int ret;
+
+	/* Hyper-V Sockets requires at least VMBus 4.0 */
+	if ((vmbus_proto_version >> 16) < 4) {
+		pr_err("failed to load: VMBus 4 or later is required\n");
+		return -ENODEV;
+	}
+
+	ret = vmbus_driver_register(&hvsock_drv);
+	if (ret) {
+		pr_err("failed to register hv_sock driver\n");
+		return ret;
+	}
+
+	ret = proto_register(&hvsock_proto, 0);
+	if (ret) {
+		pr_err("failed to register protocol\n");
+		goto unreg_hvsock_drv;
+	}
+
+	ret = sock_register(&hvsock_family_ops);
+	if (ret) {
+		pr_err("failed to register address family\n");
+		goto unreg_proto;
+	}
+
+	return 0;
+
+unreg_proto:
+	proto_unregister(&hvsock_proto);
+unreg_hvsock_drv:
+	vmbus_driver_unregister(&hvsock_drv);
+	return ret;
+}
+
+static void __exit hvsock_exit(void)
+{
+	sock_unregister(AF_HYPERV);
+	proto_unregister(&hvsock_proto);
+	vmbus_driver_unregister(&hvsock_drv);
+}
+
+module_init(hvsock_init);
+module_exit(hvsock_exit);
+
+MODULE_DESCRIPTION("Hyper-V Sockets");
+MODULE_LICENSE("Dual BSD/GPL");
-- 
2.1.0

^ permalink raw reply related

* [PATCH v7 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)
From: Dexuan Cui @ 2016-04-07 12:50 UTC (permalink / raw)
  To: gregkh, davem, netdev, linux-kernel, devel, olaf, apw, jasowang,
	kys, haiyangz

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by
introducing a new socket address family AF_HYPERV.

Note: the VMBus driver side's supporting patches have been in the mainline
tree.

I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
on VMware VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
proposing AF_VSOCK of virtio version:
http://marc.info/?l=linux-netdev&m=145952064004765&w=2

However, though Hyper-V Sockets may seem conceptually similar to
AF_VOSCK, there are differences in the transportation layer, and IMO these
make the direct code reusing impractical:

1. In AF_VSOCK, the endpoint type is: <u32 ContextID, u32 Port>, but in
AF_HYPERV, the endpoint type is: <GUID VM_ID, GUID ServiceID>. Here GUID
is 128-bit.

2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.

3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE,
SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT.
These are meaningless to AF_HYPERV.

4. Some AF_VSOCK's VMCI transportation ops are meanless to AF_HYPERV/VMBus,
like .notify_recv_init
.notify_recv_pre_block
.notify_recv_pre_dequeue
.notify_recv_post_dequeue
.notify_send_init
.notify_send_pre_block
.notify_send_pre_enqueue
.notify_send_post_enqueue
etc.

So I think we'd better introduce a new address family: AF_HYPERV.

Please review the patch.

Looking forward to your comments!

Changes since v1:
- updated "[PATCH 6/7] hvsock: introduce Hyper-V VM Sockets feature"
- added __init and __exit for the module init/exit functions
- net/hv_sock/Kconfig: "default m" -> "default m if HYPERV"
- MODULE_LICENSE: "Dual MIT/GPL" -> "Dual BSD/GPL"

Changes since v2:
- fixed various coding issue pointed out by David Miller
- fixed indentation issues
- removed pr_debug in net/hv_sock/af_hvsock.c
- used reverse-Chrismas-tree style for local variables.
- EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL

Changes since v3:
- fixed a few coding issue pointed by Vitaly Kuznetsov and Dan Carpenter
- fixed the ret value in vmbus_recvpacket_hvsock on error
- fixed the style of multi-line comment: vmbus_get_hvsock_rw_status()

Changes since v4 (https://lkml.org/lkml/2015/7/28/404):
- addressed all the comments about V4.
- treat the hvsock offers/channels as special VMBus devices
- add a mechanism to pass hvsock events to the hvsock driver
- fixed some corner cases with proper locking when a connection is closed
- rebased to the latest Greg's tree

Changes since v5 (https://lkml.org/lkml/2015/12/24/103):
- addressed the coding style issues (Vitaly Kuznetsov & David Miller, thanks!)
- used a better coding for the per-channel rescind callback (Thank Vitaly!)
- avoided the introduction of new VMBUS driver APIs vmbus_sendpacket_hvsock()
and vmbus_recvpacket_hvsock() and used vmbus_sendpacket()/vmbus_recvpacket()
in the higher level (i.e., the vmsock driver). Thank Vitaly!

Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html)
- only a few minor changes of coding style and comments

Dexuan Cui (1):
  hv_sock: introduce Hyper-V Sockets

 MAINTAINERS                 |    2 +
 include/linux/hyperv.h      |   16 +
 include/linux/socket.h      |    5 +-
 include/net/af_hvsock.h     |   51 ++
 include/uapi/linux/hyperv.h |   16 +
 net/Kconfig                 |    1 +
 net/Makefile                |    1 +
 net/hv_sock/Kconfig         |   10 +
 net/hv_sock/Makefile        |    3 +
 net/hv_sock/af_hvsock.c     | 1481 +++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 1584 insertions(+), 2 deletions(-)
 create mode 100644 include/net/af_hvsock.h
 create mode 100644 net/hv_sock/Kconfig
 create mode 100644 net/hv_sock/Makefile
 create mode 100644 net/hv_sock/af_hvsock.c

-- 
2.1.0

^ permalink raw reply

* RE: [PATCH] wlcore: spi: add wl18xx support
From: Reizer, Eyal @ 2016-04-07 12:45 UTC (permalink / raw)
  To: Kalle Valo, Eyal Reizer
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <87shyx38tz.fsf-5ukZ45wKbUHoml4zekdYB16hYfS7NtTn@public.gmane.org>



> -----Original Message-----
> From: Kalle Valo [mailto:kvalo-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org]
> Sent: Thursday, April 07, 2016 3:25 PM
> To: Eyal Reizer
> Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-
> kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Reizer, Eyal; devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [PATCH] wlcore: spi: add wl18xx support
> 
> Eyal Reizer <eyalreizer-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
> 
> > Add support for using with both wl12xx and wl18xx.
> >
> > - all wilink family needs special init command for entering wspi mode.
> >   extra clock cycles should be sent after the spi init command while the
> >   cs pin is high.
> > - switch to controling the cs pin from the spi driver for achieveing the
> >   above.
> > - the selected cs gpio is read from the spi device-tree node using the
> >   cs-gpios field and setup as a gpio.
> > - See the example below for specifying the cs gpio using the cs-gpios
> > entry
> >
> > &spi0	{
> >  	status = "okay";
> > 	pinctrl-names = "default";
> > 	pinctrl-0 = <&spi0_pins>;
> > 	cs-gpios = <&gpio0 5 0>;
> > 	#address-cells = <1>;
> > 	#size-cells = <0>;
> >  	wlcore: wlcore@0 {
> > 		compatible = "ti,wl1835";
> > 		vwlan-supply = <&wlan_en_reg>;
> > 		spi-max-frequency = <48000000>;
> > 		reg = <0>;      /* chip select 0 on spi0, ie spi0.0 */
> > 		interrupt-parent = <&gpio0>;
> > 		interrupts = <27 IRQ_TYPE_EDGE_RISING>;
> >  	};
> > };
> >
> > Signed-off-by: Eyal Reizer <eyalr-l0cyMroinI0@public.gmane.org>
> 
> [...]
> 
> >  static const struct of_device_id wlcore_spi_of_match_table[] = {
> > -	{ .compatible = "ti,wl1271" },
> > +	{ .compatible = "ti,wl1271", .data = &wl12xx_data},
> > +	{ .compatible = "ti,wl1273", .data = &wl12xx_data},
> > +	{ .compatible = "ti,wl1281", .data = &wl12xx_data},
> > +	{ .compatible = "ti,wl1283", .data = &wl12xx_data},
> > +	{ .compatible = "ti,wl1801", .data = &wl18xx_data},
> > +	{ .compatible = "ti,wl1805", .data = &wl18xx_data},
> > +	{ .compatible = "ti,wl1807", .data = &wl18xx_data},
> > +	{ .compatible = "ti,wl1831", .data = &wl18xx_data},
> > +	{ .compatible = "ti,wl1835", .data = &wl18xx_data},
> > +	{ .compatible = "ti,wl1837", .data = &wl18xx_data},
> >  	{ }
> 
> Shouldn't you also update bindings/net/wireless/ti,wlcore,spi.txt? Now it only
> mentions about ti,wl1271 and not anything about the rest.

You are right! 
Will be fixed in v2

> 
> Adding devicetree list for further comments.
> 
> --
> Kalle Valo

Best Regards,
Eyal
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH iproute2 4/4] vxlan: add support for VXLAN-GPE
From: Jiri Benc @ 2016-04-07 12:36 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger
In-Reply-To: <cover.1460032274.git.jbenc@redhat.com>

Adds support to create a VXLAN-GPE interface.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 ip/iplink_vxlan.c     | 13 +++++++++++--
 man/man8/ip-link.8.in |  9 +++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index e9d64b451677..49a40befa5d5 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -31,7 +31,7 @@ static void print_explain(FILE *f)
 	fprintf(f, "                 [ ageing SECONDS ] [ maxaddress NUMBER ]\n");
 	fprintf(f, "                 [ [no]udpcsum ] [ [no]udp6zerocsumtx ] [ [no]udp6zerocsumrx ]\n");
 	fprintf(f, "                 [ [no]remcsumtx ] [ [no]remcsumrx ]\n");
-	fprintf(f, "                 [ [no]external ] [ gbp ]\n");
+	fprintf(f, "                 [ [no]external ] [ gbp ] [ gpe ]\n");
 	fprintf(f, "\n");
 	fprintf(f, "Where: VNI   := 0-16777215\n");
 	fprintf(f, "       ADDR  := { IP_ADDRESS | any }\n");
@@ -79,6 +79,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 	__u8 remcsumrx = 0;
 	__u8 metadata = 0;
 	__u8 gbp = 0;
+	__u8 gpe = 0;
 	int dst_port_set = 0;
 	struct ifla_vxlan_port_range range = { 0, 0 };
 
@@ -239,6 +240,8 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			metadata = 0;
 		} else if (!matches(*argv, "gbp")) {
 			gbp = 1;
+		} else if (!matches(*argv, "gpe")) {
+			gpe = 1;
 		} else if (matches(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -267,7 +270,9 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 		return -1;
 	}
 
-	if (!dst_port_set) {
+	if (!dst_port_set && gpe) {
+		dstport = 4790;
+	} else if (!dst_port_set) {
 		fprintf(stderr, "vxlan: destination port not specified\n"
 			"Will use Linux kernel default (non-standard value)\n");
 		fprintf(stderr,
@@ -324,6 +329,8 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 
 	if (gbp)
 		addattr_l(n, 1024, IFLA_VXLAN_GBP, NULL, 0);
+	if (gpe)
+		addattr_l(n, 1024, IFLA_VXLAN_GPE, NULL, 0);
 
 
 	return 0;
@@ -490,6 +497,8 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 
 	if (tb[IFLA_VXLAN_GBP])
 		fputs("gbp ", f);
+	if (tb[IFLA_VXLAN_GPE])
+		fputs("gpe ", f);
 }
 
 static void vxlan_print_help(struct link_util *lu, int argc, char **argv,
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index f677f8c55365..984fb2eb0d63 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -425,6 +425,8 @@ the following additional arguments are supported:
 .RI "[no]external "
 ] [
 .B gbp
+] [
+.B gpe
 ]
 
 .in +8
@@ -566,6 +568,13 @@ Example:
 
 .in -4
 
+.sp
+.B gpe
+- enables the Generic Protocol extension (VXLAN-GPE). Currently, this is
+only supported together with the
+.B external
+keyword.
+
 .in -8
 
 .TP
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH iproute2 3/4] ip-link.8: document "external" flag for vxlan
From: Jiri Benc @ 2016-04-07 12:36 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger
In-Reply-To: <cover.1460032274.git.jbenc@redhat.com>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 man/man8/ip-link.8.in | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 805511423ef2..f677f8c55365 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -422,6 +422,8 @@ the following additional arguments are supported:
 ] [
 .BI maxaddress " NUMBER "
 ] [
+.RI "[no]external "
+] [
 .B gbp
 ]
 
@@ -516,6 +518,12 @@ are entered into the VXLAN device forwarding database.
 - specifies the maximum number of FDB entries.
 
 .sp
+.I [no]external
+- specifies whether an external control plane
+.RB "(e.g. " "ip route encap" )
+or the internal FDB should be used.
+
+.sp
 .B gbp
 - enables the Group Policy extension (VXLAN-GBP).
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH iproute2 2/4] vxlan: 'external' implies 'nolearning'
From: Jiri Benc @ 2016-04-07 12:36 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger
In-Reply-To: <cover.1460032274.git.jbenc@redhat.com>

It doesn't make sense to use external control plane and fill internal FDB at
the same time. It's even an illegal combination for VXLAN-GPE.

Just switch off learning when 'external' is specified.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 ip/iplink_vxlan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index e3bbea0031df..e9d64b451677 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -234,6 +234,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			remcsumrx = 0;
 		} else if (!matches(*argv, "external")) {
 			metadata = 1;
+			learning = 0;
 		} else if (!matches(*argv, "noexternal")) {
 			metadata = 0;
 		} else if (!matches(*argv, "gbp")) {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH iproute2 1/4] if_link.h: rebase to the kernel
From: Jiri Benc @ 2016-04-07 12:36 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger
In-Reply-To: <cover.1460032274.git.jbenc@redhat.com>

Not intended to be applied directly. Instead, I assume Stephen will rebase
all headers to net-next as usual.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 include/linux/if_link.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 9b7f3c8d64c4..7e6fe6798a41 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -484,6 +484,7 @@ enum {
 	IFLA_VXLAN_REMCSUM_NOPARTIAL,
 	IFLA_VXLAN_COLLECT_METADATA,
 	IFLA_VXLAN_LABEL,
+	IFLA_VXLAN_GPE,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH iproute2 0/4] vxlan: add VXLAN-GPE support
From: Jiri Benc @ 2016-04-07 12:36 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger

This patchset adds a couple of refinements to the 'external' keyword and
adds support for configuring VXLAN-GPE interfaces.

The VXLAN-GPE support was recently merged to net-next.

The first patch is not intended to be applied directly but I still include
it in order for this patchset to be compilable and testable by those
interested.

Jiri Benc (4):
  if_link.h: rebase to the kernel
  vxlan: 'external' implies 'nolearning'
  ip-link.8: document "external" flag for vxlan
  vxlan: add support for VXLAN-GPE

 include/linux/if_link.h |  1 +
 ip/iplink_vxlan.c       | 14 ++++++++++++--
 man/man8/ip-link.8.in   | 17 +++++++++++++++++
 3 files changed, 30 insertions(+), 2 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [PATCH] wlcore: spi: add wl18xx support
From: Kalle Valo @ 2016-04-07 12:24 UTC (permalink / raw)
  To: Eyal Reizer
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eyal Reizer,
	devicetree-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1459343218-24536-1-git-send-email-eyalr-l0cyMroinI0@public.gmane.org>

Eyal Reizer <eyalreizer-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> Add support for using with both wl12xx and wl18xx.
>
> - all wilink family needs special init command for entering wspi mode.
>   extra clock cycles should be sent after the spi init command while the
>   cs pin is high.
> - switch to controling the cs pin from the spi driver for achieveing the
>   above.
> - the selected cs gpio is read from the spi device-tree node using the
>   cs-gpios field and setup as a gpio.
> - See the example below for specifying the cs gpio using the cs-gpios entry
>
> &spi0	{
>  	status = "okay";
> 	pinctrl-names = "default";
> 	pinctrl-0 = <&spi0_pins>;
> 	cs-gpios = <&gpio0 5 0>;
> 	#address-cells = <1>;
> 	#size-cells = <0>;
>  	wlcore: wlcore@0 {
> 		compatible = "ti,wl1835";
> 		vwlan-supply = <&wlan_en_reg>;
> 		spi-max-frequency = <48000000>;
> 		reg = <0>;      /* chip select 0 on spi0, ie spi0.0 */
> 		interrupt-parent = <&gpio0>;
> 		interrupts = <27 IRQ_TYPE_EDGE_RISING>;
>  	};
> };
>
> Signed-off-by: Eyal Reizer <eyalr-l0cyMroinI0@public.gmane.org>

[...]

>  static const struct of_device_id wlcore_spi_of_match_table[] = {
> -	{ .compatible = "ti,wl1271" },
> +	{ .compatible = "ti,wl1271", .data = &wl12xx_data},
> +	{ .compatible = "ti,wl1273", .data = &wl12xx_data},
> +	{ .compatible = "ti,wl1281", .data = &wl12xx_data},
> +	{ .compatible = "ti,wl1283", .data = &wl12xx_data},
> +	{ .compatible = "ti,wl1801", .data = &wl18xx_data},
> +	{ .compatible = "ti,wl1805", .data = &wl18xx_data},
> +	{ .compatible = "ti,wl1807", .data = &wl18xx_data},
> +	{ .compatible = "ti,wl1831", .data = &wl18xx_data},
> +	{ .compatible = "ti,wl1835", .data = &wl18xx_data},
> +	{ .compatible = "ti,wl1837", .data = &wl18xx_data},
>  	{ }

Shouldn't you also update bindings/net/wireless/ti,wlcore,spi.txt? Now
it only mentions about ti,wl1271 and not anything about the rest.

Adding devicetree list for further comments.

-- 
Kalle Valo
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] wlcore: spi: add wl18xx support
From: Kalle Valo @ 2016-04-07 12:16 UTC (permalink / raw)
  To: Reizer, Eyal
  Cc: Eyal Reizer, linux-wireless@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <8665E2433BC68541A24DFFCA87B70F5B360B3C81@DFRE01.ent.ti.com>

"Reizer, Eyal" <eyalr@ti.com> writes:

> Ping on this patch
>
>> -----Original Message-----
>> From: Eyal Reizer [mailto:eyalreizer@gmail.com]
>> Sent: Wednesday, March 30, 2016 4:07 PM
>> To: kvalo@codeaurora.org; linux-wireless@vger.kernel.org;
>> netdev@vger.kernel.org; linux-kernel@vger.kernel.org
>> Cc: Reizer, Eyal
>> Subject: [PATCH] wlcore: spi: add wl18xx support

Please edit your quotes and don't top most. A oneliner and then followed
by almost 400 lines unnecessary text for example makes it harder to use
patchwork:

https://patchwork.kernel.org/patch/8696181/

-- 
Kalle Valo

^ permalink raw reply

* [PATCH 2/2] drivers: net: cpsw: drop host_port field from struct cpsw_priv
From: Grygorii Strashko @ 2016-04-07 12:16 UTC (permalink / raw)
  To: David S. Miller, netdev, Mugunthan V N
  Cc: Sekhar Nori, linux-kernel, linux-omap, Grygorii Strashko
In-Reply-To: <1460031404-28594-1-git-send-email-grygorii.strashko@ti.com>

The host_port field is constantly assigned to 0 and this value has
never changed (since time when cpsw driver was introduced. More over,
if this field will be assigned to non 0 value it will break current
driver functionality.

Hence, there are no reasons to continue maintaining this host_port
field and it can be removed, and the HOST_PORT_NUM and ALE_PORT_HOST
defines can be used instead.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 5292e70..54bcc38 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -381,7 +381,6 @@ struct cpsw_priv {
 	u32				coal_intvl;
 	u32				bus_freq_mhz;
 	int				rx_packet_max;
-	int				host_port;
 	struct clk			*clk;
 	u8				mac_addr[ETH_ALEN];
 	struct cpsw_slave		*slaves;
@@ -531,7 +530,7 @@ static const struct cpsw_stats cpsw_gstrings_stats[] = {
 			int slave_port = cpsw_get_slave_port(priv,	\
 						slave->slave_num);	\
 			cpsw_ale_add_mcast(priv->ale, addr,		\
-				1 << slave_port | 1 << priv->host_port,	\
+				1 << slave_port | ALE_PORT_HOST,	\
 				ALE_VLAN, slave->port_vlan, 0);		\
 		} else {						\
 			cpsw_ale_add_mcast(priv->ale, addr,		\
@@ -542,10 +541,7 @@ static const struct cpsw_stats cpsw_gstrings_stats[] = {
 
 static inline int cpsw_get_slave_port(struct cpsw_priv *priv, u32 slave_num)
 {
-	if (priv->host_port == 0)
-		return slave_num + 1;
-	else
-		return slave_num;
+	return slave_num + 1;
 }
 
 static void cpsw_set_promiscious(struct net_device *ndev, bool enable)
@@ -1090,7 +1086,7 @@ static inline void cpsw_add_dual_emac_def_ale_entries(
 		struct cpsw_priv *priv, struct cpsw_slave *slave,
 		u32 slave_port)
 {
-	u32 port_mask = 1 << slave_port | 1 << priv->host_port;
+	u32 port_mask = 1 << slave_port | ALE_PORT_HOST;
 
 	if (priv->version == CPSW_VERSION_1)
 		slave_write(slave, slave->port_vlan, CPSW1_PORT_VLAN);
@@ -1101,7 +1097,7 @@ static inline void cpsw_add_dual_emac_def_ale_entries(
 	cpsw_ale_add_mcast(priv->ale, priv->ndev->broadcast,
 			   port_mask, ALE_VLAN, slave->port_vlan, 0);
 	cpsw_ale_add_ucast(priv->ale, priv->mac_addr,
-		priv->host_port, ALE_VLAN | ALE_SECURE, slave->port_vlan);
+		HOST_PORT_NUM, ALE_VLAN | ALE_SECURE, slave->port_vlan);
 }
 
 static void soft_reset_slave(struct cpsw_slave *slave)
@@ -1202,7 +1198,7 @@ static void cpsw_init_host_port(struct cpsw_priv *priv)
 	cpsw_ale_start(priv->ale);
 
 	/* switch to vlan unaware mode */
-	cpsw_ale_control_set(priv->ale, priv->host_port, ALE_VLAN_AWARE,
+	cpsw_ale_control_set(priv->ale, HOST_PORT_NUM, ALE_VLAN_AWARE,
 			     CPSW_ALE_VLAN_AWARE);
 	control_reg = readl(&priv->regs->control);
 	control_reg |= CPSW_VLAN_AWARE;
@@ -1216,14 +1212,14 @@ static void cpsw_init_host_port(struct cpsw_priv *priv)
 		     &priv->host_port_regs->cpdma_tx_pri_map);
 	__raw_writel(0, &priv->host_port_regs->cpdma_rx_chan_map);
 
-	cpsw_ale_control_set(priv->ale, priv->host_port,
+	cpsw_ale_control_set(priv->ale, HOST_PORT_NUM,
 			     ALE_PORT_STATE, ALE_PORT_STATE_FORWARD);
 
 	if (!priv->data.dual_emac) {
-		cpsw_ale_add_ucast(priv->ale, priv->mac_addr, priv->host_port,
+		cpsw_ale_add_ucast(priv->ale, priv->mac_addr, HOST_PORT_NUM,
 				   0, 0);
 		cpsw_ale_add_mcast(priv->ale, priv->ndev->broadcast,
-				   1 << priv->host_port, 0, 0, ALE_MCAST_FWD_2);
+				   ALE_PORT_HOST, 0, 0, ALE_MCAST_FWD_2);
 	}
 }
 
@@ -1616,9 +1612,9 @@ static int cpsw_ndo_set_mac_address(struct net_device *ndev, void *p)
 		flags = ALE_VLAN;
 	}
 
-	cpsw_ale_del_ucast(priv->ale, priv->mac_addr, priv->host_port,
+	cpsw_ale_del_ucast(priv->ale, priv->mac_addr, HOST_PORT_NUM,
 			   flags, vid);
-	cpsw_ale_add_ucast(priv->ale, addr->sa_data, priv->host_port,
+	cpsw_ale_add_ucast(priv->ale, addr->sa_data, HOST_PORT_NUM,
 			   flags, vid);
 
 	memcpy(priv->mac_addr, addr->sa_data, ETH_ALEN);
@@ -1667,7 +1663,7 @@ static inline int cpsw_add_vlan_ale_entry(struct cpsw_priv *priv,
 		return ret;
 
 	ret = cpsw_ale_add_ucast(priv->ale, priv->mac_addr,
-				 priv->host_port, ALE_VLAN, vid);
+				 HOST_PORT_NUM, ALE_VLAN, vid);
 	if (ret != 0)
 		goto clean_vid;
 
@@ -1679,7 +1675,7 @@ static inline int cpsw_add_vlan_ale_entry(struct cpsw_priv *priv,
 
 clean_vlan_ucast:
 	cpsw_ale_del_ucast(priv->ale, priv->mac_addr,
-			    priv->host_port, ALE_VLAN, vid);
+			   HOST_PORT_NUM, ALE_VLAN, vid);
 clean_vid:
 	cpsw_ale_del_vlan(priv->ale, vid, 0);
 	return ret;
@@ -2148,7 +2144,6 @@ static int cpsw_probe_dual_emac(struct platform_device *pdev,
 	priv_sl2->bus_freq_mhz = priv->bus_freq_mhz;
 
 	priv_sl2->regs = priv->regs;
-	priv_sl2->host_port = priv->host_port;
 	priv_sl2->host_port_regs = priv->host_port_regs;
 	priv_sl2->wr_regs = priv->wr_regs;
 	priv_sl2->hw_stats = priv->hw_stats;
@@ -2317,7 +2312,6 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_runtime_disable_ret;
 	}
 	priv->regs = ss_regs;
-	priv->host_port = HOST_PORT_NUM;
 
 	/* Need to enable clocks with runtime PM api to access module
 	 * registers
-- 
2.8.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox