Netdev List
* Re: [net-next PATCH 6/8] mlx4: Add support for inner IPv6 checksum offloads and TSO
From: Saeed Mahameed @ 2016-04-26 14:37 UTC
  To: Alexander Duyck, talal, netdev, davem, galp, ogerlitz, eranbe
In-Reply-To: <20160425183133.11331.54774.stgit@ahduyck-xeon-server>



On 4/25/2016 9:31 PM, Alexander Duyck wrote:
> From what I can tell the ConnectX-3 will support an inner IPv6 checksum and
> segmentation offload, however it cannot support outer IPv6 headers.  For
> this reason I am adding the feature to the hw_enc_features and adding an
> extra check to the features_check call that will disable GSO and checksum
> offload in the case that the encapsulated frame has an outer IP version
> that is not 4.

Hi Alex,

Can you share the commands you used to test VXLAN over IPv6, and what
exactly didn't work for you?
We would like to test this in-house and understand what went wrong;
theoretically there shouldn't be a difference between IPv6 and IPv4 outer
checksum offloading in ConnectX-3.

Anyway, I suspect it might be related to a driver bug, most likely in the
get_real_size function in en_tx.c, specifically in:

    *lso_header_size = (skb_inner_transport_header(skb) - skb->data) +
                       inner_tcp_hdrlen(skb);

I will check this and get back to you.
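
For readers following along, here is a minimal sketch of what the expression
above computes. It is only an illustration: struct tunnel_layout and
lso_header_size() are hypothetical stand-ins written for this note, not the
driver's get_real_size code or the kernel's struct sk_buff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two sk_buff-derived values the expression
 * uses; this is not the kernel's struct sk_buff. */
struct tunnel_layout {
	size_t inner_transport_off; /* skb_inner_transport_header(skb) - skb->data */
	size_t inner_tcp_hdrlen;    /* inner_tcp_hdrlen(skb), options included */
};

/* Bytes replicated in front of every TSO segment: all headers up to and
 * including the inner TCP header. */
static size_t lso_header_size(const struct tunnel_layout *l)
{
	assert(l->inner_tcp_hdrlen >= 20); /* a TCP header is at least 20 bytes */
	return l->inner_transport_off + l->inner_tcp_hdrlen;
}
```

Assuming untagged Ethernet, no IP options or extension headers, and an inner
IPv6/TCP frame, a VXLAN packet with an IPv4 outer header puts the inner TCP
header at 14 + 20 + 8 + 8 + 14 + 40 = 104 bytes, giving 124 with a 20-byte
TCP header. With an IPv6 outer header the offset grows by 20 bytes, giving
144.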

I will also go through the mlx5 patches later today.

> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   25 +++++++++++++++++++-----
>   drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   15 ++++++++++++--
>   2 files changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index bce37cbfde24..6f28ac58251c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -2357,8 +2357,10 @@ out:
>   	}
>   
>   	/* set offloads */
> -	priv->dev->hw_enc_features |= NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
> -				      NETIF_F_TSO | NETIF_F_GSO_UDP_TUNNEL |
> +	priv->dev->hw_enc_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> +				      NETIF_F_RXCSUM |
> +				      NETIF_F_TSO | NETIF_F_TSO6 |
> +				      NETIF_F_GSO_UDP_TUNNEL |
>   				      NETIF_F_GSO_UDP_TUNNEL_CSUM |
>   				      NETIF_F_GSO_PARTIAL;
>   }
> @@ -2369,8 +2371,10 @@ static void mlx4_en_del_vxlan_offloads(struct work_struct *work)
>   	struct mlx4_en_priv *priv = container_of(work, struct mlx4_en_priv,
>   						 vxlan_del_task);
>   	/* unset offloads */
> -	priv->dev->hw_enc_features &= ~(NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
> -					NETIF_F_TSO | NETIF_F_GSO_UDP_TUNNEL |
> +	priv->dev->hw_enc_features &= ~(NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
> +					NETIF_F_RXCSUM |
> +					NETIF_F_TSO | NETIF_F_TSO6 |
> +					NETIF_F_GSO_UDP_TUNNEL |
>   					NETIF_F_GSO_UDP_TUNNEL_CSUM |
>   					NETIF_F_GSO_PARTIAL);
>   
> @@ -2431,7 +2435,18 @@ static netdev_features_t mlx4_en_features_check(struct sk_buff *skb,
>   						netdev_features_t features)
>   {
>   	features = vlan_features_check(skb, features);
> -	return vxlan_features_check(skb, features);
> +	features = vxlan_features_check(skb, features);
> +
> +	/* The ConnectX-3 doesn't support outer IPv6 checksums but it does
> +	 * support inner IPv6 checksums and segmentation so  we need to
> +	 * strip that feature if this is an IPv6 encapsulated frame.
> +	 */
> +	if (skb->encapsulation &&
> +	    (skb->ip_summed == CHECKSUM_PARTIAL) &&
> +	    (ip_hdr(skb)->version != 4))
> +		features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
Deja vu: didn't you already fix this in harmonize_features?
I.e., it is enough to do this here:

if (skb->encapsulation && (skb->ip_summed == CHECKSUM_PARTIAL))
	features &= ~NETIF_F_IPV6_CSUM;
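
To make the difference between the two checks concrete, here is a small
sketch that compares them on a plain bitmask. The F_* bit values are
illustrative assumptions, not the real NETIF_F_* definitions from
include/linux/netdev_features.h, and both function names are made up for
this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; the real NETIF_F_* masks differ. */
#define F_IP_CSUM   (1u << 0)
#define F_IPV6_CSUM (1u << 1)
#define F_TSO       (1u << 2)
#define F_TSO6      (1u << 3)
#define F_CSUM_MASK (F_IP_CSUM | F_IPV6_CSUM)
#define F_GSO_MASK  (F_TSO | F_TSO6)

/* Check from the patch: drop every checksum and GSO offload when the
 * outer IP header of an encapsulated frame is not IPv4. */
static uint32_t features_check_patch(uint32_t features, bool encapsulated,
				     bool csum_partial,
				     unsigned int outer_ip_version)
{
	assert(outer_ip_version == 4 || outer_ip_version == 6);
	if (encapsulated && csum_partial && outer_ip_version != 4)
		features &= ~(F_CSUM_MASK | F_GSO_MASK);
	return features;
}

/* Check proposed in the reply: only clear the IPv6 checksum bit for any
 * encapsulated frame that still needs a checksum. */
static uint32_t features_check_reply(uint32_t features, bool encapsulated,
				     bool csum_partial)
{
	if (encapsulated && csum_partial)
		features &= ~F_IPV6_CSUM;
	return features;
}
```

In this simplified model the patch variant leaves IPv4-outer frames untouched
and strips everything for IPv6-outer frames, while the reply variant clears a
single bit regardless of the outer IP version. Whether the rest of the stack
then reaches the same result is exactly the question raised above.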


> +
> +	return features;
>   }
>   #endif
>   
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index c0d7b7296236..c9f5388ea22a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -41,6 +41,7 @@
>   #include <linux/vmalloc.h>
>   #include <linux/tcp.h>
>   #include <linux/ip.h>
> +#include <linux/ipv6.h>
>   #include <linux/moduleparam.h>
>   
>   #include "mlx4_en.h"
> @@ -918,8 +919,18 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>   				 tx_ind, fragptr);
>   
>   	if (skb->encapsulation) {
> -		struct iphdr *ipv4 = (struct iphdr *)skb_inner_network_header(skb);
> -		if (ipv4->protocol == IPPROTO_TCP || ipv4->protocol == IPPROTO_UDP)
> +		union {
> +			struct iphdr *v4;
> +			struct ipv6hdr *v6;
> +			unsigned char *hdr;
> +		} ip;
> +		u8 proto;
> +
> +		ip.hdr = skb_inner_network_header(skb);
> +		proto = (ip.v4->version == 4) ? ip.v4->protocol :
> +						ip.v6->nexthdr;
> +
> +		if (proto == IPPROTO_TCP || proto == IPPROTO_UDP)
>   			op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP | MLX4_WQE_CTRL_ILP);
>   		else
>   			op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);

Basically this is a bug fix; I don't know why the original author
assumed it would always be IPv4!
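
As an illustration of the version test in the hunk above, the sketch below
reads the L4 protocol from a raw inner IP header. inner_l4_proto() is a
hypothetical helper written for this note; it works on a byte buffer rather
than on struct iphdr / struct ipv6hdr, and, like the hunk, it does not walk
IPv6 extension headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the L4 protocol number from a raw inner IP header.
 * IPv4 stores it in byte 9 (Protocol), IPv6 in byte 6 (Next Header).
 * Any version other than 4 is treated as IPv6, as in the quoted hunk. */
static uint8_t inner_l4_proto(const uint8_t *hdr, size_t len)
{
	uint8_t version;

	assert(hdr != NULL && len >= 1);
	version = hdr[0] >> 4; /* version is the top nibble in both formats */

	if (version == 4) {
		assert(len >= 20); /* minimum IPv4 header */
		return hdr[9];
	}

	assert(len >= 40); /* fixed IPv6 header */
	return hdr[6];
}
```

Reading byte 9 of an IPv6 header, which is what the old IPv4-only code
effectively did, lands inside the source address, so the TCP/UDP test would
be made against an arbitrary address byte.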


* Re: codel: split into multiple files
From: Jens Axboe @ 2016-04-26 14:37 UTC
  To: Michal Kazior, sedat.dilek; +Cc: David S. Miller, netdev@vger.kernel.org
In-Reply-To: <CA+BoTQmovjcYFA4JQBDNQwJHmu=MjZg_HnDn4+KbTi6vYcVV7A@mail.gmail.com>

On 04/26/2016 06:36 AM, Michal Kazior wrote:
> On 26 April 2016 at 08:43, Sedat Dilek <sedat.dilek@gmail.com> wrote:
>> On 4/26/16, Michal Kazior <michal.kazior@tieto.com> wrote:
>>> On 26 April 2016 at 08:09, Sedat Dilek <sedat.dilek@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I had a very quick view on net-next.git#master (up to commit
>>>> fab7b629a82da1b59620470d13152aff975239f6).
>>>>
>>>> Commit in [1] aka "codel: split into multiple files" removed codel.h
>>>> but [2] and [3] have relicts to it.
>>>> Forgot to remove?
>>>
>>> codel.h was not removed. diffstat for codel.h is all red which I
>>> presume is why you thought of it as removed, see:
>>>
>>> http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/tree/include/net/codel.h?id=d068ca2ae2e614b9a418fb3b5f1fd4cf996ff032
>>>
>>
>> [ CC Jens ]
>>
>> OK.
>> So what are the plans in the future?
>> Keep a "generic" codel.h (compatibility reasons?) for net or is it your split?
>
> I'm interested in re-using codel in mac80211 for wireless. cfg80211
> drivers may want to do that as well later. Even vendor drivers could
> start to use it (I can dream :).
>
> I plan to re-spin my patches soonish re-based on the new codel.h/fq.h
> approach. There's quite a few spins already[1].
>
>
>> AFAICS I have seen a codel-implementation in block.git#wb-buf-throttle.
>> Does it make sense to have a more "super-generic" codel.h for re-use
>> (not only for net and block)?
>> Just a thought.
>
> Oh, I'm not really familiar with block and problems around it but it
> sounds reasonable and interesting. It doesn't look like it blatantly
> copies codel though (I did that in my initial mac80211 patches with
> some adjustments, you can check that in the link[1] which you can
> lookup via my patchset's cover letter[2]; I've based off of codel5[3]
> back then).

The block version is an adaptation; I guess you can say it pays homage
to CoDel. But there are enough differences between networking and
storage that I don't think a fully generic version is really feasible.
My favorite thing to bring up is the fact that we don't have the luxury
of dropping packets on the storage side...

-- 
Jens Axboe


* Re: [PATCH] net/mlx5e: avoid stack overflow in mlx5e_open_channels
From: Saeed Mahameed @ 2016-04-26 14:41 UTC
  To: Arnd Bergmann
  Cc: Saeed Mahameed, Matan Barak, Leon Romanovsky, David S. Miller,
	Achiad Shochat, Or Gerlitz, Amir Vadai, Tariq Toukan,
	Linux Netdev List, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <5940636.9Nic57IXcZ@wuerfel>

On Tue, Apr 26, 2016 at 4:53 PM, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote:
>
> Sure, do you want to just edit this when you forward the patch, or
> do you need me to do it?
>

Well, I won't say no if you want to do it :)

Saeed

* Re: [PATCH 4/5] batman-adv: Reduce refcnt of removed router when updating route
From: Sergei Shtylyov @ 2016-04-26 14:42 UTC
  To: Antonio Quartulli, davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, Marek Lindner
In-Reply-To: <1461641239-7097-5-git-send-email-a-2CpIooy/SPIKlTDg6p0iyA@public.gmane.org>

Hello.

On 4/26/2016 6:27 AM, Antonio Quartulli wrote:

> From: Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
>
> _batadv_update_route rcu_derefences orig_ifinfo->router outside of a
> spinlock protected region to print some information messages to the debug
> log. But this pointer is not checked again when the new pointer is assigned
> in the spinlock protected region. Thus is can happen that the value of

    Thus is can? :-)

> orig_ifinfo->router changed in the meantime and thus the reference counter
> of the wrong router gets reduced after the spinlock protected region.
>
> Just rcu_dereferencing the value of orig_ifinfo->router inside the spinlock
> protected region (which also set the new pointer) is enough to get the
> correct old router object.
>
> Fixes: e1a5382f978b ("batman-adv: Make orig_node->router an rcu protected pointer")
> Signed-off-by: Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
> Signed-off-by: Marek Lindner <mareklindner-rVWd3aGhH2z5bpWLKbzFeg@public.gmane.org>
> Signed-off-by: Antonio Quartulli <a@unstable.cc>
> ---
>  net/batman-adv/routing.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/net/batman-adv/routing.c b/net/batman-adv/routing.c
> index 4dd646a52f1a..b781bf753250 100644
> --- a/net/batman-adv/routing.c
> +++ b/net/batman-adv/routing.c
> @@ -105,6 +105,15 @@ static void _batadv_update_route(struct batadv_priv *bat_priv,
>  		neigh_node = NULL;
>
>  	spin_lock_bh(&orig_node->neigh_list_lock);
> +	/* curr_router used earlier may not be the current orig_ifinfo->router
> +	 * anymore because it was dereferenced outside of the neigh_list_lock
> +	 * protected region. After the new best neighbor has replace the current

    Replaced.

[...]

MBR, Sergei
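
For context, the race described in the quoted commit message can be sketched
in user space as follows. This is an assumption-laden illustration, not the
batman-adv code: a pthread mutex stands in for neigh_list_lock, plain
pointers stand in for RCU-protected ones, and replace_router() is a
hypothetical name.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct router {
	int refcount;
};

struct orig_ifinfo {
	pthread_mutex_t lock;  /* stands in for neigh_list_lock */
	struct router *router; /* current best router, may change concurrently */
};

/* Install new_router and return the router that was actually replaced.
 * The old pointer is read while holding the lock, so the caller drops the
 * reference of the right object even if an earlier unlocked read is stale. */
static struct router *replace_router(struct orig_ifinfo *info,
				     struct router *new_router)
{
	struct router *old;

	assert(info != NULL);
	pthread_mutex_lock(&info->lock);
	old = info->router;
	info->router = new_router;
	pthread_mutex_unlock(&info->lock);
	return old;
}
```

A caller that sampled info->router before taking the lock, for example to
print a debug message, should release the pointer returned here rather than
its earlier snapshot.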


* Re: [PATCH 4/5] batman-adv: Reduce refcnt of removed router when updating route
From: Sven Eckelmann @ 2016-04-26 15:00 UTC
  To: Sergei Shtylyov
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, Antonio Quartulli,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, Marek Lindner
In-Reply-To: <83cedcb0-1090-2a0d-f3b8-c9c273c5f1d2-M4DtvfQ/ZS1MRgGoP+s0PdBPR1lH4CV8@public.gmane.org>

On Tuesday 26 April 2016 17:42:54 Sergei Shtylyov wrote:
> > _batadv_update_route rcu_derefences orig_ifinfo->router outside of a
> > spinlock protected region to print some information messages to the debug
> > log. But this pointer is not checked again when the new pointer is assigned
> > in the spinlock protected region. Thus is can happen that the value of
> 
>     Thus is can? :-)

Yes, my fault. s/is/it/.

[...]
> >  	spin_lock_bh(&orig_node->neigh_list_lock);
> > +	/* curr_router used earlier may not be the current orig_ifinfo->router
> > +	 * anymore because it was dereferenced outside of the neigh_list_lock
> > +	 * protected region. After the new best neighbor has replace the current
> 
>     Replaced.
> 
> [...]

This one looks like one of Marek's modifications [1] to the patch. But I would
guess that he has nothing against adding a 'd'.

Should Antonio resend all the patches or is a different approach preferred?

Kind regards,
	Sven

[1] https://patchwork.open-mesh.org/patch/15940/


* Re: [PATCH net v2 1/3] drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config
From: David Rivshin (Allworx) @ 2016-04-26 15:27 UTC
  To: Grygorii Strashko, David Miller; +Cc: netdev, linux-omap, Mugunthan V N
In-Reply-To: <571E6C7F.10704@ti.com>

On Mon, 25 Apr 2016 22:14:07 +0300
Grygorii Strashko <grygorii.strashko@ti.com> wrote:

> On 04/22/2016 04:03 PM, Grygorii Strashko wrote:
> > On 04/21/2016 09:19 PM, David Rivshin (Allworx) wrote:  
> >> From: David Rivshin <drivshin@allworx.com>
> >>
> >> Commit 9e42f715264ff158478fa30eaed847f6e131366b ("drivers: net: cpsw: add
> >> phy-handle parsing") saved the "phy-handle" phandle into a new cpsw_priv
> >> field. However, phy connections are per-slave, so the phy_node field 
> >> should
> >> be in cpsw_slave_data rather than cpsw_priv.
> >>
> >> This would go unnoticed in a single emac configuration. But in dual_emac
> >> mode, the last "phy-handle" property parsed for either slave would be 
> >> used
> >> by both of them, causing them both to refer to the same phy_device.
> >>
> >> Fixes: 9e42f715264f ("drivers: net: cpsw: add phy-handle parsing")
> >> Signed-off-by: David Rivshin <drivshin@allworx.com>
> >> Tested-by: Nicolas Chauvet <kwizart@gmail.com>
> >> ---
> >> I would suggest this for -stable. It should apply cleanly as far back
> >> as 4.4.
> >>
> >> Changes since v1 [1]:
> >> - Rebased (no conflicts)
> >> - Added Tested-by from Nicolas Chauvet
> >>
> >> [1] https://patchwork.ozlabs.org/patch/560326/  
> > 
> > Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>  
> 
> In my opinion, it will be good to have this patch merged as part of -rc cycle, since
> it will fix "NULL pointer dereference" issue with current LKML as reported by Andrew Goodbody.

Dave, 
If you'd like to take just this first patch while enhancements for patch 2
are worked out, I'd have no problem with that. I would then just submit the 
rest of the series separately. If I don't see that you've taken this by the
time I have a V3 ready I'll include it again, but it will be unchanged and 
you can still take it separately if you wish. Or I can resubmit this patch
separately if you prefer.
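
For anyone skimming the thread, the problem described in the quoted commit
message for patch 1 can be reduced to the sketch below. It is a toy model
under stated assumptions: plain ints stand in for device-tree node pointers,
and the struct and function names are hypothetical, not the cpsw driver's.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLAVES 2

/* Layout before the fix: one phy_node for the whole device. */
struct priv_shared {
	int phy_node;
};

/* Layout after the fix: one phy_node per slave. */
struct slave_data {
	int phy_node;
};

/* Each parsed slave overwrites the single shared field, so only the last
 * phy-handle survives. */
static void parse_shared(struct priv_shared *priv, const int *handles)
{
	size_t i;

	assert(priv != NULL && handles != NULL);
	for (i = 0; i < NUM_SLAVES; i++)
		priv->phy_node = handles[i];
}

/* Each slave keeps its own phy-handle. */
static void parse_per_slave(struct slave_data *slaves, const int *handles)
{
	size_t i;

	assert(slaves != NULL && handles != NULL);
	for (i = 0; i < NUM_SLAVES; i++)
		slaves[i].phy_node = handles[i];
}
```

With two distinct handles, the shared layout ends up holding only the second
one, which matches the dual_emac symptom of both slaves referring to the same
phy_device.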

(FYI, I tried to send this multiple times last night, but gmail has not 
been my friend lately. I ended up having to trim the CC list substantially
just to get this out at all; apologies for that. Suggestions for other 
email providers that are usable for patch submissions are welcome.)


* [PATCH net-next 3/3] qed: Add PF min bandwidth configuration support
From: Manish Chopra @ 2016-04-26 14:56 UTC
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <1461682570-745-1-git-send-email-manish.chopra@qlogic.com>

This patch adds support for PF minimum bandwidth update
or configuration notified by management firmware.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qed/qed_dev.c          | 71 ++++++++++++++++++++++
 drivers/net/ethernet/qlogic/qed/qed_hsi.h          |  2 +
 .../net/ethernet/qlogic/qed/qed_init_fw_funcs.c    | 15 +++++
 drivers/net/ethernet/qlogic/qed/qed_mcp.c          | 10 ++-
 drivers/net/ethernet/qlogic/qed/qed_mcp.h          |  7 +++
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h     |  1 +
 6 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 4e99108..b500c86 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -220,9 +220,13 @@ static int qed_init_qm_info(struct qed_hwfn *p_hwfn)
 
 	qm_info->start_vport = (u8)RESC_START(p_hwfn, QED_VPORT);
 
+	for (i = 0; i < qm_info->num_vports; i++)
+		qm_info->qm_vport_params[i].vport_wfq = 1;
+
 	qm_info->pf_wfq = 0;
 	qm_info->pf_rl = 0;
 	qm_info->vport_rl_en = 1;
+	qm_info->vport_wfq_en = 1;
 
 	return 0;
 
@@ -1841,3 +1845,70 @@ int qed_configure_pf_max_bandwidth(struct qed_dev *cdev, u8 max_bw)
 
 	return rc;
 }
+
+int __qed_configure_pf_min_bandwidth(struct qed_hwfn *p_hwfn,
+				     struct qed_ptt *p_ptt,
+				     struct qed_mcp_link_state *p_link,
+				     u8 min_bw)
+{
+	int rc = 0;
+
+	p_hwfn->mcp_info->func_info.bandwidth_min = min_bw;
+	p_hwfn->qm_info.pf_wfq = min_bw;
+
+	if (!p_link->line_speed)
+		return rc;
+
+	p_link->min_pf_rate = (p_link->line_speed * min_bw) / 100;
+
+	rc = qed_init_pf_wfq(p_hwfn, p_ptt, p_hwfn->rel_pf_id, min_bw);
+
+	DP_VERBOSE(p_hwfn, NETIF_MSG_LINK,
+		   "Configured MIN bandwidth to be %d Mb/sec\n",
+		   p_link->min_pf_rate);
+
+	return rc;
+}
+
+/* Main API to configure PF min bandwidth where bw range is [1-100] */
+int qed_configure_pf_min_bandwidth(struct qed_dev *cdev, u8 min_bw)
+{
+	int i, rc = -EINVAL;
+
+	if (min_bw < 1 || min_bw > 100) {
+		DP_NOTICE(cdev, "PF min bw valid range is [1-100]\n");
+		return rc;
+	}
+
+	for_each_hwfn(cdev, i) {
+		struct qed_hwfn *p_hwfn = &cdev->hwfns[i];
+		struct qed_hwfn *p_lead = QED_LEADING_HWFN(cdev);
+		struct qed_mcp_link_state *p_link;
+		struct qed_ptt *p_ptt;
+
+		p_link = &p_lead->mcp_info->link_output;
+
+		p_ptt = qed_ptt_acquire(p_hwfn);
+		if (!p_ptt)
+			return -EBUSY;
+
+		rc = __qed_configure_pf_min_bandwidth(p_hwfn, p_ptt,
+						      p_link, min_bw);
+		if (rc) {
+			qed_ptt_release(p_hwfn, p_ptt);
+			return rc;
+		}
+
+		if (p_link->min_pf_rate) {
+			u32 min_rate = p_link->min_pf_rate;
+
+			rc = __qed_configure_vp_wfq_on_link_change(p_hwfn,
+								   p_ptt,
+								   min_rate);
+		}
+
+		qed_ptt_release(p_hwfn, p_ptt);
+	}
+
+	return rc;
+}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 81cf625..5aa78a9 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -5116,6 +5116,8 @@ struct hw_set_image {
 	struct hw_set_info	hw_sets[1];
 };
 
+int qed_init_pf_wfq(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt,
+		    u8 pf_id, u16 pf_wfq);
 int qed_init_vport_wfq(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt,
 		       u16 first_tx_pq_id[NUM_OF_TCS], u16 vport_wfq);
 #endif
diff --git a/drivers/net/ethernet/qlogic/qed/qed_init_fw_funcs.c b/drivers/net/ethernet/qlogic/qed/qed_init_fw_funcs.c
index e646987..e8a3b9d 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_init_fw_funcs.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_init_fw_funcs.c
@@ -712,6 +712,21 @@ int qed_qm_pf_rt_init(struct qed_hwfn *p_hwfn,
 	return 0;
 }
 
+int qed_init_pf_wfq(struct qed_hwfn *p_hwfn,
+		    struct qed_ptt *p_ptt,
+		    u8 pf_id, u16 pf_wfq)
+{
+	u32 inc_val = QM_WFQ_INC_VAL(pf_wfq);
+
+	if (!inc_val || inc_val > QM_WFQ_MAX_INC_VAL) {
+		DP_NOTICE(p_hwfn, "Invalid PF WFQ weight configuration");
+		return -1;
+	}
+
+	qed_wr(p_hwfn, p_ptt, QM_REG_WFQPFWEIGHT + pf_id * 4, inc_val);
+	return 0;
+}
+
 int qed_init_pf_rl(struct qed_hwfn *p_hwfn,
 		   struct qed_ptt *p_ptt,
 		   u8 pf_id,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 578b09c..cb46dbd 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -472,8 +472,8 @@ static void qed_mcp_handle_link_change(struct qed_hwfn *p_hwfn,
 				       bool b_reset)
 {
 	struct qed_mcp_link_state *p_link;
+	u8 max_bw, min_bw;
 	u32 status = 0;
-	u8 max_bw;
 
 	p_link = &p_hwfn->mcp_info->link_output;
 	memset(p_link, 0, sizeof(*p_link));
@@ -534,10 +534,15 @@ static void qed_mcp_handle_link_change(struct qed_hwfn *p_hwfn,
 		p_link->line_speed = 0;
 
 	max_bw = p_hwfn->mcp_info->func_info.bandwidth_max;
+	min_bw = p_hwfn->mcp_info->func_info.bandwidth_min;
 
-	/* Correct speed according to bandwidth allocation */
+	/* Max bandwidth configuration */
 	__qed_configure_pf_max_bandwidth(p_hwfn, p_ptt, p_link, max_bw);
 
+	/* Min bandwidth configuration */
+	__qed_configure_pf_min_bandwidth(p_hwfn, p_ptt, p_link, min_bw);
+	qed_configure_vp_wfq_on_link_change(p_hwfn->cdev, p_link->min_pf_rate);
+
 	p_link->an = !!(status & LINK_STATUS_AUTO_NEGOTIATE_ENABLED);
 	p_link->an_complete = !!(status &
 				 LINK_STATUS_AUTO_NEGOTIATE_COMPLETE);
@@ -710,6 +715,7 @@ static void qed_mcp_update_bw(struct qed_hwfn *p_hwfn,
 
 	p_info = &p_hwfn->mcp_info->func_info;
 
+	qed_configure_pf_min_bandwidth(p_hwfn->cdev, p_info->bandwidth_min);
 	qed_configure_pf_max_bandwidth(p_hwfn->cdev, p_info->bandwidth_max);
 
 	/* Acknowledge the MFW */
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index 29a51ad..608bcb2 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -40,6 +40,8 @@ struct qed_mcp_link_capabilities {
 struct qed_mcp_link_state {
 	bool    link_up;
 
+	u32	min_pf_rate;
+
 	/* Actual link speed in Mb/s */
 	u32	line_speed;
 
@@ -394,9 +396,14 @@ int qed_mcp_reset(struct qed_hwfn *p_hwfn,
  * @return true iff MFW is running and mcp_info is initialized
  */
 bool qed_mcp_is_init(struct qed_hwfn *p_hwfn);
+int qed_configure_pf_min_bandwidth(struct qed_dev *cdev, u8 min_bw);
 int qed_configure_pf_max_bandwidth(struct qed_dev *cdev, u8 max_bw);
 int __qed_configure_pf_max_bandwidth(struct qed_hwfn *p_hwfn,
 				     struct qed_ptt *p_ptt,
 				     struct qed_mcp_link_state *p_link,
 				     u8 max_bw);
+int __qed_configure_pf_min_bandwidth(struct qed_hwfn *p_hwfn,
+				     struct qed_ptt *p_ptt,
+				     struct qed_mcp_link_state *p_link,
+				     u8 min_bw);
 #endif
diff --git a/drivers/net/ethernet/qlogic/qed/qed_reg_addr.h b/drivers/net/ethernet/qlogic/qed/qed_reg_addr.h
index d2f5730..bf4d7cc 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_reg_addr.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_reg_addr.h
@@ -458,5 +458,6 @@
 #define PBF_REG_NGE_COMP_VER			0xd80524UL
 #define PRS_REG_NGE_COMP_VER			0x1f0878UL
 
+#define QM_REG_WFQPFWEIGHT	0x2f4e80UL
 #define QM_REG_WFQVPWEIGHT	0x2fa000UL
 #endif
-- 
2.7.2
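
The rate arithmetic in __qed_configure_pf_min_bandwidth above can be sketched
in isolation as follows. pf_min_rate() is a hypothetical helper written for
this note, not part of the qed driver; the 64-bit intermediate is an addition
of this sketch to rule out overflow and is not in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Guaranteed PF rate in Mb/s for a min-bandwidth percentage, following the
 * expression in the patch: (line_speed * min_bw) / 100.
 * Returns 0 when the link is down (line_speed == 0). */
static uint32_t pf_min_rate(uint32_t line_speed_mbps, uint8_t min_bw_pct)
{
	/* Same [1-100] range the patch enforces in the public API. */
	assert(min_bw_pct >= 1 && min_bw_pct <= 100);
	return (uint32_t)(((uint64_t)line_speed_mbps * min_bw_pct) / 100);
}
```

For example, a 10000 Mb/s link with min_bw = 25 yields 2500 Mb/s. A link
speed of 0 yields 0, which corresponds to the case where the patch returns
before programming the WFQ weight.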


* [PATCH net-next 2/3] qed: Add PF max bandwidth configuration support
From: Manish Chopra @ 2016-04-26 14:56 UTC
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <1461682570-745-1-git-send-email-manish.chopra@qlogic.com>

This patch adds support for PF maximum bandwidth update
or configuration notified by management firmware.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qed/qed_dev.c |  68 ++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h |   2 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 138 ++++++++++++++++++------------
 drivers/net/ethernet/qlogic/qed/qed_mcp.h |  14 ++-
 4 files changed, 165 insertions(+), 57 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 28e0619..4e99108 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -579,7 +579,7 @@ static int qed_hw_init_pf(struct qed_hwfn *p_hwfn,
 			p_hwfn->qm_info.pf_wfq = p_info->bandwidth_min;
 
 		/* Update rate limit once we'll actually have a link */
-		p_hwfn->qm_info.pf_rl = 100;
+		p_hwfn->qm_info.pf_rl = 100000;
 	}
 
 	qed_cxt_hw_init_pf(p_hwfn);
@@ -1775,3 +1775,69 @@ void qed_configure_vp_wfq_on_link_change(struct qed_dev *cdev, u32 min_pf_rate)
 						      min_pf_rate);
 	}
 }
+
+int __qed_configure_pf_max_bandwidth(struct qed_hwfn *p_hwfn,
+				     struct qed_ptt *p_ptt,
+				     struct qed_mcp_link_state *p_link,
+				     u8 max_bw)
+{
+	int rc = 0;
+
+	p_hwfn->mcp_info->func_info.bandwidth_max = max_bw;
+
+	if (!p_link->line_speed && (max_bw != 100))
+		return rc;
+
+	p_link->speed = (p_link->line_speed * max_bw) / 100;
+	p_hwfn->qm_info.pf_rl = p_link->speed;
+
+	/* Since the limiter also affects Tx-switched traffic, we don't want it
+	 * to limit such traffic in case there's no actual limit.
+	 * In that case, set limit to imaginary high boundary.
+	 */
+	if (max_bw == 100)
+		p_hwfn->qm_info.pf_rl = 100000;
+
+	rc = qed_init_pf_rl(p_hwfn, p_ptt, p_hwfn->rel_pf_id,
+			    p_hwfn->qm_info.pf_rl);
+
+	DP_VERBOSE(p_hwfn, NETIF_MSG_LINK,
+		   "Configured MAX bandwidth to be %08x Mb/sec\n",
+		   p_link->speed);
+
+	return rc;
+}
+
+/* Main API to configure PF max bandwidth where bw range is [1 - 100] */
+int qed_configure_pf_max_bandwidth(struct qed_dev *cdev, u8 max_bw)
+{
+	int i, rc = -EINVAL;
+
+	if (max_bw < 1 || max_bw > 100) {
+		DP_NOTICE(cdev, "PF max bw valid range is [1-100]\n");
+		return rc;
+	}
+
+	for_each_hwfn(cdev, i) {
+		struct qed_hwfn	*p_hwfn = &cdev->hwfns[i];
+		struct qed_hwfn *p_lead = QED_LEADING_HWFN(cdev);
+		struct qed_mcp_link_state *p_link;
+		struct qed_ptt *p_ptt;
+
+		p_link = &p_lead->mcp_info->link_output;
+
+		p_ptt = qed_ptt_acquire(p_hwfn);
+		if (!p_ptt)
+			return -EBUSY;
+
+		rc = __qed_configure_pf_max_bandwidth(p_hwfn, p_ptt,
+						      p_link, max_bw);
+
+		qed_ptt_release(p_hwfn, p_ptt);
+
+		if (rc)
+			break;
+	}
+
+	return rc;
+}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 7d5ed0c..81cf625 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -3837,7 +3837,7 @@ struct public_drv_mb {
 
 #define DRV_MSG_CODE_SET_LLDP                   0x24000000
 #define DRV_MSG_CODE_SET_DCBX                   0x25000000
-
+#define DRV_MSG_CODE_BW_UPDATE_ACK		0x32000000
 #define DRV_MSG_CODE_NIG_DRAIN                  0x30000000
 
 #define DRV_MSG_CODE_INITIATE_FLR               0x02000000
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index b89c9a8..578b09c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -473,6 +473,7 @@ static void qed_mcp_handle_link_change(struct qed_hwfn *p_hwfn,
 {
 	struct qed_mcp_link_state *p_link;
 	u32 status = 0;
+	u8 max_bw;
 
 	p_link = &p_hwfn->mcp_info->link_output;
 	memset(p_link, 0, sizeof(*p_link));
@@ -527,17 +528,15 @@ static void qed_mcp_handle_link_change(struct qed_hwfn *p_hwfn,
 		p_link->speed = 0;
 	}
 
+	if (p_link->link_up && p_link->speed)
+		p_link->line_speed = p_link->speed;
+	else
+		p_link->line_speed = 0;
+
+	max_bw = p_hwfn->mcp_info->func_info.bandwidth_max;
+
 	/* Correct speed according to bandwidth allocation */
-	if (p_hwfn->mcp_info->func_info.bandwidth_max && p_link->speed) {
-		p_link->speed = p_link->speed *
-				p_hwfn->mcp_info->func_info.bandwidth_max /
-				100;
-		qed_init_pf_rl(p_hwfn, p_ptt, p_hwfn->rel_pf_id,
-			       p_link->speed);
-		DP_VERBOSE(p_hwfn, NETIF_MSG_LINK,
-			   "Configured MAX bandwidth to be %08x Mb/sec\n",
-			   p_link->speed);
-	}
+	__qed_configure_pf_max_bandwidth(p_hwfn, p_ptt, p_link, max_bw);
 
 	p_link->an = !!(status & LINK_STATUS_AUTO_NEGOTIATE_ENABLED);
 	p_link->an_complete = !!(status &
@@ -648,6 +647,76 @@ int qed_mcp_set_link(struct qed_hwfn *p_hwfn,
 	return 0;
 }
 
+static void qed_read_pf_bandwidth(struct qed_hwfn *p_hwfn,
+				  struct public_func *p_shmem_info)
+{
+	struct qed_mcp_function_info *p_info;
+
+	p_info = &p_hwfn->mcp_info->func_info;
+
+	p_info->bandwidth_min = (p_shmem_info->config &
+				 FUNC_MF_CFG_MIN_BW_MASK) >>
+					FUNC_MF_CFG_MIN_BW_SHIFT;
+	if (p_info->bandwidth_min < 1 || p_info->bandwidth_min > 100) {
+		DP_INFO(p_hwfn,
+			"bandwidth minimum out of bounds [%02x]. Set to 1\n",
+			p_info->bandwidth_min);
+		p_info->bandwidth_min = 1;
+	}
+
+	p_info->bandwidth_max = (p_shmem_info->config &
+				 FUNC_MF_CFG_MAX_BW_MASK) >>
+					FUNC_MF_CFG_MAX_BW_SHIFT;
+	if (p_info->bandwidth_max < 1 || p_info->bandwidth_max > 100) {
+		DP_INFO(p_hwfn,
+			"bandwidth maximum out of bounds [%02x]. Set to 100\n",
+			p_info->bandwidth_max);
+		p_info->bandwidth_max = 100;
+	}
+}
+
+static u32 qed_mcp_get_shmem_func(struct qed_hwfn *p_hwfn,
+				  struct qed_ptt *p_ptt,
+				  struct public_func *p_data,
+				  int pfid)
+{
+	u32 addr = SECTION_OFFSIZE_ADDR(p_hwfn->mcp_info->public_base,
+					PUBLIC_FUNC);
+	u32 mfw_path_offsize = qed_rd(p_hwfn, p_ptt, addr);
+	u32 func_addr = SECTION_ADDR(mfw_path_offsize, pfid);
+	u32 i, size;
+
+	memset(p_data, 0, sizeof(*p_data));
+
+	size = min_t(u32, sizeof(*p_data),
+		     QED_SECTION_SIZE(mfw_path_offsize));
+	for (i = 0; i < size / sizeof(u32); i++)
+		((u32 *)p_data)[i] = qed_rd(p_hwfn, p_ptt,
+					    func_addr + (i << 2));
+	return size;
+}
+
+static void qed_mcp_update_bw(struct qed_hwfn *p_hwfn,
+			      struct qed_ptt *p_ptt)
+{
+	struct qed_mcp_function_info *p_info;
+	struct public_func shmem_info;
+	u32 resp = 0, param = 0;
+
+	qed_mcp_get_shmem_func(p_hwfn, p_ptt, &shmem_info,
+			       MCP_PF_ID(p_hwfn));
+
+	qed_read_pf_bandwidth(p_hwfn, &shmem_info);
+
+	p_info = &p_hwfn->mcp_info->func_info;
+
+	qed_configure_pf_max_bandwidth(p_hwfn->cdev, p_info->bandwidth_max);
+
+	/* Acknowledge the MFW */
+	qed_mcp_cmd(p_hwfn, p_ptt, DRV_MSG_CODE_BW_UPDATE_ACK, 0, &resp,
+		    &param);
+}
+
 int qed_mcp_handle_events(struct qed_hwfn *p_hwfn,
 			  struct qed_ptt *p_ptt)
 {
@@ -679,6 +748,9 @@ int qed_mcp_handle_events(struct qed_hwfn *p_hwfn,
 		case MFW_DRV_MSG_TRANSCEIVER_STATE_CHANGE:
 			qed_mcp_handle_transceiver_change(p_hwfn, p_ptt);
 			break;
+		case MFW_DRV_MSG_BW_UPDATE:
+			qed_mcp_update_bw(p_hwfn, p_ptt);
+			break;
 		default:
 			DP_NOTICE(p_hwfn, "Unimplemented MFW message %d\n", i);
 			rc = -EINVAL;
@@ -758,28 +830,6 @@ int qed_mcp_get_media_type(struct qed_dev *cdev,
 	return 0;
 }
 
-static u32 qed_mcp_get_shmem_func(struct qed_hwfn *p_hwfn,
-				  struct qed_ptt *p_ptt,
-				  struct public_func *p_data,
-				  int pfid)
-{
-	u32 addr = SECTION_OFFSIZE_ADDR(p_hwfn->mcp_info->public_base,
-					PUBLIC_FUNC);
-	u32 mfw_path_offsize = qed_rd(p_hwfn, p_ptt, addr);
-	u32 func_addr = SECTION_ADDR(mfw_path_offsize, pfid);
-	u32 i, size;
-
-	memset(p_data, 0, sizeof(*p_data));
-
-	size = min_t(u32, sizeof(*p_data),
-		     QED_SECTION_SIZE(mfw_path_offsize));
-	for (i = 0; i < size / sizeof(u32); i++)
-		((u32 *)p_data)[i] = qed_rd(p_hwfn, p_ptt,
-					    func_addr + (i << 2));
-
-	return size;
-}
-
 static int
 qed_mcp_get_shmem_proto(struct qed_hwfn *p_hwfn,
 			struct public_func *p_info,
@@ -818,26 +868,7 @@ int qed_mcp_fill_shmem_func_info(struct qed_hwfn *p_hwfn,
 		return -EINVAL;
 	}
 
-
-	info->bandwidth_min = (shmem_info.config &
-			       FUNC_MF_CFG_MIN_BW_MASK) >>
-			      FUNC_MF_CFG_MIN_BW_SHIFT;
-	if (info->bandwidth_min < 1 || info->bandwidth_min > 100) {
-		DP_INFO(p_hwfn,
-			"bandwidth minimum out of bounds [%02x]. Set to 1\n",
-			info->bandwidth_min);
-		info->bandwidth_min = 1;
-	}
-
-	info->bandwidth_max = (shmem_info.config &
-			       FUNC_MF_CFG_MAX_BW_MASK) >>
-			      FUNC_MF_CFG_MAX_BW_SHIFT;
-	if (info->bandwidth_max < 1 || info->bandwidth_max > 100) {
-		DP_INFO(p_hwfn,
-			"bandwidth maximum out of bounds [%02x]. Set to 100\n",
-			info->bandwidth_max);
-		info->bandwidth_max = 100;
-	}
+	qed_read_pf_bandwidth(p_hwfn, &shmem_info);
 
 	if (shmem_info.mac_upper || shmem_info.mac_lower) {
 		info->mac[0] = (u8)(shmem_info.mac_upper >> 8);
@@ -938,9 +969,10 @@ qed_mcp_send_drv_version(struct qed_hwfn *p_hwfn,
 
 	p_drv_version = &union_data.drv_version;
 	p_drv_version->version = p_ver->version;
+
 	for (i = 0; i < MCP_DRV_VER_STR_SIZE - 1; i += 4) {
 		val = cpu_to_be32(p_ver->name[i]);
-		*(u32 *)&p_drv_version->name[i * sizeof(u32)] = val;
+		*(__be32 *)&p_drv_version->name[i * sizeof(u32)] = val;
 	}
 
 	memset(&mb_params, 0, sizeof(mb_params));
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.h b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
index 50917a2..29a51ad 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.h
@@ -40,7 +40,13 @@ struct qed_mcp_link_capabilities {
 struct qed_mcp_link_state {
 	bool    link_up;
 
-	u32     speed; /* In Mb/s */
+	/* Actual link speed in Mb/s */
+	u32	line_speed;
+
+	/* PF max speed in Mb/s, deduced from line_speed
+	 * according to PF max bandwidth configuration.
+	 */
+	u32     speed;
 	bool    full_duplex;
 
 	bool    an;
@@ -388,5 +394,9 @@ int qed_mcp_reset(struct qed_hwfn *p_hwfn,
  * @return true iff MFW is running and mcp_info is initialized
  */
 bool qed_mcp_is_init(struct qed_hwfn *p_hwfn);
-
+int qed_configure_pf_max_bandwidth(struct qed_dev *cdev, u8 max_bw);
+int __qed_configure_pf_max_bandwidth(struct qed_hwfn *p_hwfn,
+				     struct qed_ptt *p_ptt,
+				     struct qed_mcp_link_state *p_link,
+				     u8 max_bw);
 #endif
-- 
2.7.2
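
The bounds handling that the hunk above moves into qed_read_pf_bandwidth() can be sketched in plain userspace C. This is a minimal sketch under stated assumptions: bw_sanitize() and pf_max_speed() are hypothetical names, and the scaling formula (line speed times the max bandwidth percentage) is assumed from the new comment on the speed field, since the helper itself is not shown in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: accept a bandwidth percentage in [1, 100] and
 * substitute a fallback otherwise, as the removed code did with 1 for
 * the minimum and 100 for the maximum.
 */
static inline uint8_t bw_sanitize(uint32_t raw, uint8_t fallback)
{
	assert(fallback >= 1 && fallback <= 100);
	return (raw < 1 || raw > 100) ? fallback : (uint8_t)raw;
}

/*
 * Assumed formula, not taken from the patch: the PF max speed is the
 * line speed scaled by the max bandwidth percentage.
 */
static inline uint32_t pf_max_speed(uint32_t line_speed, uint8_t max_bw)
{
	return (uint32_t)(((uint64_t)line_speed * max_bw) / 100);
}
```

Under that assumption, a 25000 Mb/s link with a 40 percent cap would report 10000 Mb/s as the PF speed.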

* [PATCH net-next 1/3] qed: Add vport WFQ configuration APIs
From: Manish Chopra @ 2016-04-26 14:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <1461682570-745-1-git-send-email-manish.chopra@qlogic.com>

This patch adds the APIs needed to configure WFQ
(weighted fair queueing) values for the vports. WFQ configuration
is applied on a per-vport basis when the management firmware
notifies the PF of a minimum bandwidth update/configuration.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
---
 drivers/net/ethernet/qlogic/qed/qed.h              |  11 ++
 drivers/net/ethernet/qlogic/qed/qed_dev.c          | 188 ++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h          |   2 +
 .../net/ethernet/qlogic/qed/qed_init_fw_funcs.c    |  25 +++
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h     |   1 +
 5 files changed, 223 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h b/drivers/net/ethernet/qlogic/qed/qed.h
index 33e2ed6..cceac32 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -32,6 +32,8 @@ extern const struct qed_common_ops qed_common_ops_pass;
 #define NAME_SIZE 16
 #define VER_SIZE 16
 
+#define QED_WFQ_UNIT	100
+
 /* cau states */
 enum qed_coalescing_mode {
 	QED_COAL_MODE_DISABLE,
@@ -237,6 +239,12 @@ struct qed_dmae_info {
 	struct dmae_cmd *p_dmae_cmd;
 };
 
+struct qed_wfq_data {
+	/* when feature is configured for at least 1 vport */
+	u32	min_speed;
+	bool	configured;
+};
+
 struct qed_qm_info {
 	struct init_qm_pq_params	*qm_pq_params;
 	struct init_qm_vport_params	*qm_vport_params;
@@ -257,6 +265,7 @@ struct qed_qm_info {
 	bool				vport_wfq_en;
 	u8				pf_wfq;
 	u32				pf_rl;
+	struct qed_wfq_data		*wfq_data;
 };
 
 struct storm_stats {
@@ -526,6 +535,8 @@ static inline u8 qed_concrete_to_sw_fid(struct qed_dev *cdev,
 
 #define PURE_LB_TC 8
 
+void qed_configure_vp_wfq_on_link_change(struct qed_dev *cdev, u32 min_pf_rate);
+
 #define QED_LEADING_HWFN(dev)   (&dev->hwfns[0])
 
 /* Other Linux specific common definitions */
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index bdae5a5..28e0619 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -105,6 +105,8 @@ static void qed_qm_info_free(struct qed_hwfn *p_hwfn)
 	qm_info->qm_vport_params = NULL;
 	kfree(qm_info->qm_port_params);
 	qm_info->qm_port_params = NULL;
+	kfree(qm_info->wfq_data);
+	qm_info->wfq_data = NULL;
 }
 
 void qed_resc_free(struct qed_dev *cdev)
@@ -175,6 +177,11 @@ static int qed_init_qm_info(struct qed_hwfn *p_hwfn)
 	if (!qm_info->qm_port_params)
 		goto alloc_err;
 
+	qm_info->wfq_data = kcalloc(num_vports, sizeof(*qm_info->wfq_data),
+				    GFP_KERNEL);
+	if (!qm_info->wfq_data)
+		goto alloc_err;
+
 	vport_id = (u8)RESC_START(p_hwfn, QED_VPORT);
 
 	/* First init per-TC PQs */
@@ -221,10 +228,7 @@ static int qed_init_qm_info(struct qed_hwfn *p_hwfn)
 
 alloc_err:
 	DP_NOTICE(p_hwfn, "Failed to allocate memory for QM params\n");
-	kfree(qm_info->qm_pq_params);
-	kfree(qm_info->qm_vport_params);
-	kfree(qm_info->qm_port_params);
-
+	qed_qm_info_free(p_hwfn);
 	return -ENOMEM;
 }
 
@@ -1595,3 +1599,179 @@ int qed_fw_rss_eng(struct qed_hwfn *p_hwfn,
 
 	return 0;
 }
+
+/* Calculate final WFQ values for all vports and configure them.
+ * After this configuration each vport will have
+ * approx min rate =  min_pf_rate * (vport_wfq / QED_WFQ_UNIT)
+ */
+static void qed_configure_wfq_for_all_vports(struct qed_hwfn *p_hwfn,
+					     struct qed_ptt *p_ptt,
+					     u32 min_pf_rate)
+{
+	struct init_qm_vport_params *vport_params;
+	int i;
+
+	vport_params = p_hwfn->qm_info.qm_vport_params;
+
+	for (i = 0; i < p_hwfn->qm_info.num_vports; i++) {
+		u32 wfq_speed = p_hwfn->qm_info.wfq_data[i].min_speed;
+
+		vport_params[i].vport_wfq = (wfq_speed * QED_WFQ_UNIT) /
+						min_pf_rate;
+		qed_init_vport_wfq(p_hwfn, p_ptt,
+				   vport_params[i].first_tx_pq_id,
+				   vport_params[i].vport_wfq);
+	}
+}
+
+static void qed_init_wfq_default_param(struct qed_hwfn *p_hwfn,
+				       u32 min_pf_rate)
+
+{
+	int i;
+
+	for (i = 0; i < p_hwfn->qm_info.num_vports; i++)
+		p_hwfn->qm_info.qm_vport_params[i].vport_wfq = 1;
+}
+
+static void qed_disable_wfq_for_all_vports(struct qed_hwfn *p_hwfn,
+					   struct qed_ptt *p_ptt,
+					   u32 min_pf_rate)
+{
+	struct init_qm_vport_params *vport_params;
+	int i;
+
+	vport_params = p_hwfn->qm_info.qm_vport_params;
+
+	for (i = 0; i < p_hwfn->qm_info.num_vports; i++) {
+		qed_init_wfq_default_param(p_hwfn, min_pf_rate);
+		qed_init_vport_wfq(p_hwfn, p_ptt,
+				   vport_params[i].first_tx_pq_id,
+				   vport_params[i].vport_wfq);
+	}
+}
+
+/* This function performs several validations for WFQ
+ * configuration and required min rate for a given vport
+ * 1. req_rate must be greater than one percent of min_pf_rate.
+ * 2. req_rate should not cause other vports [not configured for WFQ explicitly]
+ *    rates to get less than one percent of min_pf_rate.
+ * 3. total_req_min_rate [all vports min rate sum] shouldn't exceed min_pf_rate.
+ */
+static int qed_init_wfq_param(struct qed_hwfn *p_hwfn,
+			      u16 vport_id, u32 req_rate,
+			      u32 min_pf_rate)
+{
+	u32 total_req_min_rate = 0, total_left_rate = 0, left_rate_per_vp = 0;
+	int non_requested_count = 0, req_count = 0, i, num_vports;
+
+	num_vports = p_hwfn->qm_info.num_vports;
+
+	/* Accounting for the vports which are configured for WFQ explicitly */
+	for (i = 0; i < num_vports; i++) {
+		u32 tmp_speed;
+
+		if ((i != vport_id) &&
+		    p_hwfn->qm_info.wfq_data[i].configured) {
+			req_count++;
+			tmp_speed = p_hwfn->qm_info.wfq_data[i].min_speed;
+			total_req_min_rate += tmp_speed;
+		}
+	}
+
+	/* Include current vport data as well */
+	req_count++;
+	total_req_min_rate += req_rate;
+	non_requested_count = num_vports - req_count;
+
+	if (req_rate < min_pf_rate / QED_WFQ_UNIT) {
+		DP_VERBOSE(p_hwfn, NETIF_MSG_LINK,
+			   "Vport [%d] - Requested rate[%d Mbps] is less than one percent of configured PF min rate[%d Mbps]\n",
+			   vport_id, req_rate, min_pf_rate);
+		return -EINVAL;
+	}
+
+	if (num_vports > QED_WFQ_UNIT) {
+		DP_VERBOSE(p_hwfn, NETIF_MSG_LINK,
+			   "Number of vports is greater than %d\n",
+			   QED_WFQ_UNIT);
+		return -EINVAL;
+	}
+
+	if (total_req_min_rate > min_pf_rate) {
+		DP_VERBOSE(p_hwfn, NETIF_MSG_LINK,
+			   "Total requested min rate for all vports[%d Mbps] is greater than configured PF min rate[%d Mbps]\n",
+			   total_req_min_rate, min_pf_rate);
+		return -EINVAL;
+	}
+
+	total_left_rate	= min_pf_rate - total_req_min_rate;
+
+	left_rate_per_vp = total_left_rate / non_requested_count;
+	if (left_rate_per_vp <  min_pf_rate / QED_WFQ_UNIT) {
+		DP_VERBOSE(p_hwfn, NETIF_MSG_LINK,
+			   "Non WFQ configured vports rate [%d Mbps] is less than one percent of configured PF min rate[%d Mbps]\n",
+			   left_rate_per_vp, min_pf_rate);
+		return -EINVAL;
+	}
+
+	p_hwfn->qm_info.wfq_data[vport_id].min_speed = req_rate;
+	p_hwfn->qm_info.wfq_data[vport_id].configured = true;
+
+	for (i = 0; i < num_vports; i++) {
+		if (p_hwfn->qm_info.wfq_data[i].configured)
+			continue;
+
+		p_hwfn->qm_info.wfq_data[i].min_speed = left_rate_per_vp;
+	}
+
+	return 0;
+}
+
+static int __qed_configure_vp_wfq_on_link_change(struct qed_hwfn *p_hwfn,
+						 struct qed_ptt *p_ptt,
+						 u32 min_pf_rate)
+{
+	bool use_wfq = false;
+	int rc = 0;
+	u16 i;
+
+	/* Validate all pre configured vports for wfq */
+	for (i = 0; i < p_hwfn->qm_info.num_vports; i++) {
+		u32 rate;
+
+		if (!p_hwfn->qm_info.wfq_data[i].configured)
+			continue;
+
+		rate = p_hwfn->qm_info.wfq_data[i].min_speed;
+		use_wfq = true;
+
+		rc = qed_init_wfq_param(p_hwfn, i, rate, min_pf_rate);
+		if (rc) {
+			DP_NOTICE(p_hwfn,
+				  "WFQ validation failed while configuring min rate\n");
+			break;
+		}
+	}
+
+	if (!rc && use_wfq)
+		qed_configure_wfq_for_all_vports(p_hwfn, p_ptt, min_pf_rate);
+	else
+		qed_disable_wfq_for_all_vports(p_hwfn, p_ptt, min_pf_rate);
+
+	return rc;
+}
+
+/* API to configure WFQ from mcp link change */
+void qed_configure_vp_wfq_on_link_change(struct qed_dev *cdev, u32 min_pf_rate)
+{
+	int i;
+
+	for_each_hwfn(cdev, i) {
+		struct qed_hwfn *p_hwfn = &cdev->hwfns[i];
+
+		__qed_configure_vp_wfq_on_link_change(p_hwfn,
+						      p_hwfn->p_dpc_ptt,
+						      min_pf_rate);
+	}
+}
diff --git a/drivers/net/ethernet/qlogic/qed/qed_hsi.h b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
index 15e02ab..7d5ed0c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_hsi.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_hsi.h
@@ -5116,4 +5116,6 @@ struct hw_set_image {
 	struct hw_set_info	hw_sets[1];
 };
 
+int qed_init_vport_wfq(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt,
+		       u16 first_tx_pq_id[NUM_OF_TCS], u16 vport_wfq);
 #endif
diff --git a/drivers/net/ethernet/qlogic/qed/qed_init_fw_funcs.c b/drivers/net/ethernet/qlogic/qed/qed_init_fw_funcs.c
index 1dd5324..e646987 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_init_fw_funcs.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_init_fw_funcs.c
@@ -732,6 +732,31 @@ int qed_init_pf_rl(struct qed_hwfn *p_hwfn,
 	return 0;
 }
 
+int qed_init_vport_wfq(struct qed_hwfn *p_hwfn,
+		       struct qed_ptt *p_ptt,
+		       u16 first_tx_pq_id[NUM_OF_TCS],
+		       u16 vport_wfq)
+{
+	u32 inc_val = QM_WFQ_INC_VAL(vport_wfq);
+	u8 tc;
+
+	if (!inc_val || inc_val > QM_WFQ_MAX_INC_VAL) {
+		DP_NOTICE(p_hwfn, "Invalid VPORT WFQ weight configuration\n");
+		return -1;
+	}
+
+	for (tc = 0; tc < NUM_OF_TCS; tc++) {
+		u16 vport_pq_id = first_tx_pq_id[tc];
+
+		if (vport_pq_id != QM_INVALID_PQ_ID)
+			qed_wr(p_hwfn, p_ptt,
+			       QM_REG_WFQVPWEIGHT + vport_pq_id * 4,
+			       inc_val);
+	}
+
+	return 0;
+}
+
 int qed_init_vport_rl(struct qed_hwfn *p_hwfn,
 		      struct qed_ptt *p_ptt,
 		      u8 vport_id,
diff --git a/drivers/net/ethernet/qlogic/qed/qed_reg_addr.h b/drivers/net/ethernet/qlogic/qed/qed_reg_addr.h
index 55451a4..d2f5730 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_reg_addr.h
+++ b/drivers/net/ethernet/qlogic/qed/qed_reg_addr.h
@@ -458,4 +458,5 @@
 #define PBF_REG_NGE_COMP_VER			0xd80524UL
 #define PRS_REG_NGE_COMP_VER			0x1f0878UL
 
+#define QM_REG_WFQVPWEIGHT	0x2fa000UL
 #endif
-- 
2.7.2
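
The weight arithmetic and the validation rules described in the comments of qed_configure_wfq_for_all_vports() and qed_init_wfq_param() can be sketched in userspace C as follows. This is only an illustration: wfq_weight() and wfq_request_ok() are hypothetical names, the per-vport bookkeeping is collapsed into two summary arguments, and the explicit guard against dividing by zero is an addition of this sketch. The patch as posted appears to divide by non_requested_count unconditionally, which would be zero once every vport has been configured explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WFQ_UNIT 100u	/* mirrors QED_WFQ_UNIT from the patch */

/* Weight of one vport: its share of min_pf_rate in units of one percent. */
static inline uint32_t wfq_weight(uint32_t min_speed, uint32_t min_pf_rate)
{
	assert(min_pf_rate > 0);
	return (uint32_t)(((uint64_t)min_speed * WFQ_UNIT) / min_pf_rate);
}

/*
 * Hypothetical check of a new request.
 * others_rate:  sum of min rates of the other explicitly configured vports
 * others_count: how many such vports there are
 */
static inline bool wfq_request_ok(uint32_t req_rate, uint32_t others_rate,
				  uint32_t others_count, uint32_t num_vports,
				  uint32_t min_pf_rate)
{
	uint32_t one_percent = min_pf_rate / WFQ_UNIT;
	uint64_t total = (uint64_t)req_rate + others_rate;
	uint32_t unconfigured;

	if (num_vports > WFQ_UNIT || others_count >= num_vports)
		return false;
	/* Rules 1 and 3: at least one percent, and the sum must fit. */
	if (req_rate < one_percent || total > min_pf_rate)
		return false;

	/* Guard added in this sketch: no unconfigured vports left. */
	unconfigured = num_vports - (others_count + 1);
	if (unconfigured == 0)
		return true;

	/* Rule 2: the remaining vports each keep at least one percent. */
	return (min_pf_rate - total) / unconfigured >= one_percent;
}
```

For example, with min_pf_rate = 10000 Mb/s, a vport asking for 2500 Mb/s would get weight 25, and with four vports the three unconfigured ones would each be left 2500 Mb/s.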

* [PATCH net-next 0/3] qed: Bandwidth configuration support
From: Manish Chopra @ 2016-04-26 14:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz

Hi David,

This series adds driver support for min/max bandwidth configuration
of the PF, applied on a link change notification or on an explicit
bandwidth update request from the MFW [management firmware].

The same infrastructure will later be used by user-based flows
[for example, rate shaping for the VFs].

Please consider applying this series to "net-next".

Thanks,
Manish

Manish Chopra (3):
  qed: Add vport WFQ configuration APIs
  qed: Add PF max bandwidth configuration support
  qed: Add PF min bandwidth configuration support

 drivers/net/ethernet/qlogic/qed/qed.h              |  11 +
 drivers/net/ethernet/qlogic/qed/qed_dev.c          | 327 ++++++++++++++++++++-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h          |   6 +-
 .../net/ethernet/qlogic/qed/qed_init_fw_funcs.c    |  40 +++
 drivers/net/ethernet/qlogic/qed/qed_mcp.c          | 146 +++++----
 drivers/net/ethernet/qlogic/qed/qed_mcp.h          |  21 +-
 drivers/net/ethernet/qlogic/qed/qed_reg_addr.h     |   2 +
 7 files changed, 491 insertions(+), 62 deletions(-)

-- 
2.7.2

* Re: [PATCH] net/mlx5e: avoid stack overflow in mlx5e_open_channels
From: Arnd Bergmann @ 2016-04-26 15:49 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, Matan Barak, Leon Romanovsky, David S. Miller,
	Achiad Shochat, Or Gerlitz, Amir Vadai, Tariq Toukan,
	Linux Netdev List, linux-rdma, linux-kernel
In-Reply-To: <CALzJLG-d1mTs9SdubELzQtczsiVY0cgzBc3BC6wJoBgx4Wg_+Q@mail.gmail.com>

On Tuesday 26 April 2016 17:41:45 Saeed Mahameed wrote:
> On Tue, Apr 26, 2016 at 4:53 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > Sure, do you want to just edit this when you forward the patch, or
> > do you need me to do it?
> >
> 
> Well, I won't say no if you want to do it 
> 

All I want is to get rid of the patch in my queue. I guess it's
worth the 10 minutes of work ;-)

v2 coming

	Arnd

* Re: [net-next PATCH 6/8] mlx4: Add support for inner IPv6 checksum offloads and TSO
From: Alex Duyck @ 2016-04-26 15:50 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: talal, Linux Kernel Network Developers, David Miller, galp,
	ogerlitz, Eran Ben Elisha
In-Reply-To: <571F7D14.7090504@dev.mellanox.co.il>

On Tue, Apr 26, 2016 at 7:37 AM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
>
>
> On 4/25/2016 9:31 PM, Alexander Duyck wrote:
>>
>> From what I can tell the ConnectX-3 will support an inner IPv6 checksum
>> and
>> segmentation offload, however it cannot support outer IPv6 headers.  For
>> this reason I am adding the feature to the hw_enc_features and adding an
>> extra check to the features_check call that will disable GSO and checksum
>> offload in the case that the encapsulated frame has an outer IP version
>> that is not 4.
>
>
> Hi Alex,
>
> Can you share the testing commands of running vxlan over IPv6 and what
> exactly didn't work for you ?
> we would like to test this in house and understand what went wrong,
> theoretically there shouldn't be a difference between IPv6/IPv4 outer
> checksum offloading in ConnectX-3.

The setup is pretty straightforward.  Basically I left the first port
in the default namespace and moved the second into a secondary
namespace referred to below as $netns.  I then assigned the IPv6
addresses fec0::10:1 and fec0::10:2. After that I ran the following:

        VXLAN=vx$net
        echo $VXLAN ${test_options[$i]}
        ip link add $VXLAN type vxlan id $net \
                local fec0::10:1 remote $addr6 dev $PF0 \
                ${test_options[$i]} dstport `expr 8800 + $net`
        ip netns exec $netns ip link add $VXLAN type vxlan id $net \
                                  local $addr6 remote fec0::10:1 dev $port \
                                  ${test_options[$i]} dstport `expr 8800 + $net`
        ifconfig $VXLAN 192.168.${net}.1/24
        ip netns exec $netns ifconfig $VXLAN 192.168.${net}.2/24


> Anyway, I suspect it might be related to a driver bug most likely in
> get_real_size function @en_tx.c
> specifically in : *lso_header_size = (skb_inner_transport_header(skb) -
> skb->data) + inner_tcp_hdrlen(skb);
>
> will check this and get back to you.

I'm not entirely convinced.  What I was seeing is that the hardware
itself was performing Rx checksum offload only on tunnels with an
outer IPv4 header and ignoring tunnels with an outer IPv6 header.

> for the mlx5 patches I will also go through them later today.

Thanks.

>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>> ---
>>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   25
>> +++++++++++++++++++-----
>>   drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   15 ++++++++++++--
>>   2 files changed, 33 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>> b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>> index bce37cbfde24..6f28ac58251c 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
>> @@ -2357,8 +2357,10 @@ out:
>>         }
>>         /* set offloads */
>> -       priv->dev->hw_enc_features |= NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
>> -                                     NETIF_F_TSO | NETIF_F_GSO_UDP_TUNNEL
>> |
>> +       priv->dev->hw_enc_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
>> |
>> +                                     NETIF_F_RXCSUM |
>> +                                     NETIF_F_TSO | NETIF_F_TSO6 |
>> +                                     NETIF_F_GSO_UDP_TUNNEL |
>>                                       NETIF_F_GSO_UDP_TUNNEL_CSUM |
>>                                       NETIF_F_GSO_PARTIAL;
>>   }
>> @@ -2369,8 +2371,10 @@ static void mlx4_en_del_vxlan_offloads(struct
>> work_struct *work)
>>         struct mlx4_en_priv *priv = container_of(work, struct
>> mlx4_en_priv,
>>                                                  vxlan_del_task);
>>         /* unset offloads */
>> -       priv->dev->hw_enc_features &= ~(NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
>> -                                       NETIF_F_TSO |
>> NETIF_F_GSO_UDP_TUNNEL |
>> +       priv->dev->hw_enc_features &= ~(NETIF_F_IP_CSUM |
>> NETIF_F_IPV6_CSUM |
>> +                                       NETIF_F_RXCSUM |
>> +                                       NETIF_F_TSO | NETIF_F_TSO6 |
>> +                                       NETIF_F_GSO_UDP_TUNNEL |
>>                                         NETIF_F_GSO_UDP_TUNNEL_CSUM |
>>                                         NETIF_F_GSO_PARTIAL);
>>   @@ -2431,7 +2435,18 @@ static netdev_features_t
>> mlx4_en_features_check(struct sk_buff *skb,
>>                                                 netdev_features_t
>> features)
>>   {
>>         features = vlan_features_check(skb, features);
>> -       return vxlan_features_check(skb, features);
>> +       features = vxlan_features_check(skb, features);
>> +
>> +       /* The ConnectX-3 doesn't support outer IPv6 checksums but it does
>> +        * support inner IPv6 checksums and segmentation so  we need to
>> +        * strip that feature if this is an IPv6 encapsulated frame.
>> +        */
>> +       if (skb->encapsulation &&
>> +           (skb->ip_summed == CHECKSUM_PARTIAL) &&
>> +           (ip_hdr(skb)->version != 4))
>> +               features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
>
> Dejavu, didn't you fix this already in harmonize_features, in
> i.e, it is enough to do here:
>
> if (skb->encapsulation && (skb->ip_summed == CHECKSUM_PARTIAL))
>             features &= ~NETIF_F_IPV6_CSUM;
>

So what this patch is doing is enabling inner IPv6 header offloads.
Up above we set the NETIF_F_IPV6_CSUM bit and we want it to stay set
unless we have an outer IPv6 header because the inner headers may
still need that bit set.  If I did what you suggest it strips IPv6
checksum support for inner headers and if we have to use GSO partial I
ended up encountering some of the other bugs that I have fixed for GSO
partial where either sg or csum are not defined.
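
To make the difference concrete, here is a minimal userspace sketch of the check in the patch. The F_* bits are hypothetical stand-ins for the NETIF_F_* flags with made-up values, and strip_for_outer() is an illustrative name, not a driver function. The point is that the features are left alone for an outer IPv4 header, so inner IPv6 checksum and TSO stay available, and everything in the checksum and GSO masks is dropped only when the outer header is not IPv4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for netdev feature bits; values are illustrative. */
#define F_IP_CSUM	(1u << 0)
#define F_IPV6_CSUM	(1u << 1)
#define F_TSO		(1u << 2)
#define F_TSO6		(1u << 3)
#define F_CSUM_MASK	(F_IP_CSUM | F_IPV6_CSUM)
#define F_GSO_MASK	(F_TSO | F_TSO6)

/* Clear checksum and GSO offloads only for an outer header that is not IPv4. */
static inline uint32_t strip_for_outer(uint32_t features, bool encapsulated,
				       bool csum_partial,
				       uint8_t outer_ip_version)
{
	assert(outer_ip_version == 4 || outer_ip_version == 6);

	if (encapsulated && csum_partial && outer_ip_version != 4)
		features &= ~(F_CSUM_MASK | F_GSO_MASK);

	return features;
}
```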

>
>> +
>> +       return features;
>>   }
>>   #endif
>>   diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> index c0d7b7296236..c9f5388ea22a 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> @@ -41,6 +41,7 @@
>>   #include <linux/vmalloc.h>
>>   #include <linux/tcp.h>
>>   #include <linux/ip.h>
>> +#include <linux/ipv6.h>
>>   #include <linux/moduleparam.h>
>>     #include "mlx4_en.h"
>> @@ -918,8 +919,18 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct
>> net_device *dev)
>>                                  tx_ind, fragptr);
>>         if (skb->encapsulation) {
>> -               struct iphdr *ipv4 = (struct iphdr
>> *)skb_inner_network_header(skb);
>> -               if (ipv4->protocol == IPPROTO_TCP || ipv4->protocol ==
>> IPPROTO_UDP)
>> +               union {
>> +                       struct iphdr *v4;
>> +                       struct ipv6hdr *v6;
>> +                       unsigned char *hdr;
>> +               } ip;
>> +               u8 proto;
>> +
>> +               ip.hdr = skb_inner_network_header(skb);
>> +               proto = (ip.v4->version == 4) ? ip.v4->protocol :
>> +                                               ip.v6->nexthdr;
>> +
>> +               if (proto == IPPROTO_TCP || proto == IPPROTO_UDP)
>>                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP |
>> MLX4_WQE_CTRL_ILP);
>>                 else
>>                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>
>
> basically this is a bug fix, I don't know why the original author assumed it
> will be ipv4 !

Because the feature flags didn't allow it any other way.  I am adding
the NETIF_F_TSO6 and NETIF_F_IPV6_CSUM flags in hw_enc_features and so
situations such as this couldn't be encountered until you start adding
those flags.
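
For reference, the protocol lookup that the union in the patch performs can be sketched on raw header bytes. This is a userspace illustration with a hypothetical helper name; it relies only on the standard header layouts (the version is the high nibble of the first byte, the IPv4 Protocol field is at byte 9, the IPv6 Next Header field is at byte 6). Like the patch, it does not walk IPv6 extension headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_TCP	6	/* IPPROTO_TCP */
#define PROTO_UDP	17	/* IPPROTO_UDP */

/* Return the L4 protocol number from an inner IPv4 or IPv6 header. */
static inline uint8_t inner_l4_proto(const uint8_t *hdr, size_t len)
{
	assert(hdr != NULL && len >= 20);

	if ((hdr[0] >> 4) == 4)
		return hdr[9];	/* IPv4: Protocol field */

	return hdr[6];		/* IPv6: Next Header field */
}
```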

- Alex

* [PATCH v2] net/mlx5e: avoid stack overflow in mlx5e_open_channels
From: Arnd Bergmann @ 2016-04-26 15:52 UTC (permalink / raw)
  To: Saeed Mahameed, Matan Barak, Leon Romanovsky
  Cc: Arnd Bergmann, David S. Miller, Achiad Shochat, Or Gerlitz,
	Amir Vadai, Tariq Toukan, netdev, linux-rdma, linux-kernel

struct mlx5e_channel_param is a large structure that is allocated
on the stack of mlx5e_open_channels, and with a recent change
it has grown beyond the warning size for the maximum stack
that a single function should use:

mellanox/mlx5/core/en_main.c: In function 'mlx5e_open_channels':
mellanox/mlx5/core/en_main.c:1325:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

The function is already using dynamic allocation and is not in
a fast path, so the easiest workaround is to use another kzalloc
for allocating the channel parameters.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: d3c9bc2743dc ("net/mlx5e: Added ICO SQs")
---
v2: move allocation back into caller, as suggested by Saeed Mahameed

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index af8c54d2e99c..7106006c792b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1266,13 +1266,10 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
 	param->icosq = true;
 }
 
-static void mlx5e_build_channel_param(struct mlx5e_priv *priv,
-				      struct mlx5e_channel_param *cparam)
+static void mlx5e_build_channel_param(struct mlx5e_priv *priv, struct mlx5e_channel_param *cparam)
 {
 	u8 icosq_log_wq_sz = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
 
-	memset(cparam, 0, sizeof(*cparam));
-
 	mlx5e_build_rq_param(priv, &cparam->rq);
 	mlx5e_build_sq_param(priv, &cparam->sq);
 	mlx5e_build_icosq_param(priv, &cparam->icosq, icosq_log_wq_sz);
@@ -1283,7 +1280,7 @@ static void mlx5e_build_channel_param(struct mlx5e_priv *priv,
 
 static int mlx5e_open_channels(struct mlx5e_priv *priv)
 {
-	struct mlx5e_channel_param cparam;
+	struct mlx5e_channel_param *cparam;
 	int nch = priv->params.num_channels;
 	int err = -ENOMEM;
 	int i;
@@ -1295,12 +1292,15 @@ static int mlx5e_open_channels(struct mlx5e_priv *priv)
 	priv->txq_to_sq_map = kcalloc(nch * priv->params.num_tc,
 				      sizeof(struct mlx5e_sq *), GFP_KERNEL);
 
-	if (!priv->channel || !priv->txq_to_sq_map)
+	cparam = kzalloc(sizeof(struct mlx5e_channel_param), GFP_KERNEL);
+
+	if (!priv->channel || !priv->txq_to_sq_map || !cparam)
 		goto err_free_txq_to_sq_map;
 
-	mlx5e_build_channel_param(priv, &cparam);
+	mlx5e_build_channel_param(priv, cparam);
+
 	for (i = 0; i < nch; i++) {
-		err = mlx5e_open_channel(priv, i, &cparam, &priv->channel[i]);
+		err = mlx5e_open_channel(priv, i, cparam, &priv->channel[i]);
 		if (err)
 			goto err_close_channels;
 	}
@@ -1311,6 +1311,7 @@ static int mlx5e_open_channels(struct mlx5e_priv *priv)
 			goto err_close_channels;
 	}
 
+	kfree(cparam);
 	return 0;
 
 err_close_channels:
@@ -1320,6 +1321,7 @@ err_close_channels:
 err_free_txq_to_sq_map:
 	kfree(priv->txq_to_sq_map);
 	kfree(priv->channel);
+	kfree(cparam);
 
 	return err;
 }
-- 
2.7.0
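
The pattern used by the patch, moving a large parameter block from the stack to a zeroed heap allocation that is freed on every exit path, can be sketched in userspace C. This is a minimal sketch, not the driver code: struct channel_param, open_channel() and open_channels() are hypothetical stand-ins, calloc()/free() take the place of kzalloc()/kfree(), and the fail_at argument exists only to exercise the error path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a parameter block too large for the stack. */
struct channel_param {
	unsigned char blob[1072];
};

/* Hypothetical per-channel open: fails with -EIO for channel fail_at. */
static inline int open_channel(const struct channel_param *p, int idx,
			       int fail_at)
{
	(void)p;
	return idx == fail_at ? -EIO : 0;
}

static inline int open_channels(int nch, int fail_at)
{
	struct channel_param *param;
	int err = 0;
	int i;

	assert(nch >= 0);

	/* Zeroed heap allocation replaces the on-stack struct plus memset. */
	param = calloc(1, sizeof(*param));
	if (!param)
		return -ENOMEM;

	for (i = 0; i < nch; i++) {
		err = open_channel(param, i, fail_at);
		if (err)
			break;
	}

	/* The parameters are only needed during setup, so free on all paths. */
	free(param);
	return err;
}
```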

* ipv6 ifdown change is back in...
From: David Miller @ 2016-04-26 15:56 UTC (permalink / raw)
  To: dsa; +Cc: netdev


Ok, I thought things over last night and decided to put the ipv6 ifdown
changes back in and apply that last bug fix.

Thanks.

* Re: [PATCH net-next 0/8] netlink: align attributes when needed (patchset #3)
From: David Miller @ 2016-04-26 16:02 UTC (permalink / raw)
  To: nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, sd-y1jBWg8GRStKuXlAQpz2QA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, jhs-jkUAjuhPggJWk0Htik3J/w,
	jack-IBi9RG/b67k, johannes-cdvu00un1VgdHxzADdlk8Q,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA,
	kvalo-sgV2jX0FEOL9JmXXK+q4OQ, drbd-dev-cunTk1MwBs8qoQakbn7OcQ
In-Reply-To: <1461657978-13360-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue, 26 Apr 2016 10:06:10 +0200

> The last user of nla_put_u64() is block/drbd. This module does not use
> standard netlink API (see all the stuff in include/linux/genl_magic_struct.h
> and include/linux/genl_magic_func.h).

Yet another example where doing things in a special unique way creates
headaches and pain for everyone... sigh.

> I didn't modify it because it seems hard to do it without testing
> and fully understanding the context (for example, why
> include/linux/drbd_genl.h is not part of uapi?).  Any thoughts?

I think you'll need to work with the drbd maintainer(s) to resolve
this and test the result.

Series applied, thanks.

* Re: [PATCH v8 net-next 1/1] hv_sock: introduce Hyper-V Sockets
From: Cathy Avery @ 2016-04-26 16:19 UTC (permalink / raw)
  To: decui, gregkh, davem, netdev, linux-kernel, devel, olaf,
	Jason Wang, K. Y. Srinivasan, haiyangz, vkuznets, joe
In-Reply-To: <1460079411-31982-1-git-send-email-decui@microsoft.com>

Hi,

I will be working with Dexuan to possibly port this functionality into RHEL.

Here are my initial comments. Mostly stylistic. They are prefaced by CAA.

Thanks,

Cathy Avery

On 04/07/2016 09:36 PM, Dexuan Cui wrote:
> Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
> mechanism between the host and the guest. It's somewhat like TCP over
> VMBus, but the transportation layer (VMBus) is much simpler than IP.
>
> With Hyper-V Sockets, applications between the host and the guest can talk
> to each other directly by the traditional BSD-style socket APIs.
>
> Hyper-V Sockets is only available on new Windows hosts, like Windows Server
> 2016. More info is in this article "Make your own integration services":
> https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service
>
> The patch implements the necessary support in the guest side by introducing
> a new socket address family AF_HYPERV.
>
> Signed-off-by: Dexuan Cui<decui@microsoft.com>
> Cc: "K. Y. Srinivasan"<kys@microsoft.com>
> Cc: Haiyang Zhang<haiyangz@microsoft.com>
> Cc: Vitaly Kuznetsov<vkuznets@redhat.com>
> ---
>   MAINTAINERS                 |    2 +
>   include/linux/hyperv.h      |   16 +
>   include/linux/socket.h      |    5 +-
>   include/net/af_hvsock.h     |   51 ++
>   include/uapi/linux/hyperv.h |   25 +
>   net/Kconfig                 |    1 +
>   net/Makefile                |    1 +
>   net/hv_sock/Kconfig         |   10 +
>   net/hv_sock/Makefile        |    3 +
>   net/hv_sock/af_hvsock.c     | 1483 +++++++++++++++++++++++++++++++++++++++++++
>   10 files changed, 1595 insertions(+), 2 deletions(-)
>   create mode 100644 include/net/af_hvsock.h
>   create mode 100644 net/hv_sock/Kconfig
>   create mode 100644 net/hv_sock/Makefile
>   create mode 100644 net/hv_sock/af_hvsock.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 67d99dd..7b6f203 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5267,7 +5267,9 @@ F:	drivers/pci/host/pci-hyperv.c
>   F:	drivers/net/hyperv/
>   F:	drivers/scsi/storvsc_drv.c
>   F:	drivers/video/fbdev/hyperv_fb.c
> +F:	net/hv_sock/
>   F:	include/linux/hyperv.h
> +F:	include/net/af_hvsock.h
>   F:	tools/hv/
>   F:	Documentation/ABI/stable/sysfs-bus-vmbus
>   
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index aa0fadc..b92439d 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1338,4 +1338,20 @@ extern __u32 vmbus_proto_version;
>   
>   int vmbus_send_tl_connect_request(const uuid_le *shv_guest_servie_id,
>   				  const uuid_le *shv_host_servie_id);
> +struct vmpipe_proto_header {
> +	u32 pkt_type;
> +	u32 data_size;
> +} __packed;
> +
> +#define HVSOCK_HEADER_LEN	(sizeof(struct vmpacket_descriptor) + \
> +				 sizeof(struct vmpipe_proto_header))
> +
> +/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
> +#define PREV_INDICES_LEN	(sizeof(u64))
> +
> +#define HVSOCK_PKT_LEN(payload_len)	(HVSOCK_HEADER_LEN + \
> +					ALIGN((payload_len), 8) + \
> +					PREV_INDICES_LEN)
> +#define HVSOCK_MIN_PKT_LEN	HVSOCK_PKT_LEN(1)
> +
>   #endif /* _HYPERV_H */
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 73bf6c6..88b1ccd 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -201,8 +201,8 @@ struct ucred {
>   #define AF_NFC		39	/* NFC sockets			*/
>   #define AF_VSOCK	40	/* vSockets			*/
>   #define AF_KCM		41	/* Kernel Connection Multiplexor*/
> -
> -#define AF_MAX		42	/* For now.. */
> +#define AF_HYPERV	42	/* Hyper-V Sockets		*/
> +#define AF_MAX		43	/* For now.. */
>   
>   /* Protocol families, same as address families. */
>   #define PF_UNSPEC	AF_UNSPEC
> @@ -249,6 +249,7 @@ struct ucred {
>   #define PF_NFC		AF_NFC
>   #define PF_VSOCK	AF_VSOCK
>   #define PF_KCM		AF_KCM
> +#define PF_HYPERV	AF_HYPERV
>   #define PF_MAX		AF_MAX
>   
>   /* Maximum queue length specifiable by listen.  */
> diff --git a/include/net/af_hvsock.h b/include/net/af_hvsock.h
> new file mode 100644
> index 0000000..a5aa28d
> --- /dev/null
> +++ b/include/net/af_hvsock.h
> @@ -0,0 +1,51 @@
> +#ifndef __AF_HVSOCK_H__
> +#define __AF_HVSOCK_H__
> +
> +#include <linux/kernel.h>
> +#include <linux/hyperv.h>
> +#include <net/sock.h>
> +
> +#define VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV (5 * PAGE_SIZE)
> +#define VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND (5 * PAGE_SIZE)
> +
> +#define HVSOCK_RCV_BUF_SZ	VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV
> +#define HVSOCK_SND_BUF_SZ	PAGE_SIZE
> +
> +#define sk_to_hvsock(__sk)    ((struct hvsock_sock *)(__sk))
> +#define hvsock_to_sk(__hvsk)   ((struct sock *)(__hvsk))
> +
> +struct hvsock_sock {
> +	/* sk must be the first member. */
> +	struct sock sk;
> +
> +	struct sockaddr_hv local_addr;
> +	struct sockaddr_hv remote_addr;
> +
> +	/* protected by the global hvsock_mutex */
> +	struct list_head bound_list;
> +	struct list_head connected_list;
> +
> +	struct list_head accept_queue;
> +	/* used by enqueue and dequeue */
> +	struct mutex accept_queue_mutex;
> +
> +	struct delayed_work dwork;
> +
> +	u32 peer_shutdown;
> +
> +	struct vmbus_channel *channel;
> +
> +	struct {
> +		struct vmpipe_proto_header hdr;
> +		char buf[HVSOCK_SND_BUF_SZ];
> +	} __packed send;
> +
> +	struct {
> +		struct vmpipe_proto_header hdr;
> +		char buf[HVSOCK_RCV_BUF_SZ];
> +		unsigned int data_len;
> +		unsigned int data_offset;
> +	} __packed recv;
> +};
> +
> +#endif /* __AF_HVSOCK_H__ */
> diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
> index e347b24..f1d0bca 100644
> --- a/include/uapi/linux/hyperv.h
> +++ b/include/uapi/linux/hyperv.h
> @@ -26,6 +26,7 @@
>   #define _UAPI_HYPERV_H
>   
>   #include <linux/uuid.h>
> +#include <linux/socket.h>
>   
>   /*
>    * Framework version for util services.
> @@ -396,4 +397,28 @@ struct hv_kvp_ip_msg {
>   	struct hv_kvp_ipaddr_value      kvp_ip_val;
>   } __attribute__((packed));
>   
> +/*
> + * This is the address format of Hyper-V Sockets.
> + * Note: here we just borrow the kernel's built-in type uuid_le. When
> + * an application calls bind() or connect(), the 2 GUID members of struct
> + * sockaddr_hv must be in GUID format.
> + * The GUID format differs from the UUID format only in the byte order of
> + * the first 3 fields. Refer to:
> + * https://en.wikipedia.org/wiki/Globally_unique_identifier
> + */
> +#define guid_t uuid_le
> +struct sockaddr_hv {
> +	__kernel_sa_family_t	shv_family;  /* Address family		*/
> +	__le16		reserved;	     /* Must be Zero		*/
> +	guid_t		shv_vm_id;	     /* Not used. Must be Zero. */
> +	guid_t		shv_service_id;	     /* Service ID		*/
> +};
> +
> +#define SHV_VMID_GUEST	NULL_UUID_LE
> +#define SHV_VMID_HOST	NULL_UUID_LE
> +
> +#define SHV_SERVICE_ID_ANY	NULL_UUID_LE
> +
> +#define SHV_PROTO_RAW		1
> +
>   #endif /* _UAPI_HYPERV_H */
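The byte-order remark in the sockaddr_hv comment can be illustrated with a short userspace sketch. It assumes the 16 bytes are laid out as a 4-byte field, two 2-byte fields, and 8 trailing bytes, and that only the first three fields are byte-swapped between the two formats. demo_uuid_guid_swap() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert 16 raw bytes between UUID order (big-endian fields) and GUID
 * order (first three fields little-endian). The swap is its own inverse,
 * so the same function converts in both directions.
 * in and out must not alias.
 */
void demo_uuid_guid_swap(const uint8_t in[16], uint8_t out[16])
{
	assert(in != out);

	/* First field: 4 bytes, reversed. */
	out[0] = in[3];
	out[1] = in[2];
	out[2] = in[1];
	out[3] = in[0];

	/* Second field: 2 bytes, reversed. */
	out[4] = in[5];
	out[5] = in[4];

	/* Third field: 2 bytes, reversed. */
	out[6] = in[7];
	out[7] = in[6];

	/* Remaining 8 bytes are identical in both formats. */
	memcpy(out + 8, in + 8, 8);
}
```

A userspace application that builds its service ID from a textual UUID would need a conversion along these lines before filling in shv_service_id.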
> diff --git a/net/Kconfig b/net/Kconfig
> index a8934d8..68d13d7 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -231,6 +231,7 @@ source "net/dns_resolver/Kconfig"
>   source "net/batman-adv/Kconfig"
>   source "net/openvswitch/Kconfig"
>   source "net/vmw_vsock/Kconfig"
> +source "net/hv_sock/Kconfig"
>   source "net/netlink/Kconfig"
>   source "net/mpls/Kconfig"
>   source "net/hsr/Kconfig"
> diff --git a/net/Makefile b/net/Makefile
> index 81d1411..d115c31 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -70,6 +70,7 @@ obj-$(CONFIG_BATMAN_ADV) += batman-adv/
>   obj-$(CONFIG_NFC) += nfc/
>   obj-$(CONFIG_OPENVSWITCH) += openvswitch/
>   obj-$(CONFIG_VSOCKETS) += vmw_vsock/
> +obj-$(CONFIG_HYPERV_SOCK) += hv_sock/
>   obj-$(CONFIG_MPLS) += mpls/
>   obj-$(CONFIG_HSR) += hsr/
>   ifneq ($(CONFIG_NET_SWITCHDEV),)
> diff --git a/net/hv_sock/Kconfig b/net/hv_sock/Kconfig
> new file mode 100644
> index 0000000..1f41848
> --- /dev/null
> +++ b/net/hv_sock/Kconfig
> @@ -0,0 +1,10 @@
> +config HYPERV_SOCK
> +	tristate "Hyper-V Sockets"
> +	depends on HYPERV
> +	default m if HYPERV
> +	help
> +	  Hyper-V Sockets is somewhat like TCP over VMBus, allowing
> +	  communication between Linux guest and Hyper-V host without TCP/IP.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called hv_sock.
> diff --git a/net/hv_sock/Makefile b/net/hv_sock/Makefile
> new file mode 100644
> index 0000000..716c012
> --- /dev/null
> +++ b/net/hv_sock/Makefile
> @@ -0,0 +1,3 @@
> +obj-$(CONFIG_HYPERV_SOCK) += hv_sock.o
> +
> +hv_sock-y += af_hvsock.o
> diff --git a/net/hv_sock/af_hvsock.c b/net/hv_sock/af_hvsock.c
> new file mode 100644
> index 0000000..185a382
> --- /dev/null
> +++ b/net/hv_sock/af_hvsock.c
> @@ -0,0 +1,1483 @@
> +/*
> + * Hyper-V Sockets -- a socket-based communication channel between the
> + * Hyper-V host and the virtual machines running on it.
> + *
> + * Copyright(c) 2016, Microsoft Corporation. All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + *
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. The name of the author may not be used to endorse or promote
> + *    products derived from this software without specific prior written
> + *    permission.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
> + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
> + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
> + * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
> + * POSSIBILITY OF SUCH DAMAGE.
> + */
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <net/af_hvsock.h>
> +
> +static struct proto hvsock_proto = {
> +	.name = "HV_SOCK",
> +	.owner = THIS_MODULE,
> +	.obj_size = sizeof(struct hvsock_sock),
> +};
> +
> +#define SS_LISTEN 255
> +
> +static LIST_HEAD(hvsock_bound_list);
> +static LIST_HEAD(hvsock_connected_list);
> +static DEFINE_MUTEX(hvsock_mutex);
> +
> +static bool uuid_equals(uuid_le u1, uuid_le u2)
> +{
> +	return !uuid_le_cmp(u1, u2);
> +}
> +
> +/* NOTE: hvsock_mutex must be held when the below helper functions, whose
> + * names begin with __hvsock, are invoked.
> + */
> +static void __hvsock_insert_bound(struct list_head *list,
> +				  struct hvsock_sock *hvsk)
> +{
> +	sock_hold(&hvsk->sk);
> +	list_add(&hvsk->bound_list, list);
> +}
> +
> +static void __hvsock_insert_connected(struct list_head *list,
> +				      struct hvsock_sock *hvsk)
> +{
> +	sock_hold(&hvsk->sk);
> +	list_add(&hvsk->connected_list, list);
> +}
> +
> +static void __hvsock_remove_bound(struct hvsock_sock *hvsk)
> +{
> +	list_del_init(&hvsk->bound_list);
> +	sock_put(&hvsk->sk);
> +}
> +
> +static void __hvsock_remove_connected(struct hvsock_sock *hvsk)
> +{
> +	list_del_init(&hvsk->connected_list);
> +	sock_put(&hvsk->sk);
> +}
> +
> +static struct sock *__hvsock_find_bound_socket(const struct sockaddr_hv *addr)
> +{
> +	struct hvsock_sock *hvsk;
> +
> +	list_for_each_entry(hvsk, &hvsock_bound_list, bound_list) {
> +		if (uuid_equals(addr->shv_service_id,
> +				hvsk->local_addr.shv_service_id))
> +			return hvsock_to_sk(hvsk);
> +	}
> +	return NULL;
> +}
> +
> +static struct sock *__hvsock_find_connected_socket_by_channel(
> +	const struct vmbus_channel *channel)
> +{
> +	struct hvsock_sock *hvsk;
> +
> +	list_for_each_entry(hvsk, &hvsock_connected_list, connected_list) {
> +		if (hvsk->channel == channel)
> +			return hvsock_to_sk(hvsk);
> +	}
> +	return NULL;
> +}
> +
> +static bool __hvsock_in_bound_list(struct hvsock_sock *hvsk)
> +{
> +	return !list_empty(&hvsk->bound_list);
> +}
> +
> +static bool __hvsock_in_connected_list(struct hvsock_sock *hvsk)
> +{
> +	return !list_empty(&hvsk->connected_list);
> +}
> +
> +static void hvsock_insert_connected(struct hvsock_sock *hvsk)
> +{
> +	__hvsock_insert_connected(&hvsock_connected_list, hvsk);
> +}
> +
> +static
> +void hvsock_enqueue_accept(struct sock *listener, struct sock *connected)
> +{
> +	struct hvsock_sock *hvlistener;
> +	struct hvsock_sock *hvconnected;
> +
> +	hvlistener = sk_to_hvsock(listener);
> +	hvconnected = sk_to_hvsock(connected);
> +
> +	sock_hold(connected);
> +	sock_hold(listener);
> +
> +	mutex_lock(&hvlistener->accept_queue_mutex);
> +	list_add_tail(&hvconnected->accept_queue, &hvlistener->accept_queue);
> +	listener->sk_ack_backlog++;
> +	mutex_unlock(&hvlistener->accept_queue_mutex);
> +}
> +
> +static struct sock *hvsock_dequeue_accept(struct sock *listener)
> +{
> +	struct hvsock_sock *hvlistener;
> +	struct hvsock_sock *hvconnected;
> +
> +	hvlistener = sk_to_hvsock(listener);
> +
> +	mutex_lock(&hvlistener->accept_queue_mutex);
> +
> +	if (list_empty(&hvlistener->accept_queue)) {
> +		mutex_unlock(&hvlistener->accept_queue_mutex);
> +		return NULL;
> +	}
> +
> +	hvconnected = list_entry(hvlistener->accept_queue.next,
> +				 struct hvsock_sock, accept_queue);
> +
> +	list_del_init(&hvconnected->accept_queue);
> +	listener->sk_ack_backlog--;
> +
> +	mutex_unlock(&hvlistener->accept_queue_mutex);
> +
> +	sock_put(listener);
> +	/* The caller will need a reference on the connected socket so we let
> +	 * it call sock_put().
> +	 */
> +
> +	return hvsock_to_sk(hvconnected);
> +}
> +
> +static bool hvsock_is_accept_queue_empty(struct sock *sk)
> +{
> +	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
> +	int ret;
> +
> +	mutex_lock(&hvsk->accept_queue_mutex);
> +	ret = list_empty(&hvsk->accept_queue);
> +	mutex_unlock(&hvsk->accept_queue_mutex);
> +
> +	return ret;
> +}
> +
> +static void hvsock_addr_init(struct sockaddr_hv *addr, uuid_le service_id)
> +{
> +	memset(addr, 0, sizeof(*addr));
> +	addr->shv_family = AF_HYPERV;
> +	addr->shv_service_id = service_id;
> +}
> +
> +static int hvsock_addr_validate(const struct sockaddr_hv *addr)
> +{
> +	if (!addr)
> +		return -EFAULT;
> +
> +	if (addr->shv_family != AF_HYPERV)
> +		return -EAFNOSUPPORT;
> +
> +	if (addr->reserved != 0)
> +		return -EINVAL;
> +
> +	if (!uuid_equals(addr->shv_vm_id, NULL_UUID_LE))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static bool hvsock_addr_bound(const struct sockaddr_hv *addr)
> +{
> +	return !uuid_equals(addr->shv_service_id, SHV_SERVICE_ID_ANY);
> +}
> +
> +static int hvsock_addr_cast(const struct sockaddr *addr, size_t len,
> +			    struct sockaddr_hv **out_addr)
> +{
> +	if (len < sizeof(**out_addr))
> +		return -EFAULT;
> +
> +	*out_addr = (struct sockaddr_hv *)addr;
> +	return hvsock_addr_validate(*out_addr);
> +}
> +
> +static int __hvsock_do_bind(struct hvsock_sock *hvsk,
> +			    struct sockaddr_hv *addr)
> +{
> +	struct sockaddr_hv hv_addr;
> +	int ret = 0;
> +
> +	hvsock_addr_init(&hv_addr, addr->shv_service_id);
> +
> +	mutex_lock(&hvsock_mutex);
> +
> +	if (uuid_equals(addr->shv_service_id, SHV_SERVICE_ID_ANY)) {
> +		do {
> +			uuid_le_gen(&hv_addr.shv_service_id);
> +		} while (__hvsock_find_bound_socket(&hv_addr));
> +	} else {
> +		if (__hvsock_find_bound_socket(&hv_addr)) {
> +			ret = -EADDRINUSE;
> +			goto out;
> +		}
> +	}
> +
> +	hvsock_addr_init(&hvsk->local_addr, hv_addr.shv_service_id);
> +	__hvsock_insert_bound(&hvsock_bound_list, hvsk);
> +
> +out:
> +	mutex_unlock(&hvsock_mutex);
> +
> +	return ret;
> +}
> +
> +static int __hvsock_bind(struct sock *sk, struct sockaddr_hv *addr)
> +{
> +	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
> +	int ret;
> +
> +	if (hvsock_addr_bound(&hvsk->local_addr))
> +		return -EINVAL;
> +
> +	switch (sk->sk_socket->type) {
> +	case SOCK_STREAM:
> +		ret = __hvsock_do_bind(hvsk, addr);
> +		break;
> +
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +/* Autobind this socket to the local address if necessary. */
> +static int hvsock_auto_bind(struct hvsock_sock *hvsk)
> +{
> +	struct sock *sk = hvsock_to_sk(hvsk);
> +	struct sockaddr_hv local_addr;
> +
> +	if (hvsock_addr_bound(&hvsk->local_addr))
> +		return 0;
> +	hvsock_addr_init(&local_addr, SHV_SERVICE_ID_ANY);
> +	return __hvsock_bind(sk, &local_addr);
> +}
> +
> +static void hvsock_sk_destruct(struct sock *sk)
> +{
> +	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
> +	struct vmbus_channel *channel = hvsk->channel;
> +
> +	if (!channel)
> +		return;
> +
> +	vmbus_hvsock_device_unregister(channel);
> +}
> +
> +static void __hvsock_release(struct sock *sk)
> +{
> +	struct hvsock_sock *hvsk;
> +	struct sock *pending;
> +
> +	hvsk = sk_to_hvsock(sk);
> +
> +	mutex_lock(&hvsock_mutex);
> +	if (__hvsock_in_bound_list(hvsk))
> +		__hvsock_remove_bound(hvsk);
> +
> +	if (__hvsock_in_connected_list(hvsk))
> +		__hvsock_remove_connected(hvsk);
> +	mutex_unlock(&hvsock_mutex);
> +
> +	lock_sock(sk);
> +	sock_orphan(sk);
> +	sk->sk_shutdown = SHUTDOWN_MASK;
> +
> +	/* Clean up any sockets that never were accepted. */
> +	while ((pending = hvsock_dequeue_accept(sk)) != NULL) {
> +		__hvsock_release(pending);
> +		sock_put(pending);
> +	}
> +
> +	release_sock(sk);
> +	sock_put(sk);
> +}
> +
> +static int hvsock_release(struct socket *sock)
> +{
> +	/* If accept() is interrupted by a signal, the temporary socket
> +	 * struct's sock->sk is NULL.
> +	 */
> +	if (sock->sk) {
> +		__hvsock_release(sock->sk);
> +		sock->sk = NULL;
> +	}
> +
> +	sock->state = SS_FREE;
> +	return 0;
> +}
> +
> +static struct sock *__hvsock_create(struct net *net, struct socket *sock,
> +				    gfp_t priority, unsigned short type)
> +{
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +
> +	sk = sk_alloc(net, AF_HYPERV, priority, &hvsock_proto, 0);
> +	if (!sk)
> +		return NULL;
> +
> +	sock_init_data(sock, sk);
> +
> +	/* sk->sk_type is normally set in sock_init_data, but only if sock is
> +	 * non-NULL. We make sure that our sockets always have a type by
> +	 * setting it here if needed.
> +	 */
> +	if (!sock)
> +		sk->sk_type = type;
> +
> +	hvsk = sk_to_hvsock(sk);
> +	hvsock_addr_init(&hvsk->local_addr, SHV_SERVICE_ID_ANY);
> +	hvsock_addr_init(&hvsk->remote_addr, SHV_SERVICE_ID_ANY);
> +
> +	sk->sk_destruct = hvsock_sk_destruct;
> +
> +	/* It looks like a stream-based socket doesn't need this. */
> +	sk->sk_backlog_rcv = NULL;
> +
> +	sk->sk_state = 0;
> +	sock_reset_flag(sk, SOCK_DONE);
> +
> +	INIT_LIST_HEAD(&hvsk->bound_list);
> +	INIT_LIST_HEAD(&hvsk->connected_list);
> +
> +	INIT_LIST_HEAD(&hvsk->accept_queue);
> +	mutex_init(&hvsk->accept_queue_mutex);
> +
> +	hvsk->peer_shutdown = 0;
> +
> +	hvsk->recv.data_len = 0;
> +	hvsk->recv.data_offset = 0;
> +
> +	return sk;
> +}
> +
> +static int hvsock_bind(struct socket *sock, struct sockaddr *addr,
> +		       int addr_len)
> +{
> +	struct sockaddr_hv *hv_addr;
> +	struct sock *sk;
> +	int ret;
> +
> +	sk = sock->sk;
> +
> +	if (hvsock_addr_cast(addr, addr_len, &hv_addr) != 0)
> +		return -EINVAL;
> +
> +	lock_sock(sk);
> +	ret = __hvsock_bind(sk, hv_addr);
> +	release_sock(sk);
> +
> +	return ret;
> +}
> +
> +static int hvsock_getname(struct socket *sock,
> +			  struct sockaddr *addr, int *addr_len, int peer)
> +{
> +	struct sockaddr_hv *hv_addr;
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +	int ret;
> +
> +	sk = sock->sk;
> +	hvsk = sk_to_hvsock(sk);
> +	ret = 0;
> +
> +	lock_sock(sk);
> +
> +	if (peer) {
> +		if (sock->state != SS_CONNECTED) {
> +			ret = -ENOTCONN;
> +			goto out;
> +		}
> +		hv_addr = &hvsk->remote_addr;
> +	} else {
> +		hv_addr = &hvsk->local_addr;
> +	}
> +
> +	__sockaddr_check_size(sizeof(*hv_addr));
> +
> +	memcpy(addr, hv_addr, sizeof(*hv_addr));
> +	*addr_len = sizeof(*hv_addr);
> +
> +out:
> +	release_sock(sk);
> +	return ret;
> +}
> +
> +static int hvsock_shutdown(struct socket *sock, int mode)
> +{
> +	struct sock *sk;
> +
> +	if (mode < SHUT_RD || mode > SHUT_RDWR)
> +		return -EINVAL;
> +	/* This maps:
> +	 * SHUT_RD   (0) -> RCV_SHUTDOWN  (1)
> +	 * SHUT_WR   (1) -> SEND_SHUTDOWN (2)
> +	 * SHUT_RDWR (2) -> SHUTDOWN_MASK (3)
> +	 */
> +	++mode;
> +
> +	if (sock->state == SS_UNCONNECTED)
> +		return -ENOTCONN;
> +
> +	sock->state = SS_DISCONNECTING;
> +
> +	sk = sock->sk;
> +
> +	lock_sock(sk);
> +
> +	sk->sk_shutdown |= mode;
> +	sk->sk_state_change(sk);
> +
> +	/* TODO: how to send a FIN if we haven't done that? */
> +	if (mode & SEND_SHUTDOWN)
> +		;
> +
> +	release_sock(sk);
> +
> +	return 0;
> +}
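The ++mode trick in hvsock_shutdown() relies on the SHUT_* values being 0, 1, 2. A minimal userspace sketch of that mapping follows. The DEMO_* constants copy the values stated in the patch comment, and demo_shutdown_mask() is a hypothetical name used only here.

```c
#include <assert.h>
#include <sys/socket.h>	/* SHUT_RD, SHUT_WR, SHUT_RDWR */

/* Flag values as listed in the comment inside hvsock_shutdown(). */
enum {
	DEMO_RCV_SHUTDOWN  = 1,
	DEMO_SEND_SHUTDOWN = 2,
	DEMO_SHUTDOWN_MASK = 3,
};

/* Map a shutdown(2) "how" argument to a shutdown bit mask.
 * Returns -1 for an out-of-range argument.
 */
int demo_shutdown_mask(int how)
{
	/* The +1 mapping only works if the SHUT_* values are 0, 1, 2. */
	assert(SHUT_RD == 0 && SHUT_WR == 1 && SHUT_RDWR == 2);

	if (how < SHUT_RD || how > SHUT_RDWR)
		return -1;

	return how + 1;
}
```

Because SHUT_RDWR maps to 3, it is simply both bits set, so "mode & SEND_SHUTDOWN" is true for SHUT_WR and SHUT_RDWR alike.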
> +
> +static void get_ringbuffer_rw_status(struct vmbus_channel *channel,
> +				     bool *can_read, bool *can_write)
> +{
> +	u32 avl_read_bytes, avl_write_bytes, dummy;
> +
> +	if (can_read) {
> +		hv_get_ringbuffer_availbytes(&channel->inbound,
> +					     &avl_read_bytes,
> +					     &dummy);
> +		*can_read = avl_read_bytes >= HVSOCK_MIN_PKT_LEN;
> +	}
> +
> +	/* We write into the ringbuffer only when we're able to write a
> +	 * payload of 4096 bytes (the actual written payload's length may be
> +	 * less than 4096).
> +	 */
> +	if (can_write) {
> +		hv_get_ringbuffer_availbytes(&channel->outbound,
> +					     &dummy,
> +					     &avl_write_bytes);
> +		*can_write = avl_write_bytes > HVSOCK_PKT_LEN(PAGE_SIZE);
> +	}
> +}
> +
> +static unsigned int hvsock_poll(struct file *file, struct socket *sock,
> +				poll_table *wait)
> +{
> +	struct vmbus_channel *channel;
> +	bool can_read, can_write;
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +	unsigned int mask;
> +
> +	sk = sock->sk;
> +	hvsk = sk_to_hvsock(sk);
> +
> +	poll_wait(file, sk_sleep(sk), wait);
> +	mask = 0;
> +
> +	if (sk->sk_err)
> +		/* Signify that there has been an error on this socket. */
> +		mask |= POLLERR;
> +
> +	/* INET sockets treat local write shutdown and peer write shutdown as a
> +	 * case of POLLHUP set.
> +	 */
> +	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
> +	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> +	     (hvsk->peer_shutdown & SEND_SHUTDOWN))) {
> +		mask |= POLLHUP;
> +	}
> +
> +	if (sk->sk_shutdown & RCV_SHUTDOWN ||
> +	    hvsk->peer_shutdown & SEND_SHUTDOWN) {
> +		mask |= POLLRDHUP;
> +	}
> +
> +	lock_sock(sk);
> +
> +	/* Listening sockets that have connections in their accept
> +	 * queue can be read.
> +	 */
> +	if (sk->sk_state == SS_LISTEN && !hvsock_is_accept_queue_empty(sk))
> +		mask |= POLLIN | POLLRDNORM;
> +
> +	/* The mutex protects against a concurrent hvsock_open_connection() */
> +	mutex_lock(&hvsock_mutex);
> +
> +	channel = hvsk->channel;
> +	if (channel) {
> +		/* If there is something in the queue then we can read */
> +		get_ringbuffer_rw_status(channel, &can_read, &can_write);
> +
> +		if (!can_read && hvsk->recv.data_len > 0)
> +			can_read = true;
> +
> +		if (!(sk->sk_shutdown & RCV_SHUTDOWN) && can_read)
> +			mask |= POLLIN | POLLRDNORM;
> +	} else {
> +		can_read = false;
> +		can_write = false;
> +	}
> +
> +	mutex_unlock(&hvsock_mutex);
> +
> +	/* Sockets whose connections have been closed or terminated should
> +	 * also be considered readable, and we check the shutdown flag for that.
> +	 */
> +	if (sk->sk_shutdown & RCV_SHUTDOWN ||
> +	    hvsk->peer_shutdown & SEND_SHUTDOWN) {
> +		mask |= POLLIN | POLLRDNORM;
> +	}
> +
> +	/* Connected sockets that can produce data can be written. */
> +	if (sk->sk_state == SS_CONNECTED && can_write &&
> +	    !(sk->sk_shutdown & SEND_SHUTDOWN)) {
> +		/* Remove POLLWRBAND since INET sockets are not setting it.
> +		 */
> +		mask |= POLLOUT | POLLWRNORM;
> +	}
> +
> +	/* Simulate INET socket poll behaviors, which sets
> +	 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
> +	 * but local send is not shutdown.
> +	 */
> +	if (sk->sk_state == SS_UNCONNECTED &&
> +	    !(sk->sk_shutdown & SEND_SHUTDOWN))
> +		mask |= POLLOUT | POLLWRNORM;
> +
> +	release_sock(sk);
> +
> +	return mask;
> +}
> +
> +/* This function runs in the tasklet context of process_chn_event() */
> +static void hvsock_on_channel_cb(void *ctx)
> +{
> +	struct sock *sk = (struct sock *)ctx;
> +	struct hvsock_sock *hvsk = sk_to_hvsock(sk);
> +	struct vmbus_channel *channel = hvsk->channel;
> +	bool can_read, can_write;
> +
> +	if (!channel) {
> +		WARN_ONCE(1, "NULL channel! There is a programming bug.\n");
> +		return;
> +	}
> +
> +	get_ringbuffer_rw_status(channel, &can_read, &can_write);
> +
> +	if (can_read)
> +		sk->sk_data_ready(sk);
> +
> +	if (can_write)
> +		sk->sk_write_space(sk);
> +}
> +
> +static void hvsock_close_connection(struct vmbus_channel *channel)
> +{
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +
> +	mutex_lock(&hvsock_mutex);
> +
> +	sk = __hvsock_find_connected_socket_by_channel(channel);
> +
> +	/* The guest has already closed the connection? */
> +	if (!sk)
> +		goto out;
> +
> +	sk->sk_socket->state = SS_UNCONNECTED;
> +	sk->sk_state = SS_UNCONNECTED;
> +	sock_set_flag(sk, SOCK_DONE);
> +
> +	hvsk = sk_to_hvsock(sk);
> +	hvsk->peer_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
> +
> +	sk->sk_state_change(sk);
> +out:
> +	mutex_unlock(&hvsock_mutex);
> +}
> +
> +static int hvsock_open_connection(struct vmbus_channel *channel)
> +{
> +	struct hvsock_sock *hvsk, *new_hvsk;
> +	struct sockaddr_hv hv_addr;
> +	struct sock *sk, *new_sk;
> +
> +	uuid_le *instance, *service_id;
> +	int ret;
> +
> +	instance = &channel->offermsg.offer.if_instance;
> +	service_id = &channel->offermsg.offer.if_type;
> +
> +	hvsock_addr_init(&hv_addr, *instance);
> +
> +	mutex_lock(&hvsock_mutex);
> +
> +	sk = __hvsock_find_bound_socket(&hv_addr);
> +
> +	if (sk) {
> +		/* It is from the guest client's connect() */
> +		if (sk->sk_state != SS_CONNECTING) {
> +			ret = -ENXIO;
> +			goto out;
> +		}
> +
> +		hvsk = sk_to_hvsock(sk);
> +		hvsk->channel = channel;
> +		set_channel_read_state(channel, false);
> +		vmbus_set_chn_rescind_callback(channel,
> +					       hvsock_close_connection);
> +		ret = vmbus_open(channel, VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND,
> +				 VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV, NULL, 0,
> +				 hvsock_on_channel_cb, sk);
> +		if (ret != 0) {
> +			hvsk->channel = NULL;
> +			goto out;
> +		}
> +
> +		set_channel_pending_send_size(channel,
> +					      HVSOCK_PKT_LEN(PAGE_SIZE));
> +		sk->sk_state = SS_CONNECTED;
> +		sk->sk_socket->state = SS_CONNECTED;
> +		hvsock_insert_connected(hvsk);
> +		sk->sk_state_change(sk);
> +		goto out;
> +	}
> +
> +	/* Now we suppose it is from a host client's connect() */
> +	hvsock_addr_init(&hv_addr, *service_id);
> +	sk = __hvsock_find_bound_socket(&hv_addr);
> +
> +	/* No guest server listening? Well, let's ignore the offer */
> +	if (!sk || sk->sk_state != SS_LISTEN) {
> +		ret = -ENXIO;
> +		goto out;
> +	}
> +
> +	if (sk->sk_ack_backlog >= sk->sk_max_ack_backlog) {
> +		ret = -EMFILE;
> +		goto out;
> +	}
> +
> +	new_sk = __hvsock_create(sock_net(sk), NULL, GFP_KERNEL, sk->sk_type);
> +	if (!new_sk) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	new_hvsk = sk_to_hvsock(new_sk);
> +	new_sk->sk_state = SS_CONNECTING;
> +	hvsock_addr_init(&new_hvsk->local_addr, *service_id);
> +	hvsock_addr_init(&new_hvsk->remote_addr, *instance);
> +
> +	set_channel_read_state(channel, false);
> +	new_hvsk->channel = channel;
> +	vmbus_set_chn_rescind_callback(channel, hvsock_close_connection);
> +	ret = vmbus_open(channel, VMBUS_RINGBUFFER_SIZE_HVSOCK_SEND,
> +			 VMBUS_RINGBUFFER_SIZE_HVSOCK_RECV, NULL, 0,
> +			 hvsock_on_channel_cb, new_sk);
> +	if (ret != 0) {
> +		new_hvsk->channel = NULL;
> +		sock_put(new_sk);
> +		goto out;
> +	}
> +	set_channel_pending_send_size(channel, HVSOCK_PKT_LEN(PAGE_SIZE));
> +
> +	new_sk->sk_state = SS_CONNECTED;
> +	hvsock_insert_connected(new_hvsk);
> +	hvsock_enqueue_accept(sk, new_sk);
> +	sk->sk_state_change(sk);
> +out:
> +	mutex_unlock(&hvsock_mutex);
> +	return ret;
> +}
> +
> +static void hvsock_connect_timeout(struct work_struct *work)
> +{
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +
> +	hvsk = container_of(work, struct hvsock_sock, dwork.work);
> +	sk = hvsock_to_sk(hvsk);
> +
> +	lock_sock(sk);
> +	if ((sk->sk_state == SS_CONNECTING) &&
> +	    (sk->sk_shutdown != SHUTDOWN_MASK)) {
> +		sk->sk_state = SS_UNCONNECTED;
> +		sk->sk_err = ETIMEDOUT;
> +		sk->sk_error_report(sk);
> +	}
> +	release_sock(sk);
> +
> +	sock_put(sk);
> +}
> +
> +static int hvsock_connect(struct socket *sock, struct sockaddr *addr,
> +			  int addr_len, int flags)
> +{
> +	struct sockaddr_hv *remote_addr;
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +
> +	DEFINE_WAIT(wait);
> +	long timeout;
> +
> +	int ret = 0;
> +
> +	sk = sock->sk;
> +	hvsk = sk_to_hvsock(sk);
> +
> +	lock_sock(sk);
> +
> +	switch (sock->state) {
> +	case SS_CONNECTED:
> +		ret = -EISCONN;
> +		goto out;
> +	case SS_DISCONNECTING:
> +		ret = -EINVAL;
> +		goto out;
> +	case SS_CONNECTING:
> +		/* This continues on so we can move sock into the SS_CONNECTED
> +		 * state once the connection has completed (at which point err
> +		 * will be set to zero also).  Otherwise, we will either wait
> +		 * for the connection or return -EALREADY should this be a
> +		 * non-blocking call.
> +		 */
> +		ret = -EALREADY;
> +		break;
> +	default:
> +		if ((sk->sk_state == SS_LISTEN) ||
> +		    hvsock_addr_cast(addr, addr_len, &remote_addr) != 0) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* Set the remote address that we are connecting to. */
> +		memcpy(&hvsk->remote_addr, remote_addr,
> +		       sizeof(hvsk->remote_addr));
> +
> +		ret = hvsock_auto_bind(hvsk);
> +		if (ret)
> +			goto out;
> +
> +		sk->sk_state = SS_CONNECTING;
> +
> +		ret = vmbus_send_tl_connect_request(
> +					&hvsk->local_addr.shv_service_id,
> +					&hvsk->remote_addr.shv_service_id);
> +		if (ret < 0)
> +			goto out;
> +
> +		/* Mark sock as connecting and set the error code to in
> +		 * progress in case this is a non-blocking connect.
> +		 */
> +		sock->state = SS_CONNECTING;
> +		ret = -EINPROGRESS;
> +	}
> +

CAA Putting the connection wait into a separate function, if possible,
would look cleaner.
Check out net/llc/af_llc.c.
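A rough userspace model of the shape such a helper could take, only to show the control flow. It is not kernel code: struct demo_sock, the sleep callbacks, and demo_wait_for_connect() are all hypothetical, the callback stands in for release_sock()/schedule_timeout()/lock_sock(), and -EINTR stands in for whatever sock_intr_errno() would return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few socket fields the wait loop reads. */
struct demo_sock {
	bool connected;
	bool signal_pending;
	int err;
};

/* Stand-in for one sleep; returns the remaining timeout. */
typedef long (*demo_sleep_fn)(struct demo_sock *sk, long timeout);

/* Wait until connected, an error is set, a signal arrives, or time runs out.
 * Returns 0 on success or a negative errno value.
 */
int demo_wait_for_connect(struct demo_sock *sk, long timeout,
			  demo_sleep_fn sleep_fn)
{
	assert(sk && sleep_fn);

	while (!sk->connected && sk->err == 0) {
		timeout = sleep_fn(sk, timeout);

		if (sk->signal_pending)
			return -EINTR;
		if (timeout == 0)
			return -ETIMEDOUT;
	}

	return sk->err ? -sk->err : 0;
}

/* Example callback: the peer answers during the first sleep. */
long demo_sleep_then_connect(struct demo_sock *sk, long timeout)
{
	sk->connected = true;
	return timeout - 1;
}

/* Example callback: nothing happens and the timeout expires. */
long demo_sleep_until_timeout(struct demo_sock *sk, long timeout)
{
	(void)sk;
	(void)timeout;
	return 0;
}
```

With the loop isolated like this, hvsock_connect() would only need to map the helper's return value onto the socket state, and the non-blocking case could be handled before the helper is called.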
> +	/* The receive path will handle all communication until we are able to
> +	 * enter the connected state.  Here we wait for the connection to be
> +	 * completed or a notification of an error.
> +	 */
> +	timeout = 30 * HZ;
> +	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> +
> +	while (sk->sk_state != SS_CONNECTED && sk->sk_err == 0) {
> +		if (flags & O_NONBLOCK) {
> +			/* If we're not going to block, we schedule a timeout
> +			 * function to generate a timeout on the connection
> +			 * attempt, in case the peer doesn't respond in a
> +			 * timely manner. We hold on to the socket until the
> +			 * timeout fires.
> +			 */
> +			sock_hold(sk);
> +			INIT_DELAYED_WORK(&hvsk->dwork,
> +					  hvsock_connect_timeout);
> +			schedule_delayed_work(&hvsk->dwork, timeout);
> +
> +			/* Skip ahead to preserve error code set above. */
> +			goto out_wait;
> +		}
> +
> +		release_sock(sk);
> +		timeout = schedule_timeout(timeout);
> +		lock_sock(sk);
> +
> +		if (signal_pending(current)) {
> +			ret = sock_intr_errno(timeout);
> +			goto out_wait_error;
> +		} else if (timeout == 0) {
> +			ret = -ETIMEDOUT;
> +			goto out_wait_error;
> +		}
> +
> +		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> +	}
> +
> +	ret = sk->sk_err ? -sk->sk_err : 0;
> +
> +out_wait_error:
> +	if (ret < 0) {
> +		sk->sk_state = SS_UNCONNECTED;
> +		sock->state = SS_UNCONNECTED;
> +	}
> +out_wait:
> +	finish_wait(sk_sleep(sk), &wait);
> +out:
> +	release_sock(sk);
> +	return ret;
> +}
> +
> +static
> +int hvsock_accept(struct socket *sock, struct socket *newsock, int flags)
> +{
> +	struct hvsock_sock *hvconnected;
> +	struct sock *connected;
> +	struct sock *listener;
> +
> +	DEFINE_WAIT(wait);
> +	long timeout;
> +
> +	int ret = 0;
> +
> +	listener = sock->sk;
> +
> +	lock_sock(listener);
> +
> +	if (sock->type != SOCK_STREAM) {
> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
> +	if (listener->sk_state != SS_LISTEN) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Wait for children sockets to appear; these are the new sockets
> +	 * created upon connection establishment.
> +	 */
> +	timeout = sock_sndtimeo(listener, flags & O_NONBLOCK);
CAA This would be cleaner with the wait in a separate function.
> +	prepare_to_wait(sk_sleep(listener), &wait, TASK_INTERRUPTIBLE);
> +
> +	while ((connected = hvsock_dequeue_accept(listener)) == NULL &&
> +	       listener->sk_err == 0) {
> +		release_sock(listener);
> +		timeout = schedule_timeout(timeout);
> +		lock_sock(listener);
> +
> +		if (signal_pending(current)) {
> +			ret = sock_intr_errno(timeout);
> +			goto out_wait;
> +		} else if (timeout == 0) {
> +			ret = -EAGAIN;
> +			goto out_wait;
> +		}
> +
> +		prepare_to_wait(sk_sleep(listener), &wait, TASK_INTERRUPTIBLE);
> +	}
> +
> +	if (listener->sk_err)
> +		ret = -listener->sk_err;
> +
> +	if (connected) {
> +		lock_sock(connected);
> +		hvconnected = sk_to_hvsock(connected);
> +
> +		/* If the listener socket has received an error, then we should
> +		 * reject this socket and return.  Note that we simply mark the
> +		 * socket rejected, drop our reference, and let the cleanup
> +		 * function handle the cleanup; the fact that we found it in
> +		 * the listener's accept queue guarantees that the cleanup
> +		 * function hasn't run yet.
> +		 */
> +		if (ret) {
> +			release_sock(connected);
> +			sock_put(connected);
> +			goto out_wait;
> +		}
> +
> +		newsock->state = SS_CONNECTED;
> +		sock_graft(connected, newsock);
> +		release_sock(connected);
> +		sock_put(connected);
> +	}
> +
> +out_wait:
> +	finish_wait(sk_sleep(listener), &wait);
> +out:
> +	release_sock(listener);
> +	return ret;
> +}
> +
> +static int hvsock_listen(struct socket *sock, int backlog)
> +{
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +	int ret = 0;
> +
> +	sk = sock->sk;
> +	lock_sock(sk);
> +
> +	if (sock->type != SOCK_STREAM) {
> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
> +	if (sock->state != SS_UNCONNECTED) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (backlog <= 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	/* This is an artificial limit */
> +	if (backlog > 128)
> +		backlog = 128;
> +
> +	hvsk = sk_to_hvsock(sk);
> +	if (!hvsock_addr_bound(&hvsk->local_addr)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	sk->sk_ack_backlog = 0;
> +	sk->sk_max_ack_backlog = backlog;
> +	sk->sk_state = SS_LISTEN;
> +out:
> +	release_sock(sk);
> +	return ret;
> +}
> +
CAA Why don't you set .setsockopt and .getsockopt to sock_no_setsockopt
and sock_no_getsockopt?
> +static int hvsock_setsockopt(struct socket *sock,
> +			     int level,
> +			     int optname,
> +			     char __user *optval, unsigned int optlen)
> +{
> +	return -ENOPROTOOPT;
> +}
> +
> +static int hvsock_getsockopt(struct socket *sock,
> +			     int level,
> +			     int optname,
> +			     char __user *optval, int __user *optlen)
> +{
> +	return -ENOPROTOOPT;
> +}
> +
> +static int hvsock_send_data(struct vmbus_channel *channel,
> +			    struct hvsock_sock *hvsk,
> +			    size_t to_write)
> +{
> +	hvsk->send.hdr.pkt_type = 1;
> +	hvsk->send.hdr.data_size = to_write;
> +	return vmbus_sendpacket(channel, &hvsk->send.hdr,
> +				sizeof(hvsk->send.hdr) + to_write,
> +				0, VM_PKT_DATA_INBAND, 0);
> +}
> +
> +static int hvsock_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
> +{
CAA Again, compartmentalize this a bit if possible. Example: net/llc/af_llc.c
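As one possible split, the chunking part of the send loop could live in its own helper. The following is a userspace model only: DEMO_SND_BUF_SZ assumes a 4 KiB HVSOCK_SND_BUF_SZ, the callback stands in for the memcpy_from_msg() plus hvsock_send_data() pair, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stands in for HVSOCK_SND_BUF_SZ (PAGE_SIZE); 4 KiB is an assumption. */
#define DEMO_SND_BUF_SZ 4096u

/* Sends one chunk of len bytes; returns 0 or a negative error. */
typedef int (*demo_send_fn)(void *ctx, size_t len);

/* Hand total bytes to send_one in chunks of at most DEMO_SND_BUF_SZ.
 * Returns the number of bytes sent; *err holds the first error, or 0.
 */
size_t demo_send_chunks(size_t total, demo_send_fn send_one, void *ctx,
			int *err)
{
	size_t written = 0;

	assert(send_one && err);
	*err = 0;

	while (written < total) {
		size_t n = total - written;

		if (n > DEMO_SND_BUF_SZ)
			n = DEMO_SND_BUF_SZ;

		*err = send_one(ctx, n);
		if (*err)
			break;

		written += n;
	}

	return written;
}

/* Example callback: counts packets and fails once fail_after is reached. */
struct demo_counter {
	unsigned int packets;
	unsigned int fail_after;
};

int demo_count_packet(void *ctx, size_t len)
{
	struct demo_counter *c = ctx;

	(void)len;
	if (c->packets == c->fail_after)
		return -1;

	c->packets++;
	return 0;
}
```

Returning the byte count and the error separately keeps the "partial write wins over the error code" decision in the caller, which is what the out_wait label does in the patch.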
> +	struct vmbus_channel *channel;
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +
> +	size_t total_to_write = len;
> +	size_t total_written = 0;
> +
> +	bool can_write;
> +	long timeout;
> +	int ret = 0;
> +
> +	DEFINE_WAIT(wait);
> +
> +	if (len == 0)
> +		return -EINVAL;
> +
> +	if (msg->msg_flags & ~MSG_DONTWAIT) {
> +		pr_err("%s: unsupported flags=0x%x\n", __func__,
> +		       msg->msg_flags);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	sk = sock->sk;
> +	hvsk = sk_to_hvsock(sk);
> +	channel = hvsk->channel;
> +
> +	lock_sock(sk);
> +
> +	/* Callers should not provide a destination with stream sockets. */
> +	if (msg->msg_namelen) {
> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
> +	/* Send data only if both sides are not shutdown in the direction. */
> +	if (sk->sk_shutdown & SEND_SHUTDOWN ||
> +	    hvsk->peer_shutdown & RCV_SHUTDOWN) {
> +		ret = -EPIPE;
> +		goto out;
> +	}
> +
> +	if (sk->sk_state != SS_CONNECTED ||
> +	    !hvsock_addr_bound(&hvsk->local_addr)) {
> +		ret = -ENOTCONN;
> +		goto out;
> +	}
> +
> +	if (!hvsock_addr_bound(&hvsk->remote_addr)) {
> +		ret = -EDESTADDRREQ;
> +		goto out;
> +	}
> +
> +	timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> +
> +	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> +
> +	while (total_to_write > 0) {
> +		size_t to_write;
> +
> +		while (1) {
> +			get_ringbuffer_rw_status(channel, NULL, &can_write);
> +
> +			if (can_write || sk->sk_err != 0 ||
> +			    (sk->sk_shutdown & SEND_SHUTDOWN) ||
> +			    (hvsk->peer_shutdown & RCV_SHUTDOWN))
> +				break;
> +
> +			/* Don't wait for non-blocking sockets. */
> +			if (timeout == 0) {
> +				ret = -EAGAIN;
> +				goto out_wait;
> +			}
> +
> +			release_sock(sk);
> +
> +			timeout = schedule_timeout(timeout);
> +
> +			lock_sock(sk);
> +			if (signal_pending(current)) {
> +				ret = sock_intr_errno(timeout);
> +				goto out_wait;
> +			} else if (timeout == 0) {
> +				ret = -EAGAIN;
> +				goto out_wait;
> +			}
> +
> +			prepare_to_wait(sk_sleep(sk), &wait,
> +					TASK_INTERRUPTIBLE);
> +		}
> +
> +		/* These checks occur both as part of and after the loop
> +		 * conditional since we need to check before and after
> +		 * sleeping.
> +		 */
> +		if (sk->sk_err) {
> +			ret = -sk->sk_err;
> +			goto out_wait;
> +		} else if ((sk->sk_shutdown & SEND_SHUTDOWN) ||
> +			   (hvsk->peer_shutdown & RCV_SHUTDOWN)) {
> +			ret = -EPIPE;
> +			goto out_wait;
> +		}
> +
> +		/* Note that write will only write as many bytes as possible
> +		 * in the ringbuffer. It is the caller's responsibility to
> +		 * check how many bytes we actually wrote.
> +		 */
> +		do {
> +			to_write = min_t(size_t, HVSOCK_SND_BUF_SZ,
> +					 total_to_write);
> +			ret = memcpy_from_msg(hvsk->send.buf, msg, to_write);
> +			if (ret != 0)
> +				goto out_wait;
> +
> +			ret = hvsock_send_data(channel, hvsk, to_write);
> +			if (ret != 0)
> +				goto out_wait;
> +
> +			total_written += to_write;
> +			total_to_write -= to_write;
> +		} while (total_to_write > 0);
> +	}
> +out_wait:
> +	if (total_written > 0)
> +		ret = total_written;
> +
> +	finish_wait(sk_sleep(sk), &wait);
> +out:
> +	release_sock(sk);
> +
> +	/* ret is a bigger-than-0 total_written or a negative err code. */
> +	if (ret == 0) {
> +		WARN(1, "unexpected return value of 0\n");
> +		ret = -EIO;
> +	}
> +
> +	return ret;
> +}
> +
> +static int hvsock_recv_data(struct vmbus_channel *channel,
> +			    struct hvsock_sock *hvsk,
> +			    size_t *payload_len)
> +{
> +	u32 buffer_actual_len;
> +	u64 dummy_req_id;
> +	int ret;
> +
> +	ret = vmbus_recvpacket(channel, &hvsk->recv.hdr,
> +			       sizeof(hvsk->recv.hdr) + sizeof(hvsk->recv.buf),
> +			       &buffer_actual_len, &dummy_req_id);
> +	if (ret != 0 || buffer_actual_len <= sizeof(hvsk->recv.hdr))
> +		*payload_len = 0;
> +	else
> +		*payload_len = hvsk->recv.hdr.data_size;
> +
> +	return ret;
> +}
> +
> +static int hvsock_recvmsg(struct socket *sock, struct msghdr *msg,
> +			  size_t len, int flags)
> +{
> +	struct vmbus_channel *channel;
> +	struct hvsock_sock *hvsk;
> +	struct sock *sk;
> +
> +	size_t total_to_read = len;
> +	size_t copied;
> +
> +	bool can_read;
> +	long timeout;
> +
> +	int ret = 0;
> +
> +	DEFINE_WAIT(wait);
> +
> +	sk = sock->sk;
> +	hvsk = sk_to_hvsock(sk);
> +	channel = hvsk->channel;
> +
> +	lock_sock(sk);
> +
> +	if (sk->sk_state != SS_CONNECTED) {
> +		/* Recvmsg is supposed to return 0 if a peer performs an
> +		 * orderly shutdown. Differentiate between that case and when a
> +		 * peer has not connected or a local shutdown occurred with the
> +		 * SOCK_DONE flag.
> +		 */
> +		if (sock_flag(sk, SOCK_DONE))
> +			ret = 0;
> +		else
> +			ret = -ENOTCONN;
> +
> +		goto out;
> +	}
> +
> +	/* We ignore msg->addr_name/len. */
> +	if (flags & ~MSG_DONTWAIT) {
> +		pr_err("%s: unsupported flags=0x%x\n", __func__, flags);
> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
> +	/* We don't check peer_shutdown flag here since peer may actually shut
> +	 * down, but there can be data in the queue that a local socket can
> +	 * receive.
> +	 */
> +	if (sk->sk_shutdown & RCV_SHUTDOWN) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* It is valid on Linux to pass in a zero-length receive buffer.  This
> +	 * is not an error.  We may as well bail out now.
> +	 */
> +	if (!len) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
> +	copied = 0;
> +
> +	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
> +
> +	while (1) {
> +		bool need_refill = hvsk->recv.data_len == 0;
> +
> +		if (need_refill)
> +			get_ringbuffer_rw_status(channel, &can_read, NULL);
> +		else
> +			can_read = true;
> +
> +		if (can_read) {
> +			size_t payload_len;
> +
> +			if (need_refill) {
> +				ret = hvsock_recv_data(channel, hvsk,
> +						       &payload_len);
> +				if (ret != 0 || payload_len == 0 ||
> +				    payload_len > HVSOCK_RCV_BUF_SZ) {
> +					ret = -EIO;
> +					goto out_wait;
> +				}
> +
> +				hvsk->recv.data_len = payload_len;
> +				hvsk->recv.data_offset = 0;
> +			}
> +
> +			if (hvsk->recv.data_len <= total_to_read) {
> +				ret = memcpy_to_msg(msg, hvsk->recv.buf +
> +						    hvsk->recv.data_offset,
> +						    hvsk->recv.data_len);
> +				if (ret != 0)
> +					break;
> +
> +				copied += hvsk->recv.data_len;
> +				total_to_read -= hvsk->recv.data_len;
> +				hvsk->recv.data_len = 0;
> +				hvsk->recv.data_offset = 0;
> +
> +				if (total_to_read == 0)
> +					break;
> +			} else {
> +				ret = memcpy_to_msg(msg, hvsk->recv.buf +
> +						    hvsk->recv.data_offset,
> +						    total_to_read);
> +				if (ret != 0)
> +					break;
> +
> +				copied += total_to_read;
> +				hvsk->recv.data_len -= total_to_read;
> +				hvsk->recv.data_offset += total_to_read;
> +				total_to_read = 0;
> +				break;
> +			}
> +		} else {
> +			if (sk->sk_err || (sk->sk_shutdown & RCV_SHUTDOWN) ||
> +			    (hvsk->peer_shutdown & SEND_SHUTDOWN))
> +				break;
> +
> +			/* Don't wait for non-blocking sockets. */
> +			if (timeout == 0) {
> +				ret = -EAGAIN;
> +				break;
> +			}
> +
> +			if (copied > 0)
> +				break;
> +
> +			release_sock(sk);
> +			timeout = schedule_timeout(timeout);
> +			lock_sock(sk);
> +
> +			if (signal_pending(current)) {
> +				ret = sock_intr_errno(timeout);
> +				break;
> +			} else if (timeout == 0) {
> +				ret = -EAGAIN;
> +				break;
> +			}
> +
> +			prepare_to_wait(sk_sleep(sk), &wait,
> +					TASK_INTERRUPTIBLE);
> +		}
> +	}
> +
> +	if (sk->sk_err)
> +		ret = -sk->sk_err;
> +	else if (sk->sk_shutdown & RCV_SHUTDOWN)
> +		ret = 0;
> +
> +	if (copied > 0) {
> +		ret = copied;
> +
> +		/* If the other side has shutdown for sending and there
> +		 * is nothing more to read, then we modify the socket
> +		 * state.
> +		 */
> +		if ((hvsk->peer_shutdown & SEND_SHUTDOWN) &&
> +		    hvsk->recv.data_len == 0) {
> +			get_ringbuffer_rw_status(channel, &can_read, NULL);
> +			if (!can_read) {
> +				sk->sk_state = SS_UNCONNECTED;
> +				sock_set_flag(sk, SOCK_DONE);
> +				sk->sk_state_change(sk);
> +			}
> +		}
> +	}
> +out_wait:
> +	finish_wait(sk_sleep(sk), &wait);
> +out:
> +	release_sock(sk);
> +	return ret;
> +}
> +
> +static const struct proto_ops hvsock_ops = {
> +	.family = PF_HYPERV,
> +	.owner = THIS_MODULE,
> +	.release = hvsock_release,
> +	.bind = hvsock_bind,
> +	.connect = hvsock_connect,
> +	.socketpair = sock_no_socketpair,
> +	.accept = hvsock_accept,
> +	.getname = hvsock_getname,
> +	.poll = hvsock_poll,
> +	.ioctl = sock_no_ioctl,
> +	.listen = hvsock_listen,
> +	.shutdown = hvsock_shutdown,
> +	.setsockopt = hvsock_setsockopt,
> +	.getsockopt = hvsock_getsockopt,
> +	.sendmsg = hvsock_sendmsg,
> +	.recvmsg = hvsock_recvmsg,
> +	.mmap = sock_no_mmap,
> +	.sendpage = sock_no_sendpage,
> +};
> +
> +static int hvsock_create(struct net *net, struct socket *sock,
> +			 int protocol, int kern)
> +{
> +	if (!capable(CAP_SYS_ADMIN) && !capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	if (protocol != 0 && protocol != SHV_PROTO_RAW)
> +		return -EPROTONOSUPPORT;
> +
> +	switch (sock->type) {
> +	case SOCK_STREAM:
> +		sock->ops = &hvsock_ops;
> +		break;
> +	default:
> +		return -ESOCKTNOSUPPORT;
> +	}
> +
> +	sock->state = SS_UNCONNECTED;
> +
> +	return __hvsock_create(net, sock, GFP_KERNEL, 0) ? 0 : -ENOMEM;
> +}
> +
> +static const struct net_proto_family hvsock_family_ops = {
> +	.family = AF_HYPERV,
> +	.create = hvsock_create,
> +	.owner = THIS_MODULE,
> +};
> +
> +static int hvsock_probe(struct hv_device *hdev,
> +			const struct hv_vmbus_device_id *dev_id)
> +{
> +	struct vmbus_channel *channel = hdev->channel;
> +
> +	/* We ignore the error return code to suppress the unnecessary
> +	 * error message in vmbus_probe(): on error the host will rescind
> +	 * the offer in 30 seconds and we can do cleanup at that time.
> +	 */
> +	(void)hvsock_open_connection(channel);
> +
> +	return 0;
> +}
> +
> +static int hvsock_remove(struct hv_device *hdev)
> +{
> +	struct vmbus_channel *channel = hdev->channel;
> +
> +	vmbus_close(channel);
> +
> +	return 0;
> +}
> +
> +/* It's not really used. See vmbus_match() and vmbus_probe(). */
> +static const struct hv_vmbus_device_id id_table[] = {
> +	{},
> +};
> +
> +static struct hv_driver hvsock_drv = {
> +	.name		= "hv_sock",
> +	.hvsock		= true,
> +	.id_table	= id_table,
> +	.probe		= hvsock_probe,
> +	.remove		= hvsock_remove,
> +};
> +
> +static int __init hvsock_init(void)
> +{
> +	int ret;
> +
> +	/* Hyper-V Sockets requires at least VMBus 4.0 */
> +	if ((vmbus_proto_version >> 16) < 4) {
> +		pr_err("failed to load: VMBus 4 or later is required\n");
> +		return -ENODEV;
> +	}
> +
> +	ret = vmbus_driver_register(&hvsock_drv);
> +	if (ret) {
> +		pr_err("failed to register hv_sock driver\n");
> +		return ret;
> +	}
> +
> +	ret = proto_register(&hvsock_proto, 0);
> +	if (ret) {
> +		pr_err("failed to register protocol\n");
> +		goto unreg_hvsock_drv;
> +	}
> +
> +	ret = sock_register(&hvsock_family_ops);
> +	if (ret) {
> +		pr_err("failed to register address family\n");
> +		goto unreg_proto;
> +	}
> +
> +	return 0;
> +
> +unreg_proto:
> +	proto_unregister(&hvsock_proto);
> +unreg_hvsock_drv:
> +	vmbus_driver_unregister(&hvsock_drv);
> +	return ret;
> +}
> +
> +static void __exit hvsock_exit(void)
> +{
> +	sock_unregister(AF_HYPERV);
> +	proto_unregister(&hvsock_proto);
> +	vmbus_driver_unregister(&hvsock_drv);
> +}
> +
> +module_init(hvsock_init);
> +module_exit(hvsock_exit);
> +
> +MODULE_DESCRIPTION("Hyper-V Sockets");
> +MODULE_LICENSE("Dual BSD/GPL");
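
One way to read the "compartmentalize" comment on hvsock_sendmsg() is to move the chunking loop into its own helper. What follows is only a userspace sketch of that idea, not kernel code: send_chunked(), chunk_send_fn and the toy sink are hypothetical names invented for illustration, and the callback merely stands in for hvsock_send_data(). It keeps the patch's return convention, where a partial write takes precedence over the error that stopped it.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback standing in for hvsock_send_data(): returns 0 on
 * success or a negative errno value.
 */
typedef int (*chunk_send_fn)(void *ctx, const void *buf, size_t len);

/* Copy 'len' bytes from 'src' through a bounce buffer of 'bufsz' bytes and
 * hand each chunk to 'send'. Returns the number of bytes sent if any
 * progress was made, otherwise a negative errno value.
 */
long send_chunked(const unsigned char *src, size_t len,
		  unsigned char *buf, size_t bufsz,
		  chunk_send_fn send, void *ctx)
{
	size_t written = 0;
	int ret = 0;

	if (len == 0 || bufsz == 0)
		return -EINVAL;

	while (written < len) {
		size_t n = len - written;

		if (n > bufsz)
			n = bufsz;

		memcpy(buf, src + written, n);
		ret = send(ctx, buf, n);
		if (ret != 0)
			break;

		written += n;
	}

	return written > 0 ? (long)written : ret;
}

/* Toy sink for demonstration: accepts 'calls_left' chunks, then fails. */
struct toy_sink {
	unsigned char data[64];
	size_t used;
	unsigned int calls_left;
};

int toy_sink_send(void *ctx, const void *buf, size_t len)
{
	struct toy_sink *s = ctx;

	if (s->calls_left == 0 || s->used + len > sizeof(s->data))
		return -EPIPE;

	memcpy(s->data + s->used, buf, len);
	s->used += len;
	s->calls_left--;
	return 0;
}
```

With a 4-byte bounce buffer and a sink that accepts two chunks, sending 10 bytes reports 8 bytes written rather than the -EPIPE that ended the loop.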


* Re: [PATCH net-next 3/8] fs/quota: use nla_put_u64_64bit()
From: David Miller @ 2016-04-26 16:24 UTC (permalink / raw)
  To: jack
  Cc: nicolas.dichtel, netdev, sd, johannes, kvalo, linux-wireless,
	jack, linux-kernel, pshelar, dev, jhs, philipp.reisner,
	lars.ellenberg, drbd-dev
In-Reply-To: <20160426110848.GD27612@quack2.suse.cz>

From: Jan Kara <jack@suse.cz>
Date: Tue, 26 Apr 2016 13:08:48 +0200

> On Tue 26-04-16 10:06:13, Nicolas Dichtel wrote:
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> 
> OK, so I somewhat miss a description of what will this do to the netlink
> message so that I can judge whether the change is fine for the userspace
> counterpart parsing these messages. AFAIU this changes the message format
> by adding a QUOTA_NL_A_PAD field before each 64-bit field which needs an
> alignment, am I guessing right? Thus when the userspace counterpart uses
> genlmsg_parse() it should just silently ignore these attributes if I read
> the documentation right. Did I understand this correctly?

All userspace components using netlink should always ignore attributes
they do not recognize in dumps.

This is one of the most basic principles of netlink.
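
To make the "ignore what you do not recognize" rule concrete, here is a minimal userspace sketch of an attribute walker. It is an illustration under assumptions, not the libnl or kernel parser: the header layout mirrors struct nlattr, but the DEMO_A_* attribute numbers, struct demo_msg and demo_parse() are hypothetical, and attribute flag bits are not handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct nlattr in <linux/netlink.h>, redeclared here so
 * the sketch builds without kernel headers.
 */
struct attr_hdr {
	uint16_t len;	/* header + payload, excluding trailing padding */
	uint16_t type;
};

#define ATTR_HDRLEN	4U
#define ATTR_ALIGN(n)	(((size_t)(n) + 3U) & ~(size_t)3U)

/* Hypothetical attribute numbers known to this (old) parser. */
enum { DEMO_A_UNSPEC, DEMO_A_ID, DEMO_A_LIMIT };

struct demo_msg {
	uint32_t id;
	uint64_t limit;
	unsigned int skipped;	/* attributes we did not recognize */
};

/* Walk the attribute stream; returns 0 on success, -1 if malformed. */
int demo_parse(const uint8_t *buf, size_t len, struct demo_msg *out)
{
	size_t off = 0;

	memset(out, 0, sizeof(*out));

	while (len - off >= ATTR_HDRLEN) {
		struct attr_hdr hdr;
		const uint8_t *payload;
		size_t plen;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < ATTR_HDRLEN || hdr.len > len - off)
			return -1;

		payload = buf + off + ATTR_HDRLEN;
		plen = hdr.len - ATTR_HDRLEN;

		switch (hdr.type) {
		case DEMO_A_ID:
			if (plen != sizeof(out->id))
				return -1;
			memcpy(&out->id, payload, plen);
			break;
		case DEMO_A_LIMIT:
			if (plen != sizeof(out->limit))
				return -1;
			memcpy(&out->limit, payload, plen);
			break;
		default:
			/* Unknown attribute (e.g. a new pad type): skip it. */
			out->skipped++;
			break;
		}

		/* Advance past the payload and its alignment padding. */
		off += ATTR_ALIGN(hdr.len);
		if (off > len)
			off = len;
	}

	return 0;
}
```

A parser written this way keeps working when a sender starts interleaving a padding attribute it has never heard of; only the skipped counter changes.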


* Re: [PATCH net-next 0/8] netlink: align attributes when needed (patchset #3)
From: David Miller @ 2016-04-26 16:25 UTC (permalink / raw)
  To: lars.ellenberg-63ez5xqkn6DQT0dZR+AlfA
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, sd-y1jBWg8GRStKuXlAQpz2QA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, jhs-jkUAjuhPggJWk0Htik3J/w,
	jack-IBi9RG/b67k, nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	philipp.reisner-63ez5xqkn6DQT0dZR+AlfA,
	johannes-cdvu00un1VgdHxzADdlk8Q, kvalo-sgV2jX0FEOL9JmXXK+q4OQ,
	drbd-dev-cunTk1MwBs8qoQakbn7OcQ
In-Reply-To: <20160426115427.GB20950-w1SgEEioFePxa46PmUWvFg@public.gmane.org>

From: Lars Ellenberg <lars.ellenberg@linbit.com>
Date: Tue, 26 Apr 2016 13:54:27 +0200

> On Tue, Apr 26, 2016 at 10:06:10AM +0200, Nicolas Dichtel wrote:
>> 
>> This is the continuation (series #3) of the work done to align netlink
>> attributes when these attributes contain some 64-bit fields.
>> 
>> It's the last patchset from what I've seen.
>> 
>> The last user of nla_put_u64() is block/drbd. This module does not use
>> standard netlink API (see all the stuff in include/linux/genl_magic_struct.h
>> and include/linux/genl_magic_func.h). I didn't modify it because it seems
>> hard to do it without testing and fully understanding the context
> 
> Something like this should just work.

Unfortunately we had problems using unspec, that's why an explicit new
padding attribute is added for each netlink attribute set.
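
For readers following along, the alignment idea behind an explicit padding attribute can be sketched in userspace as below. This is an assumption-laden illustration, not the kernel's nla_put_u64_64bit(): demo_put_u64_aligned() and put_hdr() are hypothetical helpers, offsets are taken relative to a buffer assumed to start on an 8-byte boundary, and the caller is assumed to pass an offset that is a multiple of 4 with at least 16 bytes of room.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ATTR_HDRLEN 4U	/* 16-bit length followed by 16-bit type */

/* Write one attribute header at 'off'; returns the payload offset. */
static size_t put_hdr(uint8_t *buf, size_t off, uint16_t type, uint16_t plen)
{
	uint16_t len = (uint16_t)(ATTR_HDRLEN + plen);

	memcpy(buf + off, &len, sizeof(len));
	memcpy(buf + off + sizeof(len), &type, sizeof(type));
	return off + ATTR_HDRLEN;
}

/* Append a u64 attribute so that its payload starts on an 8-byte offset.
 * If the payload would otherwise land on a 4-byte boundary, an empty
 * attribute of type 'padtype' is emitted first. Returns the new offset.
 */
size_t demo_put_u64_aligned(uint8_t *buf, size_t off, uint16_t type,
			    uint16_t padtype, uint64_t value)
{
	if ((off + ATTR_HDRLEN) % 8 != 0)
		off = put_hdr(buf, off, padtype, 0);

	off = put_hdr(buf, off, type, sizeof(value));
	memcpy(buf + off, &value, sizeof(value));
	return off + sizeof(value);
}
```

Passing the pad type as a parameter reflects the point above: each attribute set supplies its own padding attribute number instead of reusing the unspec type.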

* Re: [PATCH 1/6] bus: Add shared MDIO bus framework
From: David Miller @ 2016-04-26 16:26 UTC (permalink / raw)
  To: andrew-g2DYL2Zd6BY
  Cc: pramod.kumar-dY08KVG/lbpWk0Htik3J/w,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A, catalin.marinas-5wv7dgnIgG8,
	will.deacon-5wv7dgnIgG8, yamada.masahiro-uWyLwvC0a2jby3iVrkZq2A,
	wens-jdAy2FN1RRM, mark.rutland-5wv7dgnIgG8,
	devicetree-u79uwXL29TY76Z2rM5mHXA, pawel.moll-5wv7dgnIgG8,
	arnd-r2nGTMty4D4, suzuki.poulose-5wv7dgnIgG8,
	netdev-u79uwXL29TY76Z2rM5mHXA, punit.agrawal-5wv7dgnIgG8,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	anup.patel-dY08KVG/lbpWk0Htik3J/w
In-Reply-To: <20160426121335.GC11668-g2DYL2Zd6BY@public.gmane.org>

From: Andrew Lunn <andrew-g2DYL2Zd6BY@public.gmane.org>
Date: Tue, 26 Apr 2016 14:13:35 +0200

> On Tue, Apr 26, 2016 at 02:03:27PM +0530, Pramod Kumar wrote:
>> As you can see from above points, trying to re-use Linux Ethernet MDIO mux
>> framework for non-Ethernet PHYs is not the right way.
> 
> And as i pointed out, all your arguments are wrong, bar one. And i
> doubt that one argument is sufficient to duplicate a lot of code which
> already exists and does 95% of what you need.

+1

* [net-next PATCH V2 0/5] samples/bpf: Improve user experience
From: Jesper Dangaard Brouer @ 2016-04-26 16:27 UTC (permalink / raw)
  To: netdev
  Cc: linux-kbuild, bblanco, Jesper Dangaard Brouer, naveen.n.rao,
	borkmann, alexei.starovoitov

Getting started with the eBPF examples in samples/bpf/ involves a steep
learning curve.  There are several dependencies, and specific versions
of these dependencies are required.  Invoking make in the correct
manner is also slightly obscure.

This patchset cleans up, documents, and hopefully improves the
first-time user experience with the eBPF samples directory by
auto-detecting certain scenarios.

V2:
 - Adjusted recommended minimum version to 3.7.1
 - Included clang build instructions
 - New patch adding CLANG variable and validation of command

---

Jesper Dangaard Brouer (5):
      samples/bpf: add back functionality to redefine LLC command
      samples/bpf: Makefile verify LLVM compiler avail and bpf target is supported
      samples/bpf: add a README file to get users started
      samples/bpf: allow make to be run from samples/bpf/ directory
      samples/bpf: like LLC also verify and allow redefining CLANG command


 samples/bpf/Makefile   |   37 +++++++++++++++++++++-
 samples/bpf/README.rst |   80 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 115 insertions(+), 2 deletions(-)
 create mode 100644 samples/bpf/README.rst


* [net-next PATCH V2 1/5] samples/bpf: add back functionality to redefine LLC command
From: Jesper Dangaard Brouer @ 2016-04-26 16:27 UTC (permalink / raw)
  To: netdev
  Cc: linux-kbuild, bblanco, Jesper Dangaard Brouer, naveen.n.rao,
	borkmann, alexei.starovoitov
In-Reply-To: <20160426162650.22962.20516.stgit@firesoul>

It is practical to be able to redefine the location of the LLVM
command 'llc', because not all distros have an LLVM version with bpf
target support.  Thus, it is sometimes required to compile LLVM from
source, and sometimes it is not desired to overwrite the distro's
default LLVM version.

This feature was removed with 128d1514be35 ("samples/bpf: Use llc in
PATH, rather than a hardcoded value").

Add this feature back. Note that it is possible to redefine LLC
on the make command line like:

 make samples/bpf/ LLC=~/git/llvm/build/bin/llc

Fixes: 128d1514be35 ("samples/bpf: Use llc in PATH, rather than a hardcoded value")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/Makefile |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 744dd7a16144..5bae9536f100 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -81,10 +81,14 @@ HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 
+# Allows pointing LLC to a LLVM backend with bpf support, redefine on cmdline:
+#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc
+LLC ?= llc
+
 # asm/sysreg.h - inline assembly used by it is incompatible with llvm.
 # But, there is no easy way to fix it, so just exclude it since it is
 # useless for BPF samples.
 $(obj)/%.o: $(src)/%.c
 	clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
 		-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
-		-O2 -emit-llvm -c $< -o -| llc -march=bpf -filetype=obj -o $@
+		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@


* [net-next PATCH V2 2/5] samples/bpf: Makefile verify LLVM compiler avail and bpf target is supported
From: Jesper Dangaard Brouer @ 2016-04-26 16:27 UTC (permalink / raw)
  To: netdev
  Cc: linux-kbuild, bblanco, Jesper Dangaard Brouer, naveen.n.rao,
	borkmann, alexei.starovoitov
In-Reply-To: <20160426162650.22962.20516.stgit@firesoul>

Make compiling samples/bpf more user friendly by detecting whether the
LLVM compiler tool 'llc' is available, and whether the 'bpf' target
is available in this version of LLVM.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/Makefile |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 5bae9536f100..45859c99f573 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -85,6 +85,24 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc
 LLC ?= llc
 
+# Verify LLVM compiler is available and bpf target is supported
+.PHONY: verify_cmd_llc verify_target_bpf
+
+verify_cmd_llc:
+	@if ! (which "${LLC}" > /dev/null 2>&1); then \
+		echo "*** ERROR: Cannot find LLVM tool 'llc' (${LLC})" ;\
+		exit 1; \
+	else true; fi
+
+verify_target_bpf: verify_cmd_llc
+	@if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
+		echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
+		echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
+		exit 2; \
+	else true; fi
+
+$(src)/*.c: verify_target_bpf
+
 # asm/sysreg.h - inline assembly used by it is incompatible with llvm.
 # But, there is no easy way to fix it, so just exclude it since it is
 # useless for BPF samples.
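
As an aside for readers less familiar with what the 'which' test in verify_cmd_llc amounts to, the lookup can be approximated in C as below. This is a simplified sketch, not what 'which' or make actually run: demo_find_tool() and is_exec_file() are hypothetical helpers, the search list is passed in explicitly rather than read from the environment, and empty PATH entries (which a shell treats as the current directory) are skipped.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if 'path' names a regular file the caller may execute. */
static int is_exec_file(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 && S_ISREG(st.st_mode) &&
	       access(path, X_OK) == 0;
}

/* Return 1 and store the resolved location in 'out' if 'name' is found in
 * the colon-separated 'path' list, else return 0. A name containing '/'
 * is checked directly, as with LLC=~/git/llvm/build/bin/llc.
 */
int demo_find_tool(const char *name, const char *path,
		   char *out, size_t outlen)
{
	const char *p = path;

	if (outlen == 0)
		return 0;

	if (strchr(name, '/')) {
		int n = snprintf(out, outlen, "%s", name);

		return n > 0 && (size_t)n < outlen && is_exec_file(out);
	}

	while (p && *p) {
		size_t dlen = strcspn(p, ":");
		int n = snprintf(out, outlen, "%.*s/%s", (int)dlen, p, name);

		/* Skip empty entries and candidates that were truncated. */
		if (dlen > 0 && n > 0 && (size_t)n < outlen &&
		    is_exec_file(out))
			return 1;

		p += dlen;
		if (*p == ':')
			p++;
	}

	out[0] = '\0';
	return 0;
}
```

Note that this only covers the first check; the second target in the patch additionally runs the tool to confirm the 'bpf' target is registered.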


* [net-next PATCH V2 3/5] samples/bpf: add a README file to get users started
From: Jesper Dangaard Brouer @ 2016-04-26 16:27 UTC (permalink / raw)
  To: netdev
  Cc: linux-kbuild, bblanco, Jesper Dangaard Brouer, naveen.n.rao,
	borkmann, alexei.starovoitov
In-Reply-To: <20160426162650.22962.20516.stgit@firesoul>

Getting started with using examples in samples/bpf/ is not
straightforward.  There are several dependencies, and specific
versions of these dependencies.

Just compiling the example tool is also slightly obscure, e.g. one
needs to call make like:

 make samples/bpf/

Do notice the "/" slash after the directory name.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/README.rst |   77 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 samples/bpf/README.rst

diff --git a/samples/bpf/README.rst b/samples/bpf/README.rst
new file mode 100644
index 000000000000..c7ccf553af0d
--- /dev/null
+++ b/samples/bpf/README.rst
@@ -0,0 +1,77 @@
+eBPF sample programs
+====================
+
+This kernel samples/bpf directory contains a mini eBPF library, test
+stubs, verifier test-suite and examples for using eBPF.
+
+Build dependencies
+==================
+
+Compiling requires having installed:
+ * clang
+ * llvm >= version 3.7.1
+
+Note that LLVM's tool 'llc' must support target 'bpf', list with command::
+
+ $ llc --version
+ LLVM (http://llvm.org/):
+  LLVM version 3.x.y
+  [...]
+  Host CPU: xxx
+
+  Registered Targets:
+    [...]
+    bpf        - BPF (host endian)
+    bpfeb      - BPF (big endian)
+    bpfel      - BPF (little endian)
+    [...]
+
+Kernel headers
+--------------
+
+There are usually dependencies to header files of the current kernel.
+To avoid installing devel kernel headers system wide, as a normal
+user, simply call::
+
+ make headers_install
+
+This will create a local "usr/include" directory in the git/build top
+level directory, which the make system automatically picks up first.
+
+Compiling
+=========
+
+For compiling goto kernel top level build directory and run make like::
+
+ make samples/bpf/
+
+Do notice the "/" slash after the directory name.
+
+Manually compiling LLVM with 'bpf' support
+------------------------------------------
+
+In some LLVM versions the BPF target was marked experimental. Those
+versions needed 'cmake .. -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=BPF'.  Since
+version 3.7.1, LLVM adds a proper LLVM backend target for the BPF
+bytecode architecture.
+
+By default llvm will build all non-experimental backends including bpf.
+To generate a smaller llc binary one can use::
+
+ -DLLVM_TARGETS_TO_BUILD="BPF;X86"
+
+Quick sniplet for manually compiling LLVM and clang
+(build dependencies are cmake and gcc-c++)::
+
+ $ git clone http://llvm.org/git/llvm.git
+ $ cd llvm/tools
+ $ git clone --depth 1 http://llvm.org/git/clang.git
+ $ cd ..; mkdir build; cd build
+ $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86"
+ $ make -j $(getconf _NPROCESSORS_ONLN)
+
+It is also possible to point make to the newly compile 'llc' command
+via redefining LLC on the make command line::
+
+ make samples/bpf/ LLC=~/git/llvm/build/bin/llc
+



* [net-next PATCH V2 4/5] samples/bpf: allow make to be run from samples/bpf/ directory
From: Jesper Dangaard Brouer @ 2016-04-26 16:27 UTC (permalink / raw)
  To: netdev
  Cc: linux-kbuild, bblanco, Jesper Dangaard Brouer, naveen.n.rao,
	borkmann, alexei.starovoitov
In-Reply-To: <20160426162650.22962.20516.stgit@firesoul>

It is not intuitive that 'make' must be run from the top level
directory with argument "samples/bpf/" to compile these eBPF samples.

Introduce a kbuild makefile trick that allows make to be run from the
"samples/bpf/" directory itself.  It basically changes to the top level
directory and calls "make samples/bpf/" with the "/" slash after the
directory name.

Also add a clean target that only cleans this directory, by taking
advantage of the kbuild external module setting M=$PWD.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/Makefile   |    8 ++++++++
 samples/bpf/README.rst |    3 +++
 2 files changed, 11 insertions(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 45859c99f573..dd63521832d8 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -85,6 +85,14 @@ HOSTLOADLIBES_test_overhead += -lelf -lrt
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc
 LLC ?= llc
 
+# Trick to allow make to be run from this directory
+all:
+	$(MAKE) -C ../../ $$PWD/
+
+clean:
+	$(MAKE) -C ../../ M=$$PWD clean
+	@rm -f *~
+
 # Verify LLVM compiler is available and bpf target is supported
 .PHONY: verify_cmd_llc verify_target_bpf
 
diff --git a/samples/bpf/README.rst b/samples/bpf/README.rst
index c7ccf553af0d..1ec4b08a7b40 100644
--- a/samples/bpf/README.rst
+++ b/samples/bpf/README.rst
@@ -47,6 +47,9 @@ For compiling goto kernel top level build directory and run make like::
 
 Do notice the "/" slash after the directory name.
 
+It is also possible to call make from this directory.  This will just
+hide the invocation of make as above with the appended "/".
+
 Manually compiling LLVM with 'bpf' support
 ------------------------------------------
 



* [net-next PATCH V2 5/5] samples/bpf: like LLC also verify and allow redefining CLANG command
From: Jesper Dangaard Brouer @ 2016-04-26 16:27 UTC (permalink / raw)
  To: netdev
  Cc: linux-kbuild, bblanco, Jesper Dangaard Brouer, naveen.n.rao,
	borkmann, alexei.starovoitov
In-Reply-To: <20160426162650.22962.20516.stgit@firesoul>

Users are likely to manually compile both the LLVM 'llc' and 'clang'
tools.  Thus, also allow redefining CLANG and verify that the commands exist.

Implementation-wise, the Makefile target that verifies the commands has
been generalized.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/Makefile   |   23 +++++++++++++----------
 samples/bpf/README.rst |    6 +++---
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index dd63521832d8..c02ea9d2a248 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -81,9 +81,10 @@ HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 
-# Allows pointing LLC to a LLVM backend with bpf support, redefine on cmdline:
-#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc
+# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
+#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
 LLC ?= llc
+CLANG ?= clang
 
 # Trick to allow make to be run from this directory
 all:
@@ -94,15 +95,17 @@ clean:
 	@rm -f *~
 
 # Verify LLVM compiler is available and bpf target is supported
-.PHONY: verify_cmd_llc verify_target_bpf
+.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
 
-verify_cmd_llc:
-	@if ! (which "${LLC}" > /dev/null 2>&1); then \
-		echo "*** ERROR: Cannot find LLVM tool 'llc' (${LLC})" ;\
-		exit 1; \
-	else true; fi
+verify_cmds: $(CLANG) $(LLC)
+	@for TOOL in $^ ; do \
+		if ! (which "$${TOOL}" > /dev/null 2>&1); then \
+			echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
+			exit 1; \
+		else true; fi; \
+	done
 
-verify_target_bpf: verify_cmd_llc
+verify_target_bpf: verify_cmds
 	@if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
 		echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
 		echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
@@ -115,6 +118,6 @@ $(src)/*.c: verify_target_bpf
 # But, there is no easy way to fix it, so just exclude it since it is
 # useless for BPF samples.
 $(obj)/%.o: $(src)/%.c
-	clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
+	$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
 		-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
 		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
diff --git a/samples/bpf/README.rst b/samples/bpf/README.rst
index 1ec4b08a7b40..74897dbe6458 100644
--- a/samples/bpf/README.rst
+++ b/samples/bpf/README.rst
@@ -73,8 +73,8 @@ Quick sniplet for manually compiling LLVM and clang
  $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86"
  $ make -j $(getconf _NPROCESSORS_ONLN)
 
-It is also possible to point make to the newly compile 'llc' command
-via redefining LLC on the make command line::
+It is also possible to point make to the newly compile 'llc' or
+'clang' command via redefining LLC or CLANG on the make command line::
 
- make samples/bpf/ LLC=~/git/llvm/build/bin/llc
+ make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
 
