Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH V2 3/3] net-next: dsa: add new driver for qca8xxx family
From: John Crispin @ 2016-09-14 15:47 UTC (permalink / raw)
  To: David S. Miller, Andrew Lunn, Florian Fainelli
  Cc: netdev, linux-kernel, qsdk-review
In-Reply-To: <1473849542-3298-4-git-send-email-john@phrozen.org>



On 14/09/2016 12:39, John Crispin wrote:
> +static void
> +qca8k_fdb_add(struct dsa_switch *ds, int port,
> +	      const struct switchdev_obj_port_fdb *fdb,
> +	      struct switchdev_trans *trans)
> +{
> +	struct qca8k_priv *priv = qca8k_to_priv(ds);
> +	u16 port_mask = BIT(port);
> +	u16 vid = fdb->vid;
> +
> +	if (!vid)
> +		vid = 1;
> +
> +	qca8k_fdb_write(priv, vid, port_mask, fdb->addr,
> +			QCA8K_ATU_STATUS_STATIC);
> +
> +	qca8k_fdb_access(priv, QCA8K_FDB_LOAD, -1);
> +}
> +
> +static int
> +qca8k_fdb_del(struct dsa_switch *ds, int port,
> +	      const struct switchdev_obj_port_fdb *fdb)
> +{
> +	struct qca8k_priv *priv = qca8k_to_priv(ds);
> +	u16 port_mask = BIT(port);
> +
> +	qca8k_fdb_write(priv, fdb->vid, port_mask, fdb->addr, 0);
> +
> +	return qca8k_fdb_access(priv, QCA8K_FDB_PURGE, -1);
> +}
> +

while adding MDB support i noticed that

1) the code above is still not properly locked
2) i can unify the FDB and MDB code

i will send a V3 with those 2 changes tomorrow.

	John

^ permalink raw reply

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
From: Alexei Starovoitov @ 2016-09-14 15:55 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Daniel Mack, Pablo Neira Ayuso, htejun-b10kYP2dOMg,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <57D937B9.2090100-FeC+5ew28dpmcu3hnIyYJQ@public.gmane.org>

On Wed, Sep 14, 2016 at 01:42:49PM +0200, Daniel Borkmann wrote:
> >As I said, I'm open to discussing that. In order to make it work for L3,
> >the LL_OFF issues need to be solved, as Daniel explained. Daniel,
> >Alexei, any idea how much work that would be?
> 
> Not much. You simply need to declare your own struct bpf_verifier_ops
> with a get_func_proto() handler that handles BPF_FUNC_skb_load_bytes,
> and verifier in do_check() loop would need to handle that these ld_abs/
> ld_ind are rejected for BPF_PROG_TYPE_CGROUP_SOCKET.

yep. that part is solvable.
I'm still torn between l2 and l3.
On one side it sux to lose l2 information. yet we don't have a use case
to look into l2 for our container monitoring, so the only thing
lack of l2 will do is confuse byte accounting, since instead of
skb->len, we'd need to do skb->len + ETH_HLEN...
but I guess vlan handling messes it up as well.
On the other side doing it at socket level we can drop these checks:
+       if (!sk || !sk_fullsock(sk))
+               return 0;
+
+       if (sk->sk_family != AF_INET &&
+           sk->sk_family != AF_INET6)
+               return 0;
which will make it even faster when it's on.
So I don't mind either l2 and l3. I guess if l3 approach will prove
to be limiting, we can add l2 later?

^ permalink raw reply

* RE: [PATCH v4 01/16] vmxnet3: Move PCI Id to pci_ids.h
From: Adit Ranadive @ 2016-09-14 16:00 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, pv-drivers,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jorgen S. Hansen, Aditya Sarwade, George Zhang, Bryan Tan
In-Reply-To: <20160914110854.GA15800-Hxa29pjIrETlQW142y8m19+IiqhCXseY@public.gmane.org>

On Wed, Sep 14, 2016 at 04:09:12 -0700, Yuval Shaia wrote:
> Please update vmxnet3_drv.c accordingly.

Any reason why? I don't think we need to. Vmxnet3 should just pick up the moved
PCI device id from pci_ids.h file.

- Adit
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net v1] fib_rules: interface group matching
From: Vincent Bernat @ 2016-09-14 16:01 UTC (permalink / raw)
  To: David Ahern; +Cc: David S. Miller, Nicolas Dichtel, Wilson Kok, netdev
In-Reply-To: <c4f00f4e-aad3-f3d4-3c7c-8fe39c8f1720@cumulusnetworks.com>

 ❦ 14 septembre 2016 17:25 CEST, David Ahern <dsa@cumulusnetworks.com> :

>> I could just give more time to VRF. I also had some concerns over
>> performance with the way Netfilter integration is done, but I understand
>> that I could just stay away from POSTROUTING rules which is the only
>> hook executed twice?

> With the changes that were committed this past weekend, the VRF code
> is now setup where I can set a flag on a per VRF basis to disable the
> extra rx and tx processing - ie., no network taps, no netfilter, no
> qdisc, etc. Drops the overhead of VRF to ~3% maybe a bit less. I need
> to think about the user api a bit more and formalize the patch. Given
> my other commitments that probably won't happen until mid-October. But
> in terms of a building block, the overhead of VRF is continuing to
> drop.

Fine by me. We can drop my patch.

Thanks!
-- 
Program defensively.
            - The Elements of Programming Style (Kernighan & Plauger)

^ permalink raw reply

* [PATCH 1/2] net/ibm/emac: add set mac addr callback
From: Ivan Mikhaylov @ 2016-09-14 16:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev

add realization for mac address set and remove dummy callback.

Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
---
 drivers/net/ethernet/ibm/emac/core.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index 4c9771d..2dfc603 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -980,6 +980,33 @@ static void emac_set_multicast_list(struct net_device *ndev)
 	__emac_set_multicast_list(dev);
 }
 
+static int emac_set_mac_address(struct net_device *ndev, void *sa)
+{
+	struct emac_instance *dev = netdev_priv(ndev);
+	struct sockaddr *addr = sa;
+	struct emac_regs __iomem *p = dev->emacp;
+
+	if (!is_valid_ether_addr(addr->sa_data))
+	       return -EADDRNOTAVAIL;
+
+	mutex_lock(&dev->link_lock);
+
+	memcpy(ndev->dev_addr, addr->sa_data, ndev->addr_len);
+
+	emac_rx_disable(dev);
+	emac_tx_disable(dev);
+	out_be32(&p->iahr, (ndev->dev_addr[0] << 8) | ndev->dev_addr[1]);
+	out_be32(&p->ialr, (ndev->dev_addr[2] << 24) |
+		(ndev->dev_addr[3] << 16) | (ndev->dev_addr[4] << 8) |
+		ndev->dev_addr[5]);
+	emac_tx_enable(dev);
+	emac_rx_enable(dev);
+
+	mutex_unlock(&dev->link_lock);
+
+	return 0;
+}
+
 static int emac_resize_rx_ring(struct emac_instance *dev, int new_mtu)
 {
 	int rx_sync_size = emac_rx_sync_size(new_mtu);
@@ -2686,7 +2713,7 @@ static const struct net_device_ops emac_netdev_ops = {
 	.ndo_do_ioctl		= emac_ioctl,
 	.ndo_tx_timeout		= emac_tx_timeout,
 	.ndo_validate_addr	= eth_validate_addr,
-	.ndo_set_mac_address	= eth_mac_addr,
+	.ndo_set_mac_address	= emac_set_mac_address,
 	.ndo_start_xmit		= emac_start_xmit,
 	.ndo_change_mtu		= eth_change_mtu,
 };
@@ -2699,7 +2726,7 @@ static const struct net_device_ops emac_gige_netdev_ops = {
 	.ndo_do_ioctl		= emac_ioctl,
 	.ndo_tx_timeout		= emac_tx_timeout,
 	.ndo_validate_addr	= eth_validate_addr,
-	.ndo_set_mac_address	= eth_mac_addr,
+	.ndo_set_mac_address	= emac_set_mac_address,
 	.ndo_start_xmit		= emac_start_xmit_sg,
 	.ndo_change_mtu		= emac_change_mtu,
 };
-- 
2.7.4 (Apple Git-66)

^ permalink raw reply related

* [PATCH 2/2] net/ibm/emac: add mutex to 'set multicast list'
From: Ivan Mikhaylov @ 2016-09-14 16:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev

for preventing race conditions within ioctl calls.

Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
---
 drivers/net/ethernet/ibm/emac/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index 2dfc603..7af09cb 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -977,7 +977,10 @@ static void emac_set_multicast_list(struct net_device *ndev)
 		dev->mcast_pending = 1;
 		return;
 	}
+
+	mutex_lock(&dev->link_lock);
 	__emac_set_multicast_list(dev);
+	mutex_unlock(&dev->link_lock);
 }
 
 static int emac_set_mac_address(struct net_device *ndev, void *sa)
-- 
2.7.4 (Apple Git-66)

^ permalink raw reply related

* Re: [PATCH] net/mlx4_en: fix off by one in error handling
From: Sebastian Ott @ 2016-09-14 16:08 UTC (permalink / raw)
  To: Tariq Toukan; +Cc: Yishai Hadas, Tariq Toukan, netdev, linux-rdma, linux-kernel
In-Reply-To: <b2197c51-57a9-ab01-e940-9850a39484d6@gmail.com>

On Wed, 14 Sep 2016, Tariq Toukan wrote:
> On 14/09/2016 4:53 PM, Sebastian Ott wrote:
> > On Wed, 14 Sep 2016, Tariq Toukan wrote:
> > > On 14/09/2016 2:09 PM, Sebastian Ott wrote:
> > > > If an error occurs in mlx4_init_eq_table the index used in the
> > > > err_out_unmap label is one too big which results in a panic in
> > > > mlx4_free_eq. This patch fixes the index in the error path.
> > > You are right, but your change below does not cover all cases.
> > > The full solution looks like this:
> > >
> > > @@ -1260,7 +1260,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
> > >                                               eq);
> > >                  }
> > >                  if (err)
> > > -                       goto err_out_unmap;
> > > +                       goto err_out_unmap_excluded;
> > In this case a call to mlx4_create_eq failed. Do you really have to call
> > mlx4_free_eq for this index again?
>
> We agree on this part, that's why here we should goto the _excluded_ label.
> For all other parts, we should not exclude the eq in the highest index, and
> thus we goto the _non_excluded_ label.

But that's exactly what the original patch does. If the failure is within
the for loop at index i, we do the cleanup starting at index i-1. If the
failure is after the for loop then i == dev->caps.num_comp_vectors + 1
and we do the cleanup starting at index i == dev->caps.num_comp_vectors.

In the latter case your patch would have an out of bounds array access.

Regards,
Sebastian

^ permalink raw reply

* Re: [PATCH v4 01/16] vmxnet3: Move PCI Id to pci_ids.h
From: Yuval Shaia @ 2016-09-14 16:24 UTC (permalink / raw)
  To: Adit Ranadive
  Cc: dledford@redhat.com, linux-rdma@vger.kernel.org, pv-drivers,
	netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	Jorgen S. Hansen, Aditya Sarwade, George Zhang, Bryan Tan
In-Reply-To: <DM2PR0501MB844852987661859C6654E40C5F10@DM2PR0501MB844.namprd05.prod.outlook.com>

On Wed, Sep 14, 2016 at 04:00:25PM +0000, Adit Ranadive wrote:
> On Wed, Sep 14, 2016 at 04:09:12 -0700, Yuval Shaia wrote:
> > Please update vmxnet3_drv.c accordingly.
> 
> Any reason why? I don't think we need to. Vmxnet3 should just pick up the moved
> PCI device id from pci_ids.h file.

So now you need to include it from vmxnet3_drv.c.
Same with pvrdma_main.c

> 
> - Adit
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-09-14 16:30 UTC (permalink / raw)
  To: David Ahern; +Cc: Mahesh Bandewar, netdev, Eric Dumazet, David Miller
In-Reply-To: <1177f2c5-6c2d-32e1-1628-f272d6b0b5c6@cumulusnetworks.com>

Hi David, thanks for the comments.

On Tue, Sep 13, 2016 at 8:24 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 9/12/16 12:01 PM, Mahesh Bandewar wrote:
>
>> +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
>> +                           u16 proto)
>> +{
>> +     struct ipvl_addr *addr;
>> +     struct net_device *sdev;
>> +
>> +     addr = ipvlan_skb_to_addr(skb, dev);
>> +     if (!addr)
>> +             goto out;
>> +
>> +     sdev = addr->master->dev;
>> +     switch (proto) {
>> +     case AF_INET:
>> +     {
>> +             int err;
>> +             struct iphdr *ip4h = ip_hdr(skb);
>> +
>> +             err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
>> +                                        ip4h->tos, sdev);
>> +             if (unlikely(err))
>> +                     goto out;
>> +             break;
>> +     }
>> +     case AF_INET6:
>> +     {
>> +             struct dst_entry *dst;
>> +             struct ipv6hdr *ip6h = ipv6_hdr(skb);
>> +             int flags = RT6_LOOKUP_F_HAS_SADDR;
>> +             struct flowi6 fl6 = {
>> +                     .flowi6_iif   = sdev->ifindex,
>> +                     .daddr        = ip6h->daddr,
>> +                     .saddr        = ip6h->saddr,
>> +                     .flowlabel    = ip6_flowinfo(ip6h),
>> +                     .flowi6_mark  = skb->mark,
>> +                     .flowi6_proto = ip6h->nexthdr,
>> +             };
>> +
>> +             skb_dst_drop(skb);
>> +             dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
>> +             skb_dst_set(skb, dst);
>> +             break;
>> +     }
>> +     default:
>> +             break;
>> +     }
>
> Nit: why not put the above in separate per-version functions (ipvlan_ip_rcv and ipvlan_ip6_rcv) similar to what is done for ipvlan_process_outbound?
>
I can but it's small enough to have it together. Also 'proto' is a
parameter to the handler and makes it easier However do you see any
issue having just one function?

>
>> +
>> +out:
>> +     return skb;
>> +}
>> +
>> +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
>> +                          const struct nf_hook_state *state)
>> +{
>> +     struct ipvl_addr *addr;
>> +     unsigned int len;
>> +
>> +     addr = ipvlan_skb_to_addr(skb, skb->dev);
>> +     if (!addr)
>> +             goto out;
>> +
>> +     skb->dev = addr->master->dev;
>> +     len = skb->len + ETH_HLEN;
>> +     ipvlan_count_rx(addr->master, len, true, false);
>> +out:
>> +     return NF_ACCEPT;
>> +}
>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
>> index 18b4e8c7f68a..d02be277e1db 100644
>> --- a/drivers/net/ipvlan/ipvlan_main.c
>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>> @@ -9,24 +9,65 @@
>>
>>  #include "ipvlan.h"
>>
>> +static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
>> +     {
>> +             .hook     = ipvlan_nf_input,
>> +             .pf       = NFPROTO_IPV4,
>> +             .hooknum  = NF_INET_LOCAL_IN,
>> +             .priority = INT_MAX,
>> +     },
>> +     {
>> +             .hook     = ipvlan_nf_input,
>> +             .pf       = NFPROTO_IPV6,
>> +             .hooknum  = NF_INET_LOCAL_IN,
>> +             .priority = INT_MAX,
>> +     },
>> +};
>> +
>> +static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
>> +     .l3mdev_l3_rcv = ipvlan_l3_rcv,
>> +};
>> +
>>  static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev)
>>  {
>>       ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
>>  }
>>
>> -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>> +static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>>  {
>>       struct ipvl_dev *ipvlan;
>> +     int err = 0;
>>
>> +     ASSERT_RTNL();
>>       if (port->mode != nval) {
>> +             if (nval == IPVLAN_MODE_L3S) {
>> +                     port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
>> +                     port->dev->priv_flags |= IFF_L3MDEV_MASTER;
>> +                     if (!port->ipt_hook_added) {
>> +                             err = _nf_register_hooks(ipvl_nfops,
>> +                                                     ARRAY_SIZE(ipvl_nfops));
>
> That's clever. The hooks are not device based so why do the register for each device? Alternatively, you could use a static dst like VRF does for Tx. In the ipvlan rcv function set the dst input handler to send the packet back to the ipvlan driver via dst->input. From there send the packet through the netfilter hooks and then do the real lookup, update the dst and call its input function. I have working code for VRF driver somewhere that shows how to do this.
>
Do you mean per slave device?  It's registering it for a port (so only
once) depending on the mode. If the mode != l3s it wont bother
registering to keep current modes as they are.
I'm sure dst idea could work as well (as you have suggested) but
l3mdev + ipt-hook is simpler and probably far less intrusive.

>
>> +                             if (!err)
>> +                                     port->ipt_hook_added = true;
>> +                             else
>> +                                     return err;
>> +                     }
>> +             } else {
>> +                     port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
>> +                     port->dev->l3mdev_ops = NULL;
>> +                     if (port->ipt_hook_added)
>> +                             _nf_unregister_hooks(ipvl_nfops,
>> +                                                  ARRAY_SIZE(ipvl_nfops));
>> +                     port->ipt_hook_added = false;
>> +             }
>
>

^ permalink raw reply

* Re: [PATCH v3 net 1/1] net sched actions: fix GETing actions
From: Cong Wang @ 2016-09-14 16:30 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: David Miller, Linux Kernel Network Developers
In-Reply-To: <61972f5c-b2b7-efca-4068-1b3af9aabbf4@mojatatu.com>

On Wed, Sep 14, 2016 at 4:33 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 16-09-13 03:47 PM, Jamal Hadi Salim wrote:
>>
>> On 16-09-13 12:20 PM, Cong Wang wrote:
>>>
>>> On Mon, Sep 12, 2016 at 4:07 PM, Jamal Hadi Salim <jhs@mojatatu.com>
>>> wrote:
>
>
> [..]
>>>
>>> I am still trying to understand this piece, so here you hold the refcnt
>>> for the same action used by the later iteration? Otherwise there is
>>> almost none user inbetween hold and release...
>>>
>>> The comment you add is not clear to me, we use RTNL/RCU to
>>> sync destroy and replace, so how could that happen?
>>>
>>
>> I was worried about the destroy() hitting an error in that function.
>> If an action already existed and all we asked for was to
>> replace some attribute it would be deleted. It was the way the code was
>> before your changes so i just restored it to its original form.
>>
>
> And I have verified this is needed. I went and made gact return
> a failure if you replace something. I added a gact action; then
> when i replaced it failed. And when it failed it replace the existing
> action.

Oh, I see, the comments actually mislead me. ;) You are trying to
fix the rollback for failure path here... Makes sense to me now.

Thanks for verifying it.

^ permalink raw reply

* Re: [PATCH v2 net 1/1] net sched actions: fix GETing actions
From: Cong Wang @ 2016-09-14 16:32 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: David Miller, Linux Kernel Network Developers
In-Reply-To: <1473675692-8490-1-git-send-email-jhs@emojatatu.com>

On Mon, Sep 12, 2016 at 3:21 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> From: Jamal Hadi Salim <jhs@mojatatu.com>
>
> With the batch changes that translated transient actions into
> a temporary list we lost in the translation the fact that
> tcf_action_destroy() will eventually delete the action from
> the permanent location if the refcount is zero.
>
> Example of what broke:
> ...add a gact action to drop
> sudo $TC actions add action drop index 10
> ...now retrieve it, looks good
> sudo $TC actions get action gact index 10
> ...retrieve it again and find it is gone!
> sudo $TC actions get action gact index 10
>
> Fixes:
> 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array"),
> 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
> f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")
>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>

Maybe the changelog/comment needs slightly improved,
but that is a very minor issue. So,

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

^ permalink raw reply

* Re: [PATCH v4 12/16] IB/pvrdma: Add Queue Pair support
From: Yuval Shaia @ 2016-09-14 16:32 UTC (permalink / raw)
  To: Adit Ranadive
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	pv-drivers-pghWNbHTmq7QT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, jhansen-pghWNbHTmq7QT0dZR+AlfA,
	asarwade-pghWNbHTmq7QT0dZR+AlfA,
	georgezhang-pghWNbHTmq7QT0dZR+AlfA,
	bryantan-pghWNbHTmq7QT0dZR+AlfA
In-Reply-To: <1473655766-31628-13-git-send-email-aditr-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>

No more comments.
Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On Sun, Sep 11, 2016 at 09:49:22PM -0700, Adit Ranadive wrote:
> This patch adds the ability to create, modify, query and destroy QPs. The
> PVRDMA device supports RC, UD and GSI QPs.
> 
> Reviewed-by: Jorgen Hansen <jhansen-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
> Reviewed-by: George Zhang <georgezhang-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
> Reviewed-by: Aditya Sarwade <asarwade-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
> Reviewed-by: Bryan Tan <bryantan-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Adit Ranadive <aditr-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>
> ---
> Changes v3->v4:
>  - Removed an unnecessary switch case.
>  - Unified the returns in pvrdma_create_qp to use one exit point.
>  - Renamed pvrdma_flush_cqe to _pvrdma_flush_cqe since we need a lock to
>  be held when calling this.
>  - Updated to use wrapper for UAR write for QP.
>  - Updated conversion function to func_name(dst, src) format.
>  - Renamed max_gs to max_sg.
>  - Renamed cap variable to req_cap in pvrdma_set_sq/rq_size.
>  - Changed dev_warn to dev_warn_ratelimited in pvrdma_post_send/recv.
>  - Added nesting locking for flushing CQs when destroying/resetting a QP.
>  - Added missing ret value.
> 
> Changes v2->v3:
>  - Removed boolean in pvrdma_cmd_post.
> ---
>  drivers/infiniband/hw/pvrdma/pvrdma_qp.c | 980 +++++++++++++++++++++++++++++++
>  1 file changed, 980 insertions(+)
>  create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_qp.c
> 
> diff --git a/drivers/infiniband/hw/pvrdma/pvrdma_qp.c b/drivers/infiniband/hw/pvrdma/pvrdma_qp.c
> new file mode 100644
> index 0000000..4163186
> --- /dev/null
> +++ b/drivers/infiniband/hw/pvrdma/pvrdma_qp.c
> @@ -0,0 +1,980 @@
> +/*
> + * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of EITHER the GNU General Public License
> + * version 2 as published by the Free Software Foundation or the BSD
> + * 2-Clause License. This program is distributed in the hope that it
> + * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
> + * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
> + * See the GNU General Public License version 2 for more details at
> + * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program available in the file COPYING in the main
> + * directory of this source tree.
> + *
> + * The BSD 2-Clause License
> + *
> + *     Redistribution and use in source and binary forms, with or
> + *     without modification, are permitted provided that the following
> + *     conditions are met:
> + *
> + *      - Redistributions of source code must retain the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer.
> + *
> + *      - Redistributions in binary form must reproduce the above
> + *        copyright notice, this list of conditions and the following
> + *        disclaimer in the documentation and/or other materials
> + *        provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> + * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
> + * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
> + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
> + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> + * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
> + * OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <asm/page.h>
> +#include <linux/io.h>
> +#include <linux/wait.h>
> +#include <rdma/ib_addr.h>
> +#include <rdma/ib_smi.h>
> +#include <rdma/ib_user_verbs.h>
> +
> +#include "pvrdma.h"
> +#include "pvrdma_user.h"
> +
> +static inline void get_cqs(struct pvrdma_qp *qp, struct pvrdma_cq **send_cq,
> +			   struct pvrdma_cq **recv_cq)
> +{
> +	*send_cq = to_vcq(qp->ibqp.send_cq);
> +	*recv_cq = to_vcq(qp->ibqp.recv_cq);
> +}
> +
> +static void pvrdma_lock_cqs(struct pvrdma_cq *scq, struct pvrdma_cq *rcq,
> +			    unsigned long *scq_flags,
> +			    unsigned long *rcq_flags)
> +	__acquires(scq->cq_lock) __acquires(rcq->cq_lock)
> +{
> +	if (scq == rcq) {
> +		spin_lock_irqsave(&scq->cq_lock, *scq_flags);
> +		__acquire(rcq->cq_lock);
> +	} else if (scq->cq_handle < rcq->cq_handle) {
> +		spin_lock_irqsave(&scq->cq_lock, *scq_flags);
> +		spin_lock_irqsave_nested(&rcq->cq_lock, *rcq_flags,
> +					 SINGLE_DEPTH_NESTING);
> +	} else {
> +		spin_lock_irqsave(&rcq->cq_lock, *rcq_flags);
> +		spin_lock_irqsave_nested(&scq->cq_lock, *scq_flags,
> +					 SINGLE_DEPTH_NESTING);
> +	}
> +}
> +
> +static void pvrdma_unlock_cqs(struct pvrdma_cq *scq, struct pvrdma_cq *rcq,
> +			      unsigned long *scq_flags,
> +			      unsigned long *rcq_flags)
> +	__releases(scq->cq_lock) __releases(rcq->cq_lock)
> +{
> +	if (scq == rcq) {
> +		__release(rcq->cq_lock);
> +		spin_unlock_irqrestore(&scq->cq_lock, *scq_flags);
> +	} else if (scq->cq_handle < rcq->cq_handle) {
> +		spin_unlock_irqrestore(&rcq->cq_lock, *rcq_flags);
> +		spin_unlock_irqrestore(&scq->cq_lock, *scq_flags);
> +	} else {
> +		spin_unlock_irqrestore(&scq->cq_lock, *scq_flags);
> +		spin_unlock_irqrestore(&rcq->cq_lock, *rcq_flags);
> +	}
> +}
> +
> +static void pvrdma_reset_qp(struct pvrdma_qp *qp)
> +{
> +	struct pvrdma_cq *scq, *rcq;
> +	unsigned long scq_flags, rcq_flags;
> +
> +	/* Clean up cqes */
> +	get_cqs(qp, &scq, &rcq);
> +	pvrdma_lock_cqs(scq, rcq, &scq_flags, &rcq_flags);
> +
> +	_pvrdma_flush_cqe(qp, scq);
> +	if (scq != rcq)
> +		_pvrdma_flush_cqe(qp, rcq);
> +
> +	pvrdma_unlock_cqs(scq, rcq, &scq_flags, &rcq_flags);
> +
> +	/*
> +	 * Reset queuepair. The checks are because usermode queuepairs won't
> +	 * have kernel ringstates.
> +	 */
> +	if (qp->rq.ring) {
> +		atomic_set(&qp->rq.ring->cons_head, 0);
> +		atomic_set(&qp->rq.ring->prod_tail, 0);
> +	}
> +	if (qp->sq.ring) {
> +		atomic_set(&qp->sq.ring->cons_head, 0);
> +		atomic_set(&qp->sq.ring->prod_tail, 0);
> +	}
> +}
> +
> +static int pvrdma_set_rq_size(struct pvrdma_dev *dev,
> +			      struct ib_qp_cap *req_cap,
> +			      struct pvrdma_qp *qp)
> +{
> +	if (req_cap->max_recv_wr > dev->dsr->caps.max_qp_wr ||
> +	    req_cap->max_recv_sge > dev->dsr->caps.max_sge) {
> +		dev_warn(&dev->pdev->dev, "recv queue size invalid\n");
> +		return -EINVAL;
> +	}
> +
> +	qp->rq.wqe_cnt = roundup_pow_of_two(max(1U, req_cap->max_recv_wr));
> +	qp->rq.max_sg = roundup_pow_of_two(max(1U, req_cap->max_recv_sge));
> +
> +	/* Write back */
> +	req_cap->max_recv_wr = qp->rq.wqe_cnt;
> +	req_cap->max_recv_sge = qp->rq.max_sg;
> +
> +	qp->rq.wqe_size = roundup_pow_of_two(sizeof(struct pvrdma_rq_wqe_hdr) +
> +					     sizeof(struct ib_sge) *
> +					     qp->rq.max_sg);
> +	qp->npages_recv = (qp->rq.wqe_cnt * qp->rq.wqe_size + PAGE_SIZE - 1) /
> +			  PAGE_SIZE;
> +
> +	return 0;
> +}
> +
> +static int pvrdma_set_sq_size(struct pvrdma_dev *dev, struct ib_qp_cap *req_cap,
> +			      enum ib_qp_type type, struct pvrdma_qp *qp)
> +{
> +	if (req_cap->max_send_wr > dev->dsr->caps.max_qp_wr ||
> +	    req_cap->max_send_sge > dev->dsr->caps.max_sge) {
> +		dev_warn(&dev->pdev->dev, "send queue size invalid\n");
> +		return -EINVAL;
> +	}
> +
> +	qp->sq.wqe_cnt = roundup_pow_of_two(max(1U, req_cap->max_send_wr));
> +	qp->sq.max_sg = roundup_pow_of_two(max(1U, req_cap->max_send_sge));
> +
> +	/* Write back */
> +	req_cap->max_send_wr = qp->sq.wqe_cnt;
> +	req_cap->max_send_sge = qp->sq.max_sg;
> +
> +	qp->sq.wqe_size = roundup_pow_of_two(sizeof(struct pvrdma_sq_wqe_hdr) +
> +					     sizeof(struct ib_sge) *
> +					     qp->sq.max_sg);
> +	/* Note: one extra page for the header. */
> +	qp->npages_send = 1 + (qp->sq.wqe_cnt * qp->sq.wqe_size +
> +			       PAGE_SIZE - 1) / PAGE_SIZE;
> +
> +	return 0;
> +}
> +
> +/**
> + * pvrdma_create_qp - create queue pair
> + * @pd: protection domain
> + * @init_attr: queue pair attributes
> + * @udata: user data
> + *
> + * @return: the ib_qp pointer on success, otherwise returns an errno.
> + */
> +struct ib_qp *pvrdma_create_qp(struct ib_pd *pd,
> +			       struct ib_qp_init_attr *init_attr,
> +			       struct ib_udata *udata)
> +{
> +	struct pvrdma_qp *qp = NULL;
> +	struct pvrdma_dev *dev = to_vdev(pd->device);
> +	struct pvrdma_cq *send_cq, *recv_cq;
> +	union pvrdma_cmd_req req;
> +	union pvrdma_cmd_resp rsp;
> +	struct pvrdma_cmd_create_qp *cmd = &req.create_qp;
> +	struct pvrdma_cmd_create_qp_resp *resp = &rsp.create_qp_resp;
> +	struct pvrdma_create_qp ucmd;
> +	unsigned long flags;
> +	int ret;
> +
> +	if (init_attr->create_flags) {
> +		dev_warn(&dev->pdev->dev,
> +			 "invalid create queuepair flags %#x\n",
> +			 init_attr->create_flags);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	if (init_attr->qp_type != IB_QPT_RC &&
> +	    init_attr->qp_type != IB_QPT_UD &&
> +	    init_attr->qp_type != IB_QPT_GSI) {
> +		dev_warn(&dev->pdev->dev, "queuepair type %d not supported\n",
> +			 init_attr->qp_type);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	if (!atomic_add_unless(&dev->num_qps, 1, dev->dsr->caps.max_qp))
> +		return ERR_PTR(-ENOMEM);
> +
> +	switch (init_attr->qp_type) {
> +	case IB_QPT_GSI:
> +		if (init_attr->port_num == 0 ||
> +		    init_attr->port_num > pd->device->phys_port_cnt ||
> +		    udata) {
> +			dev_warn(&dev->pdev->dev, "invalid queuepair attrs\n");
> +			ret = -EINVAL;
> +			goto err_qp;
> +		}
> +		/* fall through */
> +	case IB_QPT_RC:
> +	case IB_QPT_UD:
> +		qp = kzalloc(sizeof(*qp), GFP_KERNEL);
> +		if (!qp) {
> +			ret = -ENOMEM;
> +			goto err_qp;
> +		}
> +
> +		spin_lock_init(&qp->sq.lock);
> +		spin_lock_init(&qp->rq.lock);
> +		mutex_init(&qp->mutex);
> +		atomic_set(&qp->refcnt, 1);
> +		init_waitqueue_head(&qp->wait);
> +
> +		qp->state = IB_QPS_RESET;
> +
> +		if (pd->uobject && udata) {
> +			dev_dbg(&dev->pdev->dev,
> +				"create queuepair from user space\n");
> +
> +			if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd))) {
> +				ret = -EFAULT;
> +				goto err_qp;
> +			}
> +
> +			/* set qp->sq.wqe_cnt, shift, buf_size.. */
> +			qp->rumem = ib_umem_get(pd->uobject->context,
> +						ucmd.rbuf_addr,
> +						ucmd.rbuf_size, 0, 0);
> +			if (IS_ERR(qp->rumem)) {
> +				ret = PTR_ERR(qp->rumem);
> +				goto err_qp;
> +			}
> +
> +			qp->sumem = ib_umem_get(pd->uobject->context,
> +						ucmd.sbuf_addr,
> +						ucmd.sbuf_size, 0, 0);
> +			if (IS_ERR(qp->sumem)) {
> +				ib_umem_release(qp->rumem);
> +				ret = PTR_ERR(qp->sumem);
> +				goto err_qp;
> +			}
> +
> +			qp->npages_send = ib_umem_page_count(qp->sumem);
> +			qp->npages_recv = ib_umem_page_count(qp->rumem);
> +			qp->npages = qp->npages_send + qp->npages_recv;
> +		} else {
> +			qp->is_kernel = true;
> +
> +			send_cq = to_vcq(init_attr->send_cq);
> +			recv_cq = to_vcq(init_attr->recv_cq);
> +
> +			ret = pvrdma_set_sq_size(to_vdev(pd->device),
> +						 &init_attr->cap,
> +						 init_attr->qp_type, qp);
> +			if (ret)
> +				goto err_qp;
> +
> +			ret = pvrdma_set_rq_size(to_vdev(pd->device),
> +						 &init_attr->cap, qp);
> +			if (ret)
> +				goto err_qp;
> +
> +			qp->npages = qp->npages_send + qp->npages_recv;
> +
> +			/* Skip header page. */
> +			qp->sq.offset = PAGE_SIZE;
> +
> +			/* Recv queue pages are after send pages. */
> +			qp->rq.offset = qp->npages_send * PAGE_SIZE;
> +		}
> +
> +		if (qp->npages < 0 || qp->npages > PVRDMA_PAGE_DIR_MAX_PAGES) {
> +			dev_warn(&dev->pdev->dev,
> +				 "overflow pages in queuepair\n");
> +			ret = -EINVAL;
> +			goto err_umem;
> +		}
> +
> +		ret = pvrdma_page_dir_init(dev, &qp->pdir, qp->npages,
> +					   qp->is_kernel);
> +		if (ret) {
> +			dev_warn(&dev->pdev->dev,
> +				 "could not allocate page directory\n");
> +			goto err_umem;
> +		}
> +
> +		if (!qp->is_kernel) {
> +			pvrdma_page_dir_insert_umem(&qp->pdir, qp->sumem, 0);
> +			pvrdma_page_dir_insert_umem(&qp->pdir, qp->rumem,
> +						    qp->npages_send);
> +		} else {
> +			/* Ring state is always the first page. */
> +			qp->sq.ring = qp->pdir.pages[0];
> +			qp->rq.ring = &qp->sq.ring[1];
> +		}
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto err_qp;
> +	}
> +
> +	/* Not supported */
> +	init_attr->cap.max_inline_data = 0;
> +
> +	memset(cmd, 0, sizeof(*cmd));
> +	cmd->hdr.cmd = PVRDMA_CMD_CREATE_QP;
> +	cmd->pd_handle = to_vpd(pd)->pd_handle;
> +	cmd->send_cq_handle = to_vcq(init_attr->send_cq)->cq_handle;
> +	cmd->recv_cq_handle = to_vcq(init_attr->recv_cq)->cq_handle;
> +	cmd->max_send_wr = init_attr->cap.max_send_wr;
> +	cmd->max_recv_wr = init_attr->cap.max_recv_wr;
> +	cmd->max_send_sge = init_attr->cap.max_send_sge;
> +	cmd->max_recv_sge = init_attr->cap.max_recv_sge;
> +	cmd->max_inline_data = init_attr->cap.max_inline_data;
> +	cmd->sq_sig_all = (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) ? 1 : 0;
> +	cmd->qp_type = ib_qp_type_to_pvrdma(init_attr->qp_type);
> +	cmd->access_flags = IB_ACCESS_LOCAL_WRITE;
> +	cmd->total_chunks = qp->npages;
> +	cmd->send_chunks = qp->npages_send - 1;
> +	cmd->pdir_dma = qp->pdir.dir_dma;
> +
> +	dev_dbg(&dev->pdev->dev, "create queuepair with %d, %d, %d, %d\n",
> +		cmd->max_send_wr, cmd->max_recv_wr, cmd->max_send_sge,
> +		cmd->max_recv_sge);
> +
> +	ret = pvrdma_cmd_post(dev, &req, &rsp);
> +	if (ret < 0 || resp->hdr.ack != PVRDMA_CMD_CREATE_QP_RESP) {
> +		dev_warn(&dev->pdev->dev, "could not create queuepair\n");
> +		goto err_pdir;
> +	}
> +
> +	/* max_send_wr/_recv_wr/_send_sge/_recv_sge/_inline_data */
> +	qp->qp_handle = resp->qpn;
> +	qp->port = init_attr->port_num;
> +	qp->ibqp.qp_num = resp->qpn;
> +	spin_lock_irqsave(&dev->qp_tbl_lock, flags);
> +	dev->qp_tbl[qp->qp_handle % dev->dsr->caps.max_qp] = qp;
> +	spin_unlock_irqrestore(&dev->qp_tbl_lock, flags);
> +
> +	return &qp->ibqp;
> +
> +err_pdir:
> +	pvrdma_page_dir_cleanup(dev, &qp->pdir);
> +err_umem:
> +	if (pd->uobject && udata) {
> +		if (qp->rumem)
> +			ib_umem_release(qp->rumem);
> +		if (qp->sumem)
> +			ib_umem_release(qp->sumem);
> +	}
> +err_qp:
> +	kfree(qp);
> +	atomic_dec(&dev->num_qps);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +static void pvrdma_free_qp(struct pvrdma_qp *qp)
> +{
> +	struct pvrdma_dev *dev = to_vdev(qp->ibqp.device);
> +	struct pvrdma_cq *scq;
> +	struct pvrdma_cq *rcq;
> +	unsigned long flags, scq_flags, rcq_flags;
> +
> +	/* In case cq is polling */
> +	get_cqs(qp, &scq, &rcq);
> +	pvrdma_lock_cqs(scq, rcq, &scq_flags, &rcq_flags);
> +
> +	_pvrdma_flush_cqe(qp, scq);
> +	if (scq != rcq)
> +		_pvrdma_flush_cqe(qp, rcq);
> +
> +	spin_lock_irqsave(&dev->qp_tbl_lock, flags);
> +	dev->qp_tbl[qp->qp_handle] = NULL;
> +	spin_unlock_irqrestore(&dev->qp_tbl_lock, flags);
> +
> +	pvrdma_unlock_cqs(scq, rcq, &scq_flags, &rcq_flags);
> +
> +	atomic_dec(&qp->refcnt);
> +	wait_event(qp->wait, !atomic_read(&qp->refcnt));
> +
> +	pvrdma_page_dir_cleanup(dev, &qp->pdir);
> +
> +	kfree(qp);
> +
> +	atomic_dec(&dev->num_qps);
> +}
> +
> +/**
> + * pvrdma_destroy_qp - destroy a queue pair
> + * @qp: the queue pair to destroy
> + *
> + * @return: 0 on success.
> + */
> +int pvrdma_destroy_qp(struct ib_qp *qp)
> +{
> +	struct pvrdma_qp *vqp = to_vqp(qp);
> +	union pvrdma_cmd_req req;
> +	struct pvrdma_cmd_destroy_qp *cmd = &req.destroy_qp;
> +	int ret;
> +
> +	memset(cmd, 0, sizeof(*cmd));
> +	cmd->hdr.cmd = PVRDMA_CMD_DESTROY_QP;
> +	cmd->qp_handle = vqp->qp_handle;
> +
> +	ret = pvrdma_cmd_post(to_vdev(qp->device), &req, NULL);
> +	if (ret < 0) {
> +		struct pvrdma_dev *dev = to_vdev(qp->device);
> +
> +		dev_warn(&dev->pdev->dev, "destroy queuepair failed\n");
> +	}
> +
> +	pvrdma_free_qp(vqp);
> +
> +	return 0;
> +}
> +
> +/**
> + * pvrdma_modify_qp - modify queue pair attributes
> + * @ibqp: the queue pair
> + * @attr: the new queue pair's attributes
> + * @attr_mask: attributes mask
> + * @udata: user data
> + *
> + * @returns 0 on success, otherwise returns an errno.
> + */
> +int pvrdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
> +		     int attr_mask, struct ib_udata *udata)
> +{
> +	struct pvrdma_dev *dev = to_vdev(ibqp->device);
> +	struct pvrdma_qp *qp = to_vqp(ibqp);
> +	union pvrdma_cmd_req req;
> +	union pvrdma_cmd_resp rsp;
> +	struct pvrdma_cmd_modify_qp *cmd = &req.modify_qp;
> +	int cur_state, next_state;
> +	int ret;
> +
> +	/* Sanity checking. Should need lock here */
> +	mutex_lock(&qp->mutex);
> +	cur_state = (attr_mask & IB_QP_CUR_STATE) ? attr->cur_qp_state :
> +		qp->state;
> +	next_state = (attr_mask & IB_QP_STATE) ? attr->qp_state : cur_state;
> +
> +	if (!ib_modify_qp_is_ok(cur_state, next_state, ibqp->qp_type,
> +				attr_mask, IB_LINK_LAYER_ETHERNET)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (attr_mask & IB_QP_PORT) {
> +		if (attr->port_num == 0 ||
> +		    attr->port_num > ibqp->device->phys_port_cnt) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
> +	if (attr_mask & IB_QP_MIN_RNR_TIMER) {
> +		if (attr->min_rnr_timer > 31) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
> +	if (attr_mask & IB_QP_PKEY_INDEX) {
> +		if (attr->pkey_index >= dev->dsr->caps.max_pkeys) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +	}
> +
> +	if (attr_mask & IB_QP_QKEY)
> +		qp->qkey = attr->qkey;
> +
> +	if (cur_state == next_state && cur_state == IB_QPS_RESET) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	qp->state = next_state;
> +	memset(cmd, 0, sizeof(*cmd));
> +	cmd->hdr.cmd = PVRDMA_CMD_MODIFY_QP;
> +	cmd->qp_handle = qp->qp_handle;
> +	cmd->attr_mask = ib_qp_attr_mask_to_pvrdma(attr_mask);
> +	cmd->attrs.qp_state = ib_qp_state_to_pvrdma(attr->qp_state);
> +	cmd->attrs.cur_qp_state =
> +		ib_qp_state_to_pvrdma(attr->cur_qp_state);
> +	cmd->attrs.path_mtu = ib_mtu_to_pvrdma(attr->path_mtu);
> +	cmd->attrs.path_mig_state =
> +		ib_mig_state_to_pvrdma(attr->path_mig_state);
> +	cmd->attrs.qkey = attr->qkey;
> +	cmd->attrs.rq_psn = attr->rq_psn;
> +	cmd->attrs.sq_psn = attr->sq_psn;
> +	cmd->attrs.dest_qp_num = attr->dest_qp_num;
> +	cmd->attrs.qp_access_flags =
> +		ib_access_flags_to_pvrdma(attr->qp_access_flags);
> +	cmd->attrs.pkey_index = attr->pkey_index;
> +	cmd->attrs.alt_pkey_index = attr->alt_pkey_index;
> +	cmd->attrs.en_sqd_async_notify = attr->en_sqd_async_notify;
> +	cmd->attrs.sq_draining = attr->sq_draining;
> +	cmd->attrs.max_rd_atomic = attr->max_rd_atomic;
> +	cmd->attrs.max_dest_rd_atomic = attr->max_dest_rd_atomic;
> +	cmd->attrs.min_rnr_timer = attr->min_rnr_timer;
> +	cmd->attrs.port_num = attr->port_num;
> +	cmd->attrs.timeout = attr->timeout;
> +	cmd->attrs.retry_cnt = attr->retry_cnt;
> +	cmd->attrs.rnr_retry = attr->rnr_retry;
> +	cmd->attrs.alt_port_num = attr->alt_port_num;
> +	cmd->attrs.alt_timeout = attr->alt_timeout;
> +	ib_qp_cap_to_pvrdma(&cmd->attrs.cap, &attr->cap);
> +	ib_ah_attr_to_pvrdma(&cmd->attrs.ah_attr, &attr->ah_attr);
> +	ib_ah_attr_to_pvrdma(&cmd->attrs.alt_ah_attr, &attr->alt_ah_attr);
> +	ret = pvrdma_cmd_post(dev, &req, &rsp);
> +	if (ret < 0 || rsp.hdr.ack != PVRDMA_CMD_MODIFY_QP_RESP ||
> +	    rsp.hdr.err > 0) {
> +		dev_warn(&dev->pdev->dev,
> +			 "could not modify queuepair\n");
> +		if (ret == 0) {
> +			if (rsp.hdr.ack != PVRDMA_CMD_MODIFY_QP_RESP)
> +				ret = -EINVAL;
> +			else
> +				ret = rsp.hdr.err;
> +		}
> +	}
> +
> +	if (ret == 0 && next_state == IB_QPS_RESET)
> +		pvrdma_reset_qp(qp);
> +
> +out:
> +	mutex_unlock(&qp->mutex);
> +
> +	return ret;
> +}
> +
> +static inline void *get_sq_wqe(struct pvrdma_qp *qp, int n)
> +{
> +	return pvrdma_page_dir_get_ptr(&qp->pdir,
> +				       qp->sq.offset + n * qp->sq.wqe_size);
> +}
> +
> +static inline void *get_rq_wqe(struct pvrdma_qp *qp, int n)
> +{
> +	return pvrdma_page_dir_get_ptr(&qp->pdir,
> +				       qp->rq.offset + n * qp->rq.wqe_size);
> +}
> +
> +static int set_reg_seg(struct pvrdma_sq_wqe_hdr *wqe_hdr, struct ib_reg_wr *wr)
> +{
> +	struct pvrdma_user_mr *mr = to_vmr(wr->mr);
> +
> +	wqe_hdr->wr.fast_reg.iova_start = mr->ibmr.iova;
> +	wqe_hdr->wr.fast_reg.pl_pdir_dma = mr->pdir.dir_dma;
> +	wqe_hdr->wr.fast_reg.page_shift = mr->page_shift;
> +	wqe_hdr->wr.fast_reg.page_list_len = mr->npages;
> +	wqe_hdr->wr.fast_reg.length = mr->ibmr.length;
> +	wqe_hdr->wr.fast_reg.access_flags = wr->access;
> +	wqe_hdr->wr.fast_reg.rkey = wr->key;
> +
> +	return pvrdma_page_dir_insert_page_list(&mr->pdir, mr->pages,
> +						mr->npages);
> +}
> +
> +/**
> + * pvrdma_post_send - post send work request entries on a QP
> + * @ibqp: the QP
> + * @wr: work request list to post
> + * @bad_wr: the first bad WR returned
> + *
> + * @return: 0 on success, otherwise errno returned.
> + */
> +int pvrdma_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> +		     struct ib_send_wr **bad_wr)
> +{
> +	struct pvrdma_qp *qp = to_vqp(ibqp);
> +	struct pvrdma_dev *dev = to_vdev(ibqp->device);
> +	unsigned long flags;
> +	struct pvrdma_sq_wqe_hdr *wqe_hdr;
> +	struct ib_sge *sge;
> +	int i, index;
> +	int nreq;
> +	int ret;
> +
> +	/*
> +	 * In states lower than RTS, we can fail immediately. In other states,
> +	 * just post and let the device figure it out.
> +	 */
> +	if (qp->state < IB_QPS_RTS) {
> +		*bad_wr = wr;
> +		return -EINVAL;
> +	}
> +
> +	spin_lock_irqsave(&qp->sq.lock, flags);
> +
> +	index = pvrdma_idx(&qp->sq.ring->prod_tail, qp->sq.wqe_cnt);
> +	for (nreq = 0; wr; nreq++, wr = wr->next) {
> +		unsigned int tail;
> +
> +		if (unlikely(!pvrdma_idx_ring_has_space(
> +				qp->sq.ring, qp->sq.wqe_cnt, &tail))) {
> +			dev_warn_ratelimited(&dev->pdev->dev,
> +					     "send queue is full\n");
> +			*bad_wr = wr;
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		if (unlikely(wr->num_sge > qp->sq.max_sg || wr->num_sge < 0)) {
> +			dev_warn_ratelimited(&dev->pdev->dev,
> +					     "send SGE overflow\n");
> +			*bad_wr = wr;
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (unlikely(wr->opcode < 0)) {
> +			dev_warn_ratelimited(&dev->pdev->dev,
> +					     "invalid send opcode\n");
> +			*bad_wr = wr;
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		/*
> +		 * Only support UD, RC.
> +		 * Need to check opcode table for thorough checking.
> +		 * opcode		_UD	_UC	_RC
> +		 * _SEND		x	x	x
> +		 * _SEND_WITH_IMM	x	x	x
> +		 * _RDMA_WRITE			x	x
> +		 * _RDMA_WRITE_WITH_IMM		x	x
> +		 * _LOCAL_INV			x	x
> +		 * _SEND_WITH_INV		x	x
> +		 * _RDMA_READ				x
> +		 * _ATOMIC_CMP_AND_SWP			x
> +		 * _ATOMIC_FETCH_AND_ADD		x
> +		 * _MASK_ATOMIC_CMP_AND_SWP		x
> +		 * _MASK_ATOMIC_FETCH_AND_ADD		x
> +		 * _REG_MR				x
> +		 *
> +		 */
> +		if (qp->ibqp.qp_type != IB_QPT_UD &&
> +		    qp->ibqp.qp_type != IB_QPT_RC &&
> +			wr->opcode != IB_WR_SEND) {
> +			dev_warn_ratelimited(&dev->pdev->dev,
> +					     "unsupported queuepair type\n");
> +			*bad_wr = wr;
> +			ret = -EINVAL;
> +			goto out;
> +		} else if (qp->ibqp.qp_type == IB_QPT_UD ||
> +			   qp->ibqp.qp_type == IB_QPT_GSI) {
> +			if (wr->opcode != IB_WR_SEND &&
> +			    wr->opcode != IB_WR_SEND_WITH_IMM) {
> +				dev_warn_ratelimited(&dev->pdev->dev,
> +						     "invalid send opcode\n");
> +				*bad_wr = wr;
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +		}
> +
> +		wqe_hdr = (struct pvrdma_sq_wqe_hdr *)get_sq_wqe(qp, index);
> +		memset(wqe_hdr, 0, sizeof(*wqe_hdr));
> +		wqe_hdr->wr_id = wr->wr_id;
> +		wqe_hdr->num_sge = wr->num_sge;
> +		wqe_hdr->opcode = ib_wr_opcode_to_pvrdma(wr->opcode);
> +		wqe_hdr->send_flags = ib_send_flags_to_pvrdma(wr->send_flags);
> +		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
> +		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
> +			wqe_hdr->ex.imm_data = wr->ex.imm_data;
> +
> +		switch (qp->ibqp.qp_type) {
> +		case IB_QPT_GSI:
> +		case IB_QPT_UD:
> +			if (unlikely(!ud_wr(wr)->ah)) {
> +				dev_warn_ratelimited(&dev->pdev->dev,
> +						     "invalid address handle\n");
> +				*bad_wr = wr;
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +
> +			/*
> +			 * Use qkey from qp context if high order bit set,
> +			 * otherwise from work request.
> +			 */
> +			wqe_hdr->wr.ud.remote_qpn = ud_wr(wr)->remote_qpn;
> +			wqe_hdr->wr.ud.remote_qkey =
> +				ud_wr(wr)->remote_qkey & 0x80000000 ?
> +				qp->qkey : ud_wr(wr)->remote_qkey;
> +			wqe_hdr->wr.ud.av = to_vah(ud_wr(wr)->ah)->av;
> +
> +			break;
> +		case IB_QPT_RC:
> +			switch (wr->opcode) {
> +			case IB_WR_RDMA_READ:
> +			case IB_WR_RDMA_WRITE:
> +			case IB_WR_RDMA_WRITE_WITH_IMM:
> +				wqe_hdr->wr.rdma.remote_addr =
> +					rdma_wr(wr)->remote_addr;
> +				wqe_hdr->wr.rdma.rkey = rdma_wr(wr)->rkey;
> +				break;
> +			case IB_WR_LOCAL_INV:
> +			case IB_WR_SEND_WITH_INV:
> +				wqe_hdr->ex.invalidate_rkey =
> +					wr->ex.invalidate_rkey;
> +				break;
> +			case IB_WR_ATOMIC_CMP_AND_SWP:
> +			case IB_WR_ATOMIC_FETCH_AND_ADD:
> +				wqe_hdr->wr.atomic.remote_addr =
> +					atomic_wr(wr)->remote_addr;
> +				wqe_hdr->wr.atomic.rkey = atomic_wr(wr)->rkey;
> +				wqe_hdr->wr.atomic.compare_add =
> +					atomic_wr(wr)->compare_add;
> +				if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP)
> +					wqe_hdr->wr.atomic.swap =
> +						atomic_wr(wr)->swap;
> +				break;
> +			case IB_WR_REG_MR:
> +				ret = set_reg_seg(wqe_hdr, reg_wr(wr));
> +				if (ret < 0) {
> +					dev_warn_ratelimited(&dev->pdev->dev,
> +							     "Failed to set fast register work request\n");
> +					*bad_wr = wr;
> +					goto out;
> +				}
> +				break;
> +			default:
> +				break;
> +			}
> +
> +			break;
> +		default:
> +			dev_warn_ratelimited(&dev->pdev->dev,
> +					     "invalid queuepair type\n");
> +			ret = -EINVAL;
> +			*bad_wr = wr;
> +			goto out;
> +		}
> +
> +		sge = (struct ib_sge *)(wqe_hdr + 1);
> +		for (i = 0; i < wr->num_sge; i++) {
> +			/* Need to check wqe_size 0 or max size */
> +			sge->addr = wr->sg_list[i].addr;
> +			sge->length = wr->sg_list[i].length;
> +			sge->lkey = wr->sg_list[i].lkey;
> +			sge++;
> +		}
> +
> +		/* Make sure wqe is written before index update */
> +		smp_wmb();
> +
> +		index++;
> +		if (unlikely(index >= qp->sq.wqe_cnt))
> +			index = 0;
> +		/* Update shared sq ring */
> +		pvrdma_idx_ring_inc(&qp->sq.ring->prod_tail,
> +				    qp->sq.wqe_cnt);
> +	}
> +
> +	ret = 0;
> +
> +out:
> +	spin_unlock_irqrestore(&qp->sq.lock, flags);
> +
> +	if (!ret)
> +		pvrdma_write_uar_qp(dev, PVRDMA_UAR_QP_SEND | qp->qp_handle);
> +
> +	return ret;
> +}
> +
> +/**
> + * pvrdma_post_receive - post receive work request entries on a QP
> + * @ibqp: the QP
> + * @wr: the work request list to post
> + * @bad_wr: the first bad WR returned
> + *
> + * @return: 0 on success, otherwise errno returned.
> + */
> +int pvrdma_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
> +		     struct ib_recv_wr **bad_wr)
> +{
> +	struct pvrdma_dev *dev = to_vdev(ibqp->device);
> +	unsigned long flags;
> +	struct pvrdma_qp *qp = to_vqp(ibqp);
> +	struct pvrdma_rq_wqe_hdr *wqe_hdr;
> +	struct ib_sge *sge;
> +	int index, nreq;
> +	int ret = 0;
> +	int i;
> +
> +	/*
> +	 * In the RESET state, we can fail immediately. For other states,
> +	 * just post and let the device figure it out.
> +	 */
> +	if (qp->state == IB_QPS_RESET) {
> +		*bad_wr = wr;
> +		return -EINVAL;
> +	}
> +
> +	spin_lock_irqsave(&qp->rq.lock, flags);
> +
> +	index = pvrdma_idx(&qp->rq.ring->prod_tail, qp->rq.wqe_cnt);
> +	for (nreq = 0; wr; nreq++, wr = wr->next) {
> +		unsigned int tail;
> +
> +		if (unlikely(wr->num_sge > qp->rq.max_sg ||
> +			     wr->num_sge < 0)) {
> +			ret = -EINVAL;
> +			*bad_wr = wr;
> +			dev_warn_ratelimited(&dev->pdev->dev,
> +					     "recv SGE overflow\n");
> +			goto out;
> +		}
> +
> +		if (unlikely(!pvrdma_idx_ring_has_space(
> +				qp->rq.ring, qp->rq.wqe_cnt, &tail))) {
> +			ret = -ENOMEM;
> +			*bad_wr = wr;
> +			dev_warn_ratelimited(&dev->pdev->dev,
> +					     "recv queue full\n");
> +			goto out;
> +		}
> +
> +		wqe_hdr = (struct pvrdma_rq_wqe_hdr *)get_rq_wqe(qp, index);
> +		wqe_hdr->wr_id = wr->wr_id;
> +		wqe_hdr->num_sge = wr->num_sge;
> +		wqe_hdr->total_len = 0;
> +
> +		sge = (struct ib_sge *)(wqe_hdr + 1);
> +		for (i = 0; i < wr->num_sge; i++) {
> +			sge->addr = wr->sg_list[i].addr;
> +			sge->length = wr->sg_list[i].length;
> +			sge->lkey = wr->sg_list[i].lkey;
> +			sge++;
> +		}
> +
> +		/* Make sure wqe is written before index update */
> +		smp_wmb();
> +
> +		index++;
> +		if (unlikely(index >= qp->rq.wqe_cnt))
> +			index = 0;
> +		/* Update shared rq ring */
> +		pvrdma_idx_ring_inc(&qp->rq.ring->prod_tail,
> +				    qp->rq.wqe_cnt);
> +	}
> +
> +	spin_unlock_irqrestore(&qp->rq.lock, flags);
> +
> +	pvrdma_write_uar_qp(dev, PVRDMA_UAR_QP_RECV | qp->qp_handle);
> +
> +	return ret;
> +
> +out:
> +	spin_unlock_irqrestore(&qp->rq.lock, flags);
> +
> +	return ret;
> +}
> +
> +/**
> + * pvrdma_query_qp - query a queue pair's attributes
> + * @ibqp: the queue pair to query
> + * @attr: the queue pair's attributes
> + * @attr_mask: attributes mask
> + * @init_attr: initial queue pair attributes
> + *
> + * @returns 0 on success, otherwise returns an errno.
> + */
> +int pvrdma_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
> +		    int attr_mask, struct ib_qp_init_attr *init_attr)
> +{
> +	struct pvrdma_dev *dev = to_vdev(ibqp->device);
> +	struct pvrdma_qp *qp = to_vqp(ibqp);
> +	union pvrdma_cmd_req req;
> +	union pvrdma_cmd_resp rsp;
> +	struct pvrdma_cmd_query_qp *cmd = &req.query_qp;
> +	struct pvrdma_cmd_query_qp_resp *resp = &rsp.query_qp_resp;
> +	int ret = 0;
> +
> +	mutex_lock(&qp->mutex);
> +
> +	if (qp->state == IB_QPS_RESET) {
> +		attr->qp_state = IB_QPS_RESET;
> +		goto out;
> +	}
> +
> +	memset(cmd, 0, sizeof(*cmd));
> +	cmd->hdr.cmd = PVRDMA_CMD_QUERY_QP;
> +	cmd->qp_handle = qp->qp_handle;
> +	cmd->attr_mask = ib_qp_attr_mask_to_pvrdma(attr_mask);
> +
> +	ret = pvrdma_cmd_post(dev, &req, &rsp);
> +	if (ret < 0 || resp->hdr.ack != PVRDMA_CMD_QUERY_QP_RESP) {
> +		dev_warn(&dev->pdev->dev, "could not query queuepair\n");
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	attr->qp_state = pvrdma_qp_state_to_ib(resp->attrs.qp_state);
> +	attr->cur_qp_state =
> +		pvrdma_qp_state_to_ib(resp->attrs.cur_qp_state);
> +	attr->path_mtu = pvrdma_mtu_to_ib(resp->attrs.path_mtu);
> +	attr->path_mig_state =
> +		pvrdma_mig_state_to_ib(resp->attrs.path_mig_state);
> +	attr->qkey = resp->attrs.qkey;
> +	attr->rq_psn = resp->attrs.rq_psn;
> +	attr->sq_psn = resp->attrs.sq_psn;
> +	attr->dest_qp_num = resp->attrs.dest_qp_num;
> +	attr->qp_access_flags =
> +		pvrdma_access_flags_to_ib(resp->attrs.qp_access_flags);
> +	attr->pkey_index = resp->attrs.pkey_index;
> +	attr->alt_pkey_index = resp->attrs.alt_pkey_index;
> +	attr->en_sqd_async_notify = resp->attrs.en_sqd_async_notify;
> +	attr->sq_draining = resp->attrs.sq_draining;
> +	attr->max_rd_atomic = resp->attrs.max_rd_atomic;
> +	attr->max_dest_rd_atomic = resp->attrs.max_dest_rd_atomic;
> +	attr->min_rnr_timer = resp->attrs.min_rnr_timer;
> +	attr->port_num = resp->attrs.port_num;
> +	attr->timeout = resp->attrs.timeout;
> +	attr->retry_cnt = resp->attrs.retry_cnt;
> +	attr->rnr_retry = resp->attrs.rnr_retry;
> +	attr->alt_port_num = resp->attrs.alt_port_num;
> +	attr->alt_timeout = resp->attrs.alt_timeout;
> +	pvrdma_qp_cap_to_ib(&attr->cap, &resp->attrs.cap);
> +	pvrdma_ah_attr_to_ib(&attr->ah_attr, &resp->attrs.ah_attr);
> +	pvrdma_ah_attr_to_ib(&attr->alt_ah_attr, &resp->attrs.alt_ah_attr);
> +
> +	qp->state = attr->qp_state;
> +
> +	ret = 0;
> +
> +out:
> +	attr->cur_qp_state = attr->qp_state;
> +
> +	init_attr->event_handler = qp->ibqp.event_handler;
> +	init_attr->qp_context = qp->ibqp.qp_context;
> +	init_attr->send_cq = qp->ibqp.send_cq;
> +	init_attr->recv_cq = qp->ibqp.recv_cq;
> +	init_attr->srq = qp->ibqp.srq;
> +	init_attr->xrcd = NULL;
> +	init_attr->cap = attr->cap;
> +	init_attr->sq_sig_type = 0;
> +	init_attr->qp_type = qp->ibqp.qp_type;
> +	init_attr->create_flags = 0;
> +	init_attr->port_num = qp->port;
> +
> +	mutex_unlock(&qp->mutex);
> +	return ret;
> +}
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v4 13/16] IB/pvrdma: Add the main driver module for PVRDMA
From: Yuval Shaia @ 2016-09-14 16:38 UTC (permalink / raw)
  To: Adit Ranadive
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	pv-drivers-pghWNbHTmq7QT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-pci-u79uwXL29TY76Z2rM5mHXA, jhansen-pghWNbHTmq7QT0dZR+AlfA,
	asarwade-pghWNbHTmq7QT0dZR+AlfA,
	georgezhang-pghWNbHTmq7QT0dZR+AlfA,
	bryantan-pghWNbHTmq7QT0dZR+AlfA
In-Reply-To: <1473655766-31628-14-git-send-email-aditr-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org>

On Sun, Sep 11, 2016 at 09:49:23PM -0700, Adit Ranadive wrote:
+
> +static void pvrdma_netdevice_event_handle(struct pvrdma_dev *dev,
> +					  unsigned long event)
> +{
> +	struct net_device *netdev;
> +
> +	netdev = dev->netdev;

Please remove the above two lines.

> +	switch (event) {
> +	case NETDEV_REBOOT:
> +	case NETDEV_DOWN:
> +		pvrdma_dispatch_event(dev, 1, IB_EVENT_PORT_ERR);
> +		break;
> +	case NETDEV_UP:
> +		pvrdma_dispatch_event(dev, 1, IB_EVENT_PORT_ACTIVE);
> +		break;
> +	default:
> +		dev_dbg(&dev->pdev->dev, "ignore netdevice event %ld on %s\n",
> +			event, dev->ib_dev.name);
> +		break;
> +	}
> +}
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [06/26] ath: constify local structures
From: Kalle Valo @ 2016-09-14 17:02 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Luis R. Rodriguez, joe, kernel-janitors, linux-wireless, netdev,
	linux-kernel
In-Reply-To: <1473599168-30561-7-git-send-email-Julia.Lawall@lip6.fr>

Julia Lawall <Julia.Lawall@lip6.fr> wrote:
> For structure types defined in the same file or local header files, find
> top-level static structure declarations that have the following
> properties:
> 1. Never reassigned.
> 2. Address never taken
> 3. Not passed to a top-level macro call
> 4. No pointer or array-typed field passed to a function or stored in a
> variable.
> Declare structures having all of these properties as const.
> 
> Done using Coccinelle.
> Based on a suggestion by Joe Perches <joe@perches.com>.
> 
> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

Thanks, 3 patches applied to wireless-drivers-next.git:

8136fd58ad60 ath: constify local structures
1dc80798a8ca iwlegacy: constify local structures
d86e64768859 rtlwifi: rtl818x: constify local structures

-- 
Sent by pwcli
https://patchwork.kernel.org/patch/9325363/

^ permalink raw reply

* Re: [PATCHv2 net 0/6] sctp: fix the transmit err process
From: Marcelo Ricardo Leitner @ 2016-09-14 17:16 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, davem, Vlad Yasevich, daniel
In-Reply-To: <cover.1473789537.git.lucien.xin@gmail.com>

On Wed, Sep 14, 2016 at 02:04:17AM +0800, Xin Long wrote:
> This patchset is to improve the transmit err process and also fix some
> issues.
> 
> After this patchset, once the chunks are enqueued successfully, even
> if the chunks fail to send out, no matter because of nodst or nomem,
> no err retruns back to users any more. Instead, they are taken care
> of by retransmit.
> 
> v1->v2:
>   - add more details to the changelog in patch 1/6
>   - add Fixes: tag in patch 2/6, 3/6
>   - also revert 69b5777f2e57 in patch 3/6

Looks good to me, thanks

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

^ permalink raw reply

* Re: [RFC 00/11] QLogic RDMA Driver (qedr) RFC
From: Jason Gunthorpe @ 2016-09-14 17:17 UTC (permalink / raw)
  To: Amrani, Ram
  Cc: dledford@redhat.com, David Miller, Yuval Mintz, Ariel Elior,
	Michal Kalderon, Rajesh Borundia, linux-rdma@vger.kernel.org,
	netdev
In-Reply-To: <SN1PR07MB2207486633D88B5C65D547F9F8F10@SN1PR07MB2207.namprd07.prod.outlook.com>

On Wed, Sep 14, 2016 at 02:44:47PM +0000, Amrani, Ram wrote:
> > Anything that is used with copy_to/from_user, ib_copy_to/from_udata, 
> > etc, etc must be in a include/uapi header.
> > 
> > Any constant you might want to copy into your userspace provider must 
> > be in include/uapi.
> > 
> I understand you mean something like
> https://www.spinics.net/lists/linux-rdma/msg40414.html ([RFC
> rdma-next 0/3] Move mlx5 vendor specific to UAPI directory)

Yes.

> > Avoid copying headers in your user space and use the standard kernel 
> > names to access your driver's uapi.

> So the user library should be able to compile against these headers
> in their location and not hold identical copies.

Yes, but you will need identical copies for supporting old distros.

> What do you mean by "standard kernel names"?

By that I mean 'identical copies' do not copy the file and then
randomly change it giving things different names or putting different
content in structs.

You will want to submit your user provider to rdma-plumbing to get it
into the distros, we are planning to set it as the vehicle for code
targeting 4.9

Jason

^ permalink raw reply

* Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver
From: Jason Gunthorpe @ 2016-09-14 17:36 UTC (permalink / raw)
  To: Adit Ranadive
  Cc: dledford@redhat.com, linux-rdma@vger.kernel.org, pv-drivers,
	netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	Jorgen S. Hansen, Aditya Sarwade, George Zhang, Bryan Tan
In-Reply-To: <BLUPR0501MB836ED564F2AE7AB9221AC25C5FF0@BLUPR0501MB836.namprd05.prod.outlook.com>

On Mon, Sep 12, 2016 at 10:43:00PM +0000, Adit Ranadive wrote:
> On Mon, Sep 12, 2016 at 11:03:39 -0700, Jason Gunthorpe wrote:
> > On Sun, Sep 11, 2016 at 09:49:10PM -0700, Adit Ranadive wrote:
> > > [2] Libpvrdma User-level library - 
> > > http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary
> > 
> > You will probably find that rdma-plumbing will be the best way to get
> > your userspace component into the distributors.
> 
> Hi Jason,
> 
> Sorry I haven't paying attention to that discussion. Do you know how soon 
> distros will pick up the rdma-plumbing stuff?

We desire to use this as the vehical for the userspace included with
the 4.9 kernel.

I anticipate the tree will be running by Oct 1.

Jason

^ permalink raw reply

* Re: Modification to skb->queue_mapping affecting performance
From: Michael Ma @ 2016-09-14 17:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1473830571.22679.4.camel@edumazet-glaptop3.roam.corp.google.com>

2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>
>> I don't intend to install multiple qdisc - the only reason that I'm
>> doing this now is to leverage MQ to workaround the lock contention,
>> and based on the profile this all worked. However to simplify the way
>> to setup HTB I wanted to use TXQ to partition HTB classes so that a
>> HTB class only belongs to one TXQ, which also requires mapping skb to
>> TXQ using some rules (here I'm using priority but I assume it's
>> straightforward to use other information such as classid). And the
>> problem I found here is that when using priority to infer the TXQ so
>> that queue_mapping is changed, bandwidth is affected significantly -
>> the only thing I can guess is that due to queue switch, there are more
>> cache misses assuming processor cores have a static mapping to all the
>> queues. Any suggestion on what to do next for the investigation?
>>
>> I would also guess that this should be a common problem if anyone
>> wants to use MQ+IFB to workaround the qdisc lock contention on the
>> receiver side and classful qdisc is used on IFB, but haven't really
>> found a similar thread here...
>
> But why are you changing the queue ?
>
> NIC already does the proper RSS thing, meaning all packets of one flow
> should land on one RX queue. No need to ' classify yourself and risk
> lock contention'
>
> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
> problem.
>
> Do you really need to rate limit flows ? Not clear what are your goals,
> why for example you use HTB to begin with.
>
Yes. My goal is to set different min/max bandwidth limits for
different processes, so we started with HTB. However with HTB the
qdisc root lock contention caused some unintended correlation between
flows in different classes. For example if some flows belonging to one
class have large amount of small packets, other flows in a different
class will get their effective bandwidth reduced because they'll wait
longer for the root lock. Using MQ this can be avoided because I'll
just put flows belonging to one class to its dedicated TXQ. Then
classes within one HTB on a TXQ will still have the lock contention
problem but classes in different HTB will use different root locks so
the contention doesn't exist.

This also means that I'll need to classify packets to different
TXQ/HTB based on some skb metadata (essentially similar to what mqprio
is doing). So TXQ might need to be switched to achieve this.

^ permalink raw reply

* Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
From: David Ahern @ 2016-09-14 18:04 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश बंडेवार)
  Cc: Mahesh Bandewar, netdev, Eric Dumazet, David Miller
In-Reply-To: <CAF2d9jgEOsSEQXOVJWJsw=RMdr27YiTdcMPrZ6P2EopqU8wtkw@mail.gmail.com>

On 9/14/16 10:30 AM, Mahesh Bandewar (महेश बंडेवार) wrote:
> Hi David, thanks for the comments.
> 
> On Tue, Sep 13, 2016 at 8:24 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>> On 9/12/16 12:01 PM, Mahesh Bandewar wrote:
>>
>>> +struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
>>> +                           u16 proto)
>>> +{
>>> +     struct ipvl_addr *addr;
>>> +     struct net_device *sdev;
>>> +
>>> +     addr = ipvlan_skb_to_addr(skb, dev);
>>> +     if (!addr)
>>> +             goto out;
>>> +
>>> +     sdev = addr->master->dev;
>>> +     switch (proto) {
>>> +     case AF_INET:
>>> +     {
>>> +             int err;
>>> +             struct iphdr *ip4h = ip_hdr(skb);
>>> +
>>> +             err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
>>> +                                        ip4h->tos, sdev);
>>> +             if (unlikely(err))
>>> +                     goto out;
>>> +             break;
>>> +     }
>>> +     case AF_INET6:
>>> +     {
>>> +             struct dst_entry *dst;
>>> +             struct ipv6hdr *ip6h = ipv6_hdr(skb);
>>> +             int flags = RT6_LOOKUP_F_HAS_SADDR;
>>> +             struct flowi6 fl6 = {
>>> +                     .flowi6_iif   = sdev->ifindex,
>>> +                     .daddr        = ip6h->daddr,
>>> +                     .saddr        = ip6h->saddr,
>>> +                     .flowlabel    = ip6_flowinfo(ip6h),
>>> +                     .flowi6_mark  = skb->mark,
>>> +                     .flowi6_proto = ip6h->nexthdr,
>>> +             };
>>> +
>>> +             skb_dst_drop(skb);
>>> +             dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
>>> +             skb_dst_set(skb, dst);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             break;
>>> +     }
>>
>> Nit: why not put the above in separate per-version functions (ipvlan_ip_rcv and ipvlan_ip6_rcv) similar to what is done for ipvlan_process_outbound?
>>
> I can but it's small enough to have it together. Also 'proto' is a
> parameter to the handler and makes it easier However do you see any
> issue having just one function?

no, just readability comment about putting the case statements in helper functions.

> 
>>
>>> +
>>> +out:
>>> +     return skb;
>>> +}
>>> +
>>> +unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
>>> +                          const struct nf_hook_state *state)
>>> +{
>>> +     struct ipvl_addr *addr;
>>> +     unsigned int len;
>>> +
>>> +     addr = ipvlan_skb_to_addr(skb, skb->dev);
>>> +     if (!addr)
>>> +             goto out;
>>> +
>>> +     skb->dev = addr->master->dev;
>>> +     len = skb->len + ETH_HLEN;
>>> +     ipvlan_count_rx(addr->master, len, true, false);
>>> +out:
>>> +     return NF_ACCEPT;
>>> +}
>>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
>>> index 18b4e8c7f68a..d02be277e1db 100644
>>> --- a/drivers/net/ipvlan/ipvlan_main.c
>>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>>> @@ -9,24 +9,65 @@
>>>
>>>  #include "ipvlan.h"
>>>
>>> +static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
>>> +     {
>>> +             .hook     = ipvlan_nf_input,
>>> +             .pf       = NFPROTO_IPV4,
>>> +             .hooknum  = NF_INET_LOCAL_IN,
>>> +             .priority = INT_MAX,
>>> +     },
>>> +     {
>>> +             .hook     = ipvlan_nf_input,
>>> +             .pf       = NFPROTO_IPV6,
>>> +             .hooknum  = NF_INET_LOCAL_IN,
>>> +             .priority = INT_MAX,
>>> +     },
>>> +};
>>> +
>>> +static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
>>> +     .l3mdev_l3_rcv = ipvlan_l3_rcv,
>>> +};
>>> +
>>>  static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev)
>>>  {
>>>       ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
>>>  }
>>>
>>> -static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>>> +static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>>>  {
>>>       struct ipvl_dev *ipvlan;
>>> +     int err = 0;
>>>
>>> +     ASSERT_RTNL();
>>>       if (port->mode != nval) {
>>> +             if (nval == IPVLAN_MODE_L3S) {
>>> +                     port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
>>> +                     port->dev->priv_flags |= IFF_L3MDEV_MASTER;
>>> +                     if (!port->ipt_hook_added) {
>>> +                             err = _nf_register_hooks(ipvl_nfops,
>>> +                                                     ARRAY_SIZE(ipvl_nfops));
>>
>> That's clever. The hooks are not device based so why do the register for each device? Alternatively, you could use a static dst like VRF does for Tx. In the ipvlan rcv function set the dst input handler to send the packet back to the ipvlan driver via dst->input. From there send the packet through the netfilter hooks and then do the real lookup, update the dst and call its input function. I have working code for VRF driver somewhere that shows how to do this.
>>
> Do you mean per slave device?  It's registering it for a port (so only
> once) depending on the mode. If the mode != l3s it wont bother
> registering to keep current modes as they are.

My reading of the patch the register is called for each ipvlan newlink that uses l3s mode. Adding this debug patch 

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index d02be277e1db..3f733f1e18ae 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -46,6 +46,7 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
                        if (!port->ipt_hook_added) {
                                err = _nf_register_hooks(ipvl_nfops,
                                                        ARRAY_SIZE(ipvl_nfops));
+pr_warn("called _nf_register_hooks for dev %s: err %d\n", port->dev->name, err);
                                if (!err)
                                        port->ipt_hook_added = true;
                                else
@@ -54,9 +55,11 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
                } else {
                        port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
                        port->dev->l3mdev_ops = NULL;
-                       if (port->ipt_hook_added)
+                       if (port->ipt_hook_added) {
+pr_warn("calling _nf_unregister_hooks for dev %s\n", port->dev->name);
                                _nf_unregister_hooks(ipvl_nfops,
                                                     ARRAY_SIZE(ipvl_nfops));
+                       }
                        port->ipt_hook_added = false;
                }
                list_for_each_entry(ipvlan, &port->ipvlans, pnode) {



and building, installing and testing I see this:

$ ip link add ipvl1 link eth1 type ipvlan mode l3s
-->  called _nf_register_hooks for dev eth1: err 0

$ ip link add ipvl2 link eth1 type ipvlan mode l3s
--> no message generated

$ ip link add ipvl3 link eth2 type ipvlan mode l3s
--> called _nf_register_hooks for dev eth2: err 0


But there is more. If I delete all 3 ipvlan devices the nf_unregister is not called. Unload the ipvlan module and panic:

[  181.135061] BUG: unable to handle kernel paging request at ffffffffa002cfca
[  181.137710] IP: [<ffffffffa002cfca>] 0xffffffffa002cfca
[  181.139574] PGD 1a07067 PUD 1a08063 PMD 1387e4067 PTE 0
[  181.140964] Oops: 0010 [#1] SMP
[  181.141684] Modules linked in: 8021q garp mrp stp llc vrf [last unloaded: ipvlan]
[  181.143678] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-rc6+ #96
[  181.145092] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[  181.147340] task: ffff88013f196180 task.stack: ffff88013f19c000
[  181.148510] RIP: 0010:[<ffffffffa002cfca>]  [<ffffffffa002cfca>] 0xffffffffa002cfca
[  181.150044] RSP: 0018:ffff88013fc83bd0  EFLAGS: 00010a12
[  181.151102] RAX: ffff88013a781c88 RBX: ffff88013fc83c08 RCX: 0000000000000000
[  181.152460] RDX: ffff88013fc83c38 RSI: ffff88013ab15600 RDI: 0000000000000000
[  181.153781] RBP: ffff88013fc83bf8 R08: 0000000000004b61 R09: 0000000000000004
[  181.155107] R10: 0000000000000000 R11: ffffea00044d9c80 R12: ffffffff81a89510
[  181.156426] R13: ffff88013ab15600 R14: ffff88013fc83c38 R15: ffff88013ab15600
[  181.157742] FS:  0000000000000000(0000) GS:ffff88013fc80000(0000) knlGS:0000000000000000
[  181.159232] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  181.160303] CR2: ffffffffa002cfca CR3: 000000013a5de000 CR4: 00000000000406e0
[  181.161588] Stack:
[  181.161954]  ffffffff813f8b47 ffff88013ab15600 ffff88013ab15600 ffff88013fc83c38
[  181.163353]  ffff88013874ac4e ffff88013fc83c28 ffffffff813f8b8c ffff88013a781c88
[  181.164722]  ffff88013ab15600 ffffffff81a88b00 ffff88013874ac4e ffff88013fc83c88
[  181.166094] Call Trace:
[  181.166532]  <IRQ>
[  181.166885]  [<ffffffff813f8b47>] ? nf_iterate+0x41/0x5b
[  181.167880]  [<ffffffff813f8b8c>] nf_hook_slow+0x2b/0x94
[  181.168803]  [<ffffffff81400c6e>] ip_local_deliver+0xa4/0xbf
[  181.169748]  [<ffffffff81400644>] ? xfrm4_policy_check.constprop.8+0x52/0x52
[  181.170910]  [<ffffffff81400a71>] ip_rcv_finish+0x2ed/0x34a
[  181.171841]  [<ffffffff81400f02>] ip_rcv+0x279/0x2fb
...


Also, another sequence:
$ ip link add ipvl1 link eth1 type ipvlan mode l3s
-->  called _nf_register_hooks for dev eth1: err 0

$ ip link add ipvl2 link eth1 type ipvlan mode l3s
--> no message generated

$ ip link set ipvl2 type ipvlan mode l3
--> calling _nf_unregister_hooks for dev eth1

that means the hooks are not there for ipvl1. I can remove the module with no panic.

^ permalink raw reply related

* Re: [RFC 02/11] Add RoCE driver framework
From: Mintz, Yuval @ 2016-09-14 18:25 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Yuval Mintz, Mark Bloch, Ram Amrani,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, David Miller,
	Ariel Elior, Michal Kalderon, Rajesh Borundia,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, netdev
In-Reply-To: <20160914130032.GB26069-2ukJVAZIZ/Y@public.gmane.org>

> > > > >> >> +uint debug;
> > > > >> >> +module_param(debug, uint, 0);
> > > > > >>> +MODULE_PARM_DESC(debug, "Default debug msglevel");
> > > > >>
> > > > >> >Why are you adding this as a module parameter?
> > > > >>
> > > > >>  I believe this is mostly to follow same line as qede which also defines
> > > > > > 'debug' module parameter for allowing easy user control of debug
> > > > > > prints [& specifically for probe prints, which can't be controlled
> > > > > > otherwise].
> > > >
> > > > > Can you give us an example where dynamic debug and tracing infrastructures
> > > > > are not enough?
> > > >
> > > > > AFAIK, most of these debug module parameters are legacy copy/paste
> > > > > code which is useless in real life scenarios.
> > > >
> > > > Define 'enough'; Using dynamic debug you can provide all the necessary
> > > > information and at an even better granularity that's achieved by suggested
> >  > > infrastructure,  but is harder for an end-user to use. Same goes for tracing.
> > > >
> > > > The 'debug' option provides an easy grouping for prints related to a specific
> > > > area in the driver.
> > >
> > > It is hard to agree with you that user which knows how-to load modules
> > > with parameters won't success to enable debug prints.
> >
> > I think you're giving too much credit to the end-user. :-D
> >
> > > In addition, global increase in debug level for whole driver will create
> > > printk storm in dmesg and give nothing to debuggability.
> >
> > So basically, what you're claiming is that ethtool 'msglvl' setting for devices
> > is completely obselete. While this *might* be true, we use it extensively
> > in our qede and qed drivers; The debug module parameter merely provides
> > a manner of setting the debug value prior to initial probe for all interfaces.
> > qedr follows the same practice.

> Thanks for this excellent example. Ethtool 'msglvl' adds this
> dynamically, while your DEBUG argument works for loading module
> only.

> If you want dynamic prints, you have two options:
> 1. Add support of ethtool to whole RDMA stack.
> 2. Use dynamic tracing infrastructure.

> Which option do you prefer?
Option 3 - continuing this discussion. :-)

Perhaps I misread your intentions - I thought that by dynamic debug
you meant that all debug in RDMA should be pr_debug() based, and
therefore my objection regarding the ease with which users can
configure it. 
If all you meant was 'dynamically set' as opposed to 'statically set'
then I agree that having that sort of configurability is preferable
[Even though end-user would still probably prefer a module
parameter for reproductions; As the name implies, 'debug' isn't
meant to be used in other situations].

The other thing to consider are the probe-time prints.
Problem is, you wouldn't have a control node between probe 
and until after the probing would be over, so it would be a bit
hard to configure that.
You can always think of some generic method of fixing that as well
[sysfs node for the entire system for probe-time prints, perhaps?]

Do notice you would be harming user-experience of reproductions
though - as it would have to follow different mechanisms to open
debug prints of various qed* components.--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks
From: Andy Lutomirski @ 2016-09-14 18:27 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel@vger.kernel.org, Alexei Starovoitov, Arnd Bergmann,
	Casey Schaufler, Daniel Borkmann, Daniel Mack, David Drysdale,
	David S . Miller, Elena Reshetova, Eric W . Biederman,
	James Morris, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Tejun Heo, Will Drewry,
	kernel-hardening@lists.openwall.com, Linux API <linu
In-Reply-To: <20160914072415.26021-19-mic@digikod.net>

On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün <mic@digikod.net> wrote:
> Add a new flag CGRP_NO_NEW_PRIVS for each cgroup. This flag is initially
> set for all cgroup except the root. The flag is clear when a new process
> without the no_new_privs flags is attached to the cgroup.
>
> If a cgroup is landlocked, then any new attempt, from an unprivileged
> process, to attach a process without no_new_privs to this cgroup will
> be denied.

Until and unless everyone can agree on a way to properly namespace,
delegate, etc cgroups, I think that trying to add unprivileged
semantics to cgroups is nuts.  Given the big thread about cgroup v2,
no-internal-tasks, etc, I just don't see how this approach can be
viable.

Can we try to make landlock work completely independently of cgroups
so that it doesn't get stuck and so that programs can use it without
worrying about cgroup v1 vs v2, interactions with cgroup managers,
cgroup managers that (supposedly?) will start migrating processes
around piecemeal and almost certainly blowing up landlock in the
process, etc?

I have no problem with looking at prototypes for how landlock +
cgroups would work, but I can't imagine the result being mergeable.

^ permalink raw reply

* Re: [RFC v3 19/22] landlock: Add interrupted origin
From: Andy Lutomirski @ 2016-09-14 18:29 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel@vger.kernel.org, Alexei Starovoitov, Arnd Bergmann,
	Casey Schaufler, Daniel Borkmann, Daniel Mack, David Drysdale,
	David S . Miller, Elena Reshetova, Eric W . Biederman,
	James Morris, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Tejun Heo, Will Drewry,
	kernel-hardening@lists.openwall.com, Linux API <linu
In-Reply-To: <20160914072415.26021-20-mic@digikod.net>

On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün <mic@digikod.net> wrote:
> This third origin of hook call should cover all possible trigger paths
> (e.g. page fault). Landlock eBPF programs can then take decisions
> accordingly.
>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Kees Cook <keescook@chromium.org>
> ---


>
> +       if (unlikely(in_interrupt())) {

IMO security hooks have no business being called from interrupts.
Aren't they all synchronous things done by tasks?  Interrupts are
driver things.

Are you trying to check for page faults and such?

--Andy

^ permalink raw reply

* Re: [RFC v3 11/22] seccomp,landlock: Handle Landlock hooks per process hierarchy
From: Andy Lutomirski @ 2016-09-14 18:43 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel@vger.kernel.org, Alexei Starovoitov, Arnd Bergmann,
	Casey Schaufler, Daniel Borkmann, Daniel Mack, David Drysdale,
	David S . Miller, Elena Reshetova, Eric W . Biederman,
	James Morris, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Tejun Heo, Will Drewry,
	kernel-hardening@lists.openwall.com, Linux API <linu
In-Reply-To: <20160914072415.26021-12-mic@digikod.net>

On Wed, Sep 14, 2016 at 12:24 AM, Mickaël Salaün <mic@digikod.net> wrote:
> A Landlock program will be triggered according to its subtype/origin
> bitfield. The LANDLOCK_FLAG_ORIGIN_SECCOMP value will trigger the
> Landlock program when a seccomp filter will return RET_LANDLOCK.
> Moreover, it is possible to return a 16-bit cookie which will be
> readable by the Landlock programs in its context.

Are you envisioning that the filters will return RET_LANDLOCK most of
the time or rarely?  If it's most of the time, then maybe this could
be simplified a bit by unconditionally calling the landlock filter and
letting the landlock filter access a struct seccomp_data if needed.


>
> Only seccomp filters loaded from the same thread and before a Landlock
> program can trigger it through LANDLOCK_FLAG_ORIGIN_SECCOMP. Multiple
> Landlock programs can be triggered by one or more seccomp filters. This
> way, each RET_LANDLOCK (with specific cookie) will trigger all the
> allowed Landlock programs once.

This interface seems somewhat awkward.  Should we not have a way to
atomicaly install a whole pile of landlock filters and associated
seccomp filter all at once?

--Andy

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: deprecate eth_change_mtu, remove usage
From: Jarod Wilson @ 2016-09-14 18:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: David S. Miller, netdev
In-Reply-To: <201609140519.GHFnOaaT%fengguang.wu@intel.com>

On Wed, Sep 14, 2016 at 05:59:14AM +0800, kbuild test robot wrote:
> Hi Jarod,
> 
> [auto build test ERROR on net-next/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Jarod-Wilson/net-centralize-net_device-min-max-MTU-checking/20160913-042130
> config: mips-xway_defconfig (attached as .config)
> compiler: mips-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=mips 
> 
> All error/warnings (new ones prefixed by >>):
> 
>    drivers/net/ethernet/lantiq_etop.c: In function 'ltq_etop_change_mtu':
> >> drivers/net/ethernet/lantiq_etop.c:524:7: error: 'ret' undeclared (first use in this function)
>      if (!ret) {
>           ^
>    drivers/net/ethernet/lantiq_etop.c:524:7: note: each undeclared identifier is reported only once for each function it appears in
> >> drivers/net/ethernet/lantiq_etop.c:534:1: warning: control reaches end of non-void function [-Wreturn-type]
>     }
>     ^

Crap. Did build-test, but only x86_64, so I missed this apparently
mips-specific breakage. Locally fixed up, but waiting to see if anyone has
any other feedback about these two patches before I go ahead with v2. Also
have a ~15-part drivers/net/ethernet/ cleanup series ready and waiting,
pending some traction with the core infra.

-- 
Jarod Wilson
jarod@redhat.com

^ permalink raw reply

* Re: [PATCHv2 next 3/3] ipvlan: Introduce l3s mode
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-09-14 18:47 UTC (permalink / raw)
  To: David Ahern; +Cc: Mahesh Bandewar, netdev, Eric Dumazet, David Miller
In-Reply-To: <14bd5ab7-9206-1399-01c3-4b4ed06ce6f9@cumulusnetworks.com>

On Wed, Sep 14, 2016 at 11:04 AM, David Ahern <dsa@cumulusnetworks.com> wrote:
[...]
>>>> +     ASSERT_RTNL();
>>>>       if (port->mode != nval) {
>>>> +             if (nval == IPVLAN_MODE_L3S) {
>>>> +                     port->dev->l3mdev_ops = &ipvl_l3mdev_ops;
>>>> +                     port->dev->priv_flags |= IFF_L3MDEV_MASTER;
>>>> +                     if (!port->ipt_hook_added) {
>>>> +                             err = _nf_register_hooks(ipvl_nfops,
>>>> +                                                     ARRAY_SIZE(ipvl_nfops));
>>>
>>> That's clever. The hooks are not device based so why do the register for each device? Alternatively, you could use a static dst like VRF does for Tx. In the ipvlan rcv function set the dst input handler to send the packet back to the ipvlan driver via dst->input. From there send the packet through the netfilter hooks and then do the real lookup, update the dst and call its input function. I have working code for VRF driver somewhere that shows how to do this.
>>>
>> Do you mean per slave device?  It's registering it for a port (so only
>> once) depending on the mode. If the mode != l3s it wont bother
>> registering to keep current modes as they are.
>
> My reading of the patch the register is called for each ipvlan newlink that uses l3s mode. Adding this debug patch
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index d02be277e1db..3f733f1e18ae 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -46,6 +46,7 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>                         if (!port->ipt_hook_added) {
>                                 err = _nf_register_hooks(ipvl_nfops,
>                                                         ARRAY_SIZE(ipvl_nfops));
> +pr_warn("called _nf_register_hooks for dev %s: err %d\n", port->dev->name, err);
>                                 if (!err)
>                                         port->ipt_hook_added = true;
>                                 else
> @@ -54,9 +55,11 @@ static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
>                 } else {
>                         port->dev->priv_flags &= ~IFF_L3MDEV_MASTER;
>                         port->dev->l3mdev_ops = NULL;
> -                       if (port->ipt_hook_added)
> +                       if (port->ipt_hook_added) {
> +pr_warn("calling _nf_unregister_hooks for dev %s\n", port->dev->name);
>                                 _nf_unregister_hooks(ipvl_nfops,
>                                                      ARRAY_SIZE(ipvl_nfops));
> +                       }
>                         port->ipt_hook_added = false;
>                 }
>                 list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
>
>
>
> and building, installing and testing I see this:
>
> $ ip link add ipvl1 link eth1 type ipvlan mode l3s
> -->  called _nf_register_hooks for dev eth1: err 0
>
> $ ip link add ipvl2 link eth1 type ipvlan mode l3s
> --> no message generated
>
> $ ip link add ipvl3 link eth2 type ipvlan mode l3s
> --> called _nf_register_hooks for dev eth2: err 0
>
The first newlink (l3s) will register the hook and subsequent newlink
calls will skip registration. This is why you did not see the message
for the ipvl2 link creation. However lpvl3 is a added on top of eth2
while the first two link were added on eth1. So this is new port
creation and hence the driver tried to register the hook again.

>
> But there is more. If I delete all 3 ipvlan devices the nf_unregister is not called. Unload the ipvlan module and panic:
>
> [  181.135061] BUG: unable to handle kernel paging request at ffffffffa002cfca
> [  181.137710] IP: [<ffffffffa002cfca>] 0xffffffffa002cfca
> [  181.139574] PGD 1a07067 PUD 1a08063 PMD 1387e4067 PTE 0
> [  181.140964] Oops: 0010 [#1] SMP
> [  181.141684] Modules linked in: 8021q garp mrp stp llc vrf [last unloaded: ipvlan]
> [  181.143678] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-rc6+ #96
> [  181.145092] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [  181.147340] task: ffff88013f196180 task.stack: ffff88013f19c000
> [  181.148510] RIP: 0010:[<ffffffffa002cfca>]  [<ffffffffa002cfca>] 0xffffffffa002cfca
> [  181.150044] RSP: 0018:ffff88013fc83bd0  EFLAGS: 00010a12
> [  181.151102] RAX: ffff88013a781c88 RBX: ffff88013fc83c08 RCX: 0000000000000000
> [  181.152460] RDX: ffff88013fc83c38 RSI: ffff88013ab15600 RDI: 0000000000000000
> [  181.153781] RBP: ffff88013fc83bf8 R08: 0000000000004b61 R09: 0000000000000004
> [  181.155107] R10: 0000000000000000 R11: ffffea00044d9c80 R12: ffffffff81a89510
> [  181.156426] R13: ffff88013ab15600 R14: ffff88013fc83c38 R15: ffff88013ab15600
> [  181.157742] FS:  0000000000000000(0000) GS:ffff88013fc80000(0000) knlGS:0000000000000000
> [  181.159232] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  181.160303] CR2: ffffffffa002cfca CR3: 000000013a5de000 CR4: 00000000000406e0
> [  181.161588] Stack:
> [  181.161954]  ffffffff813f8b47 ffff88013ab15600 ffff88013ab15600 ffff88013fc83c38
> [  181.163353]  ffff88013874ac4e ffff88013fc83c28 ffffffff813f8b8c ffff88013a781c88
> [  181.164722]  ffff88013ab15600 ffffffff81a88b00 ffff88013874ac4e ffff88013fc83c88
> [  181.166094] Call Trace:
> [  181.166532]  <IRQ>
> [  181.166885]  [<ffffffff813f8b47>] ? nf_iterate+0x41/0x5b
> [  181.167880]  [<ffffffff813f8b8c>] nf_hook_slow+0x2b/0x94
> [  181.168803]  [<ffffffff81400c6e>] ip_local_deliver+0xa4/0xbf
> [  181.169748]  [<ffffffff81400644>] ? xfrm4_policy_check.constprop.8+0x52/0x52
> [  181.170910]  [<ffffffff81400a71>] ip_rcv_finish+0x2ed/0x34a
> [  181.171841]  [<ffffffff81400f02>] ip_rcv+0x279/0x2fb
> ...
>
Probably creating two ports (on two physical interfaces) is not a
common / use case but that doesn't mean kernel should crash either!
>
> Also, another sequence:
> $ ip link add ipvl1 link eth1 type ipvlan mode l3s
> -->  called _nf_register_hooks for dev eth1: err 0
>
> $ ip link add ipvl2 link eth1 type ipvlan mode l3s
> --> no message generated
>
> $ ip link set ipvl2 type ipvlan mode l3
> --> calling _nf_unregister_hooks for dev eth1
>
> that means the hooks are not there for ipvl1. I can remove the module with no panic.
>
tl;dr you found an issue for sure! The hook registration has to be
global than per port and I'll fix that in the next rev. Thank you for
taking to time to test this.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox