* Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
From: Roland Dreier @ 2010-05-12 19:56 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172344.GC12286@mtls03>
> @@ -1017,9 +1020,12 @@ static void ib_sa_add_one(struct ib_device *device)
> sa_dev->end_port = e;
>
> for (i = 0; i <= e - s; ++i) {
> + spin_lock_init(&sa_dev->port[i].ah_lock);
> + if (rdma_port_link_layer(device, i + 1) != IB_LINK_LAYER_INFINIBAND)
> + continue;
Not sure I understand why you move the initialization of the spinlock up
here? It seems we ignore everything that might have to do with spinlock
if this is an IBoE port.
But the larger issue is what if someone calls one of the ib_sa_XXX_query
functions on an IBoE port? Seems we just crash on uninitialized
structures. I guess we're assuming that the kernel is smart enough not
to do that?
Also I'm wondering why you did the "count" stuff to ignore IBoE-only
devices in multicast.c but not sa_query.c?
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing
From: Or Gerlitz @ 2010-05-12 20:00 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma, Mike Christie
In-Reply-To: <adaocglulmi.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
Roland Dreier <rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> wrote:
> I have these 3 + Dan Carpenter's fix applied now.
cool
Or.
* Re: [PATCHv8 04/11] ib_core: IBoE CMA device binding
From: Roland Dreier @ 2010-05-12 20:03 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172403.GE12286@mtls03>
> int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
> - struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr)
> + struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr,
> + int force_grh)
Rather than this change in API, could we just have this function look at
the link layer, and if it's ethernet, then always set the GRH flag?
Seems simpler than requiring the upper layers to do this and then pass
the result in?
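A minimal sketch of that alternative, using stand-in types rather than the real ib_core structures: the helper inspects the port's link layer itself and sets the GRH flag for Ethernet, so callers need no force_grh argument. The names sketch_link_layer, sketch_ah_attr, sketch_init_ah and SKETCH_AH_GRH are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's link-layer enum and AH attributes. */
enum sketch_link_layer {
	SKETCH_LINK_LAYER_INFINIBAND,
	SKETCH_LINK_LAYER_ETHERNET,
};

#define SKETCH_AH_GRH 0x1	/* stand-in for the "AH carries a GRH" flag */

struct sketch_ah_attr {
	uint8_t ah_flags;
	uint8_t port_num;
};

/*
 * Fill in the AH attributes for a port. The GRH flag is set when the
 * path needs one (path_needs_grh) or unconditionally when the port's
 * link layer is Ethernet.
 */
static void sketch_init_ah(enum sketch_link_layer ll, uint8_t port_num,
			   int path_needs_grh, struct sketch_ah_attr *ah_attr)
{
	ah_attr->port_num = port_num;
	ah_attr->ah_flags = 0;

	if (path_needs_grh || ll == SKETCH_LINK_LAYER_ETHERNET)
		ah_attr->ah_flags |= SKETCH_AH_GRH;
}
```

With this shape the link-layer decision lives in one place, at the cost of the helper no longer being usable to build a GRH-less AH on an Ethernet port.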
* Re: [PATCHv8 04/11] ib_core: IBoE CMA device binding
From: Roland Dreier @ 2010-05-12 20:05 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172403.GE12286@mtls03>
> +static void iboe_mcast_work_handler(struct work_struct *work)
> +{
> + struct iboe_mcast_work *mw = container_of(work, struct iboe_mcast_work, work);
> + struct cma_multicast *mc = mw->mc;
> + struct ib_sa_multicast *m = mc->multicast.ib;
> +
> + mc->multicast.ib->context = mc;
> + cma_ib_mc_handler(0, m);
> + kref_put(&mc->mcref, release_mc);
> + kfree(mw);
> +}
I'm having a hard time working out why the iboe case needs to schedule
to a work queue here since it's already in process context, right? It
seems it would be really preferable to avoid all the extra pointer
munging and reference counting, and just call things directly.
- R.
* Re: [PATCHv8 05/11] ib_core: IBoE UD packet packing support
From: Roland Dreier @ 2010-05-12 20:06 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172411.GF12286@mtls03>
> 2. Fix wrong implementation of ib_ud_header_init(). A different patch was sent
> to Roland.
This patch no longer applies, I guess because you already sent me this
fix (which is now upstream since 2.6.34-rc1).
* Re: [PATCHv8 04/11] ib_core: IBoE CMA device binding
From: Roland Dreier @ 2010-05-12 20:14 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172403.GE12286@mtls03>
> Multicast GIDs are always mapped to multicast MACs
> as is done in IPv6. Some helper functions are added to ib_addr.h. IPv4
> multicast is enabled by translating IPv4 multicast addresses to IPv6 multicast
> as described in
> http://www.mail-archive.com/ipng-Va8kgSFw1KyzQ7sTrsxvMkEOCMrvLtNR@public.gmane.org/msg02134.html.
I guess it's a bit unfortunate that the RoCE annex completely ignored
how to map multicast GIDs to ethernet addresses (I suppose as part of
the larger decision to ignore address resolution entirely). Anyway,
looking at the email message you reference, it seems to be someone
asking what the right way to map IPv4 multicast addresses to IPv6
addresses is. Is there a more definitive document you can point to?
It seems that unfortunately the way the layering of addresses is done,
there's no way to just use the standard mapping of IPv4 multicast
addresses to Ethernet addresses (since IBoE does addressing via
the CMA mapping to GIDs followed by an unspecified mapping from GIDs to
Ethernet addresses).
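For reference, the IPv6 convention that the patch description alludes to (RFC 2464: 33:33 followed by the low 32 bits of the address) can be sketched as below. This is only an illustration of that rule applied to a 16-byte GID; sketch_mgid_to_mac is a hypothetical name, and whether the helper added to ib_addr.h matches it byte for byte is an assumption.

```c
#include <stdint.h>
#include <string.h>

/*
 * Map a 16-byte multicast GID to an Ethernet multicast MAC using the
 * IPv6 rule: 33:33 followed by the last four bytes of the address.
 * sketch_mgid_to_mac() is a hypothetical name, not the patch's helper.
 */
static void sketch_mgid_to_mac(const uint8_t gid[16], uint8_t mac[6])
{
	mac[0] = 0x33;
	mac[1] = 0x33;
	memcpy(&mac[2], &gid[12], 4);
}
```

Note that only 32 bits of the GID survive the mapping, so distinct multicast GIDs can share one Ethernet address and receivers still have to filter on the full GID.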
- R.
* Re: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
From: Roland Dreier @ 2010-05-12 20:28 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172425.GH12286@mtls03>
> Add ib_uverbs_get_eth_l2_addr() to allow ibv_create_ah() to resolve <sgid,
> dgid> to <vlan, dmac> for any gid type. Although user-space might bypass this
> call for link-local gids, it is better not to replicate the kernel resolution
> policy. Port link layer is also returned by ibv_query_port().
A high-level comment/question, followed by some notes about the specific patch.
At the highest level, is having this very low-level command exposed as
part of the kernel uverbs <-> userspace API the right place to split
things? Making the Ethernet address resolution part of the low-level
driver implies that it's not really a generic part of the verbs interface.
Maybe it is generic, and we should have a generic function instead of
calling into the low-level driver. I see the argument that we should
keep the policy in the kernel, although I'm not sure how strong that
argument is -- once we start shipping a kernel with a certain policy
(and I guess OFED has in effect already done that), how could we ever
change that policy? We'll have interoperability issues anyway, so it
seems having userspace and kernel use different policies doesn't cause
many further problems anyway.
Or maybe it is device-specific, and we could wrap it up into the create
AH uverbs call we already have?
Low-level comments:
> +ssize_t ib_uverbs_get_eth_l2_addr(struct ib_uverbs_file *file, const char __user *buf,
> + int in_len, int out_len)
> +{
> + struct ib_uverbs_get_eth_l2_addr cmd;
> + struct ib_uverbs_get_eth_l2_addr_resp resp;
> + int ret;
> + struct ib_pd *pd;
> +
> + if (out_len < sizeof resp)
> + return -ENOSPC;
> +
> + if (copy_from_user(&cmd, buf, sizeof cmd))
> + return -EFAULT;
> +
> + pd = idr_read_pd(cmd.pd_handle, file->ucontext);
> + if (!pd)
> + return -EINVAL;
> +
> + ret = ib_get_eth_l2_addr(pd->device, cmd.port, (union ib_gid *)cmd.gid,
> + cmd.sgid_idx, resp.mac, &resp.vlan_id, &resp.tagged);
> + put_pd_read(pd);
> + if (!ret) {
> + if (copy_to_user((void __user *) (unsigned long) cmd.response,
> + &resp, sizeof resp))
> + return -EFAULT;
This leaks kernel memory contents to userspace since the stack variable
resp is never cleared. Also will cause problems if we ever need to use
the reserved fields for anything.
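A minimal sketch of the usual fix, zeroing the response before filling it in so that reserved fields and padding never carry stale stack contents. The structure layout and the names sketch_l2_addr_resp and sketch_fill_resp are hypothetical; the real ib_uverbs_get_eth_l2_addr_resp may differ.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical response layout; the real structure's fields may differ. */
struct sketch_l2_addr_resp {
	uint8_t  mac[6];
	uint16_t vlan_id;
	uint8_t  tagged;
	uint8_t  reserved[3];
};

/*
 * Build the response in a caller-supplied buffer. Zeroing first means
 * the reserved bytes and any padding are well defined before the
 * structure is copied out to userspace.
 */
static void sketch_fill_resp(struct sketch_l2_addr_resp *resp,
			     const uint8_t mac[6], uint16_t vlan_id,
			     uint8_t tagged)
{
	memset(resp, 0, sizeof *resp);
	memcpy(resp->mac, mac, sizeof resp->mac);
	resp->vlan_id = vlan_id;
	resp->tagged = tagged;
}
```

In the handler quoted above, the equivalent would be a memset of resp before the call that fills in mac, vlan_id and tagged.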
Also I'm not sure I understand why you pass the PD into this method? It
seems you just use it to get a pointer to the device, but you already
have that from the uverbs_file structure that's passed into all commands.
> +int ib_get_eth_l2_addr(struct ib_device *device, u8 port, union ib_gid *gid,
> + int sgid_idx, u8 *mac, __u16 *vlan_id, u8 *tagged)
> +{
> + if (!device->get_eth_l2_addr)
> + return -ENOSYS;
> +
> + return device->get_eth_l2_addr(device, port, gid, sgid_idx, mac, vlan_id, tagged);
> +}
> +EXPORT_SYMBOL(ib_get_eth_l2_addr);
I don't think we need this wrapper, since uverbs can just call the
get_eth_l2_addr method directly (we already do that for quite a few
other methods, eg alloc_ucontext is a uverbs-only method that has no
in-kernel wrapper). Also the -ENOSYS test isn't necessary, since a
device-driver shouldn't allow this method unless it actually implements
it.
- R.
* Re: [PATCHv8 08/11] mlx4: Allow interfaces to correspond to each other
From: Roland Dreier @ 2010-05-12 20:30 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172430.GI12286@mtls03>
> +void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
> +{
> + return mlx4_find_get_prot_dev(dev, proto, port);
> +}
> +EXPORT_SYMBOL(mlx4_get_prot_dev);
Not sure I understand why you have a wrapper to call another function
with exactly the same parameters? Can't we get rid of this and just
rename mlx4_find_get_prot_dev() to mlx4_get_prot_dev()?
* Re: [PATCHv8 09/11] mlx4: Add support for IBoE - address resolution
From: Roland Dreier @ 2010-05-12 20:32 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172436.GJ12286@mtls03>
> + u8 mac_0_1[2];
> + u8 mac_2_5[4];
This looks a bit odd. Any reason why you don't just have "u8 mac[6];"
in this structure?
* RE: [PATCH] RDMA/ucma: Copy iWARP route information.
From: Sean Hefty @ 2010-05-12 21:52 UTC (permalink / raw)
To: 'Steve Wise'; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4BEB027B.5060500-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
>What does AF_UNSPEC imply about the format of the sockaddr?
That it depends on the usage..? I came up with this based on using ioctl to
manipulate the ARP cache:
SIOCSARP - Add a new entry to the ARP cache or modify an existing entry ...
arp_ha is a generic socket address structure with sa_family set to AF_UNSPEC and
sa_data containing the hardware address (e.g. the 6-byte Ethernet address).
The librdmacm should have enough context to interpret the meaning of the
query_gid response.
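The arp_ha convention described above can be sketched as follows: a generic sockaddr whose family is AF_UNSPEC and whose sa_data holds the raw hardware address. sketch_set_hwaddr is a hypothetical helper name, and how ucma actually fills the route information is not shown here.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Store a 6-byte Ethernet address in a generic sockaddr the way the
 * SIOCSARP text describes arp_ha: sa_family is AF_UNSPEC and sa_data
 * holds the raw hardware address.
 * sketch_set_hwaddr() is a hypothetical helper name.
 */
static void sketch_set_hwaddr(struct sockaddr *sa, const uint8_t mac[6])
{
	memset(sa, 0, sizeof *sa);
	sa->sa_family = AF_UNSPEC;
	memcpy(sa->sa_data, mac, 6);
}
```

The consumer has to know from context that sa_data is a MAC address, since AF_UNSPEC itself says nothing about the layout.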
- Sean
* Re: [PATCH] RDMA/ucma: Copy iWARP route information.
From: Steve Wise @ 2010-05-12 22:03 UTC (permalink / raw)
To: Sean Hefty; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <5A102C804D4348A5BCE504F433888BED-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
Sean Hefty wrote:
>> What does AF_UNSPEC imply about the format of the sockaddr?
>>
>
> That it depends on the usage..? I came up with this based on using ioctl to
> manipulate the ARP cache:
>
> SIOCSARP - Add a new entry to the ARP cache or modify an existing entry ...
> arp_ha is a generic socket address structure with sa_family set to AF_UNSPEC and
> sa_data containing the hardware address (e.g. the 6-byte Ethernet address).
>
> The librdmacm should have enough context to interpret the meaning of the
> query_gid response.
>
> - Sean
>
sounds ok to me.
* RE: [PATCH v2] libibverbs: add path record definitions to sa.h
From: Sean Hefty @ 2010-05-12 23:07 UTC (permalink / raw)
To: Hefty, Sean, 'Roland Dreier'; +Cc: linux-rdma
In-Reply-To: <63BD07796ED544AEAC6E41E669DC6EAC-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
Roland,
I'd like to release a new version of librdmacm that can support the user space
SA query feature in 2.6.33, which will also be part of OFED 1.5.2. Currently,
there's a dependency on the path record definition being part of libibverbs. Do
you have any opinions on the best way to handle this? I guess, ideally, I'd
like to see a released version of libibverbs include this, but I can think of
ways around this.
- Sean
* Re: [PATCH v2] libibverbs: add path record definitions to sa.h
From: Roland Dreier @ 2010-05-12 23:17 UTC (permalink / raw)
To: Sean Hefty; +Cc: linux-rdma
In-Reply-To: <870D78B1ADDD407388ADBFF921BCFAC6-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
> I'd like to release a new version of librdmacm that can support the user space
> SA query feature in 2.6.33, which will also be part of OFED 1.5.2. Currently,
> there's a dependency on the path record definition being part of libibverbs. Do
> you have any opinions on the best way to handle this? I guess, ideally, I'd
> like to see a released version of libibverbs include this, but I can think of
> ways around this.
I can release an updated version of libibverbs (I have enough stuff
pending that this is probably a good idea anyway). However could you do
some autoconf stuff so librdmacm works against older libibverbs (but
doesn't enable the stuff that can't be done without the missing stuff)?
Or maybe it's not worth it.
* RE: [PATCH v2] libibverbs: add path record definitions to sa.h
From: Sean Hefty @ 2010-05-12 23:21 UTC (permalink / raw)
To: 'Roland Dreier'; +Cc: linux-rdma
In-Reply-To: <ada39xwu2v9.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
>I can release an updated version of libibverbs (I have enough stuff
>pending that this is probably a good idea anyway). However could you do
>some autoconf stuff so librdmacm works against older libibverbs (but
>doesn't enable the stuff that can't be done without the missing stuff)?
>Or maybe it's not worth it.
I'll see if I can figure out how to update the autoconf stuff correctly.
Thanks.
* Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
From: Eli Cohen @ 2010-05-13 6:59 UTC (permalink / raw)
To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <ada632svqph.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Wed, May 12, 2010 at 12:56:58PM -0700, Roland Dreier wrote:
> > @@ -1017,9 +1020,12 @@ static void ib_sa_add_one(struct ib_device *device)
> > sa_dev->end_port = e;
> >
> > for (i = 0; i <= e - s; ++i) {
> > + spin_lock_init(&sa_dev->port[i].ah_lock);
> > + if (rdma_port_link_layer(device, i + 1) != IB_LINK_LAYER_INFINIBAND)
> > + continue;
>
> Not sure I understand why you move the initialization of the spinlock up
> here? It seems we ignore everything that might have to do with spinlock
> if this is an IBoE port.
We need the spinlock initialized for get_src_path_mask() which is
called by ib_init_ah_from_path() which in turn is called for IBoE
ports as well.
>
> But the larger issue is what if someone calls one of the ib_sa_XXX_query
> functions on an IBoE port? Seems we just crash on uninitialized
> structures. I guess we're assuming that the kernel is smart enough not
> to do that?
Yes, we're not calling the SA for IBoE.
>
> Also I'm wondering why you did the "count" stuff to ignore IBoE-only
> devices in multicast.c but not sa_query.c?
>
It seems to me this is the right place to put this logic, as the multicast
code registers as an IB client and the add_one implementation is in
multicast.c.
* Re: [PATCHv8 04/11] ib_core: IBoE CMA device binding
From: Eli Cohen @ 2010-05-13 7:24 UTC (permalink / raw)
To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <ada1vdgvqe9.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Wed, May 12, 2010 at 01:03:42PM -0700, Roland Dreier wrote:
> > int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
> > - struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr)
> > + struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr,
> > + int force_grh)
>
> Rather than this change in API, could we just have this function look at
> the link layer, and if it's ethernet, then always set the GRH flag?
> Seems simpler than requiring the upper layers to do this and then pass
> the result in?
I guess that would keep the function more versatile, but changing the
implementation to check the port's link layer seems OK to me.
* Re: [PATCHv8 04/11] ib_core: IBoE CMA device binding
From: Eli Cohen @ 2010-05-13 8:26 UTC (permalink / raw)
To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <adawrv8ubrh.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Wed, May 12, 2010 at 01:05:06PM -0700, Roland Dreier wrote:
> > +static void iboe_mcast_work_handler(struct work_struct *work)
> > +{
> > + struct iboe_mcast_work *mw = container_of(work, struct iboe_mcast_work, work);
> > + struct cma_multicast *mc = mw->mc;
> > + struct ib_sa_multicast *m = mc->multicast.ib;
> > +
> > + mc->multicast.ib->context = mc;
> > + cma_ib_mc_handler(0, m);
> > + kref_put(&mc->mcref, release_mc);
> > + kfree(mw);
> > +}
>
> I'm having a hard time working out why the iboe case needs to schedule
> to a work queue here since it's already in process context, right? It
> seems it would be really preferable to avoid all the extra pointer
> munging and reference counting, and just call things directly.
>
I assume that the caller might attempt to acquire the same lock when
calling join and in the callback. Specifically, ucma_join_multicast()
calls rdma_join_multicast() with file->mut acquired and
ucma_event_handler() does the same.
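The locking concern can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: sketch_mutex models a non-recursive mutex by counting the times a holder tries to take it again, the one-slot pending pointer stands in for a work queue, and every sketch_* name is hypothetical rather than part of the CMA or ucma code.

```c
#include <stddef.h>

/* Hypothetical single-threaded model of a non-recursive mutex. */
struct sketch_mutex {
	int held;
	int self_deadlocks;	/* times a holder tried to lock again */
};

static void sketch_lock(struct sketch_mutex *m)
{
	if (m->held)
		m->self_deadlocks++;	/* a real mutex would hang here */
	m->held = 1;
}

static void sketch_unlock(struct sketch_mutex *m)
{
	m->held = 0;
}

struct sketch_file {
	struct sketch_mutex mut;
	void (*pending)(struct sketch_file *);	/* one-slot "work queue" */
	int events;
};

/* Event callback: takes the same lock the join path holds. */
static void sketch_event_handler(struct sketch_file *f)
{
	sketch_lock(&f->mut);
	f->events++;
	sketch_unlock(&f->mut);
}

/* Join path: queue the callback instead of calling it under the lock. */
static void sketch_join(struct sketch_file *f)
{
	sketch_lock(&f->mut);
	f->pending = sketch_event_handler;
	sketch_unlock(&f->mut);
}

/* Worker: runs the queued callback after the join path dropped the lock. */
static void sketch_run_pending(struct sketch_file *f)
{
	void (*fn)(struct sketch_file *) = f->pending;

	f->pending = NULL;
	if (fn)
		fn(f);
}
```

Calling sketch_event_handler() directly from inside sketch_join() would bump self_deadlocks in this model, which corresponds to the join path deadlocking on file->mut in the real code.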
* Re: [PATCHv8 05/11] ib_core: IBoE UD packet packing support
From: Eli Cohen @ 2010-05-13 8:27 UTC (permalink / raw)
To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <adask5wuboz.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Wed, May 12, 2010 at 01:06:36PM -0700, Roland Dreier wrote:
> > 2. Fix wrong implementation of ib_ud_header_init(). A different patch was sent
> > to Roland.
>
> This patch no longer applies, I guess because you already sent me this
> fix (which is now upstream since 2.6.34-rc1).
Right, it's already applied.
* [PATCH 0/3] Least attached vector support
From: Yevgeny Petrilin @ 2010-05-13 9:25 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hello Roland,
These patches were submitted a while ago; I cleaned them up a little and regenerated them against your latest git.
They allow the hw driver to choose which EQ a CQ is attached to, taking the load on its EQs into account.
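The selection policy can be sketched in plain C as below. This is a simplified, hypothetical model (sketch_eq_table, sketch_attach_cq, SKETCH_LEAST_ATTACHED and the fixed vector count are illustrative names and values, not the driver's): a sentinel vector value asks for the least-attached vector, and a per-vector counter tracks the number of attached CQs.

```c
#include <limits.h>

#define SKETCH_LEAST_ATTACHED UINT_MAX	/* sentinel, same role as 0xffffffff */
#define SKETCH_NUM_VECTORS 4

/* Hypothetical per-EQ bookkeeping: number of CQs attached to each vector. */
struct sketch_eq_table {
	int load[SKETCH_NUM_VECTORS];
};

/*
 * Resolve the requested vector, attach one CQ to it, and return the
 * vector actually used, or -1 if the request is out of range.
 * Ties go to the lowest-numbered vector.
 */
static int sketch_attach_cq(struct sketch_eq_table *t, unsigned int vector)
{
	unsigned int i;

	if (vector == SKETCH_LEAST_ATTACHED) {
		vector = 0;
		for (i = 1; i < SKETCH_NUM_VECTORS; i++)
			if (t->load[i] < t->load[vector])
				vector = i;
	} else if (vector >= SKETCH_NUM_VECTORS) {
		return -1;
	}

	t->load[vector]++;
	return (int)vector;
}
```

A matching detach path would decrement the same counter when the CQ is freed, so the counts reflect CQs currently attached rather than CQs ever created.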
Thanks,
Yevgeny
* [PATCH 1/3] ib_core : Default value for automatic completion vector selection
From: Yevgeny Petrilin @ 2010-05-13 9:25 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
By using IB_CQ_VECTOR_LEAST_ATTACHED, we let the hw driver choose
the completion vector that has the fewest CQs already attached to it.
Signed-off-by: Yevgeny Petrilin <yevgenyp-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org>
---
include/rdma/ib_verbs.h | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index a585e0f..79b4d8f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1453,6 +1453,13 @@ static inline int ib_post_recv(struct ib_qp *qp,
return qp->device->post_recv(qp, recv_wr, bad_recv_wr);
}
+/*
+ * IB_CQ_VECTOR_LEAST_ATTACHED: The constant specifies that
+ * the CQ will be attached to the completion vector that has
+ * the least number of CQs already attached to it.
+ */
+#define IB_CQ_VECTOR_LEAST_ATTACHED 0xffffffff
+
/**
* ib_create_cq - Creates a CQ on the specified device.
* @device: The device on which to create the CQ.
@@ -1464,7 +1471,8 @@ static inline int ib_post_recv(struct ib_qp *qp,
* the associated completion and event handlers.
* @cqe: The minimum size of the CQ.
* @comp_vector - Completion vector used to signal completion events.
- * Must be >= 0 and < context->num_comp_vectors.
+ * Must be >= 0 and < context->num_comp_vectors
+ * or IB_CQ_VECTOR_LEAST_ATTACHED.
*
* Users can examine the cq structure to determine the actual CQ size.
*/
--
1.6.1.3
* [PATCH 2/3] mlx4: Default value for automatic completion vector selection
From: Yevgeny Petrilin @ 2010-05-13 9:25 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
When the vector number passed to mlx4_cq_alloc is MLX4_LEAST_ATTACHED_VECTOR (0xffffffff),
the driver selects the completion vector that has the fewest CQs attached
to it and attaches the CQ to the chosen vector.
When the mlx4_ib module receives a CQ allocation request with IB_CQ_VECTOR_LEAST_ATTACHED as the vector number,
it passes it to mlx4_core as MLX4_LEAST_ATTACHED_VECTOR.
Signed-off-by: Yevgeny Petrilin <yevgenyp-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org>
---
drivers/infiniband/hw/mlx4/cq.c | 4 +++-
drivers/net/mlx4/cq.c | 27 +++++++++++++++++++++++----
drivers/net/mlx4/mlx4.h | 1 +
include/linux/mlx4/device.h | 2 ++
4 files changed, 29 insertions(+), 5 deletions(-)
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index cc2ddd2..ac6b866 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -223,7 +223,9 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
}
err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar,
- cq->db.dma, &cq->mcq, vector, 0);
+ cq->db.dma, &cq->mcq,
+ vector == IB_CQ_VECTOR_LEAST_ATTACHED ?
+ MLX4_LEAST_ATTACHED_VECTOR : vector, 0);
if (err)
goto err_dbmap;
diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c
index 7cd34e9..a6f03f9 100644
--- a/drivers/net/mlx4/cq.c
+++ b/drivers/net/mlx4/cq.c
@@ -187,6 +187,22 @@ int mlx4_cq_resize(struct mlx4_dev *dev, struct mlx4_cq *cq,
}
EXPORT_SYMBOL_GPL(mlx4_cq_resize);
+static int mlx4_find_least_loaded_vector(struct mlx4_priv *priv)
+{
+ int i;
+ int index = 0;
+ int min = priv->eq_table.eq[0].load;
+
+ for (i = 1; i < priv->dev.caps.num_comp_vectors; i++) {
+ if (priv->eq_table.eq[i].load < min) {
+ index = i;
+ min = priv->eq_table.eq[i].load;
+ }
+ }
+
+ return index;
+}
+
int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq,
unsigned vector, int collapsed)
@@ -198,10 +214,11 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
u64 mtt_addr;
int err;
- if (vector >= dev->caps.num_comp_vectors)
- return -EINVAL;
+ cq->vector = (vector == MLX4_LEAST_ATTACHED_VECTOR) ?
+ mlx4_find_least_loaded_vector(priv) : vector;
- cq->vector = vector;
+ if (cq->vector >= dev->caps.num_comp_vectors)
+ return -EINVAL;
cq->cqn = mlx4_bitmap_alloc(&cq_table->bitmap);
if (cq->cqn == -1)
@@ -232,7 +249,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
cq_context->flags = cpu_to_be32(!!collapsed << 18);
cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index);
- cq_context->comp_eqn = priv->eq_table.eq[vector].eqn;
+ cq_context->comp_eqn = priv->eq_table.eq[cq->vector].eqn;
cq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT;
mtt_addr = mlx4_mtt_addr(dev, mtt);
@@ -245,6 +262,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt,
if (err)
goto err_radix;
+ priv->eq_table.eq[cq->vector].load++;
cq->cons_index = 0;
cq->arm_sn = 1;
cq->uar = uar;
@@ -282,6 +300,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn);
synchronize_irq(priv->eq_table.eq[cq->vector].irq);
+ priv->eq_table.eq[cq->vector].load--;
spin_lock_irq(&cq_table->lock);
radix_tree_delete(&cq_table->tree, cq->cqn);
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index bc72d6e..969a6a7 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -137,6 +137,7 @@ struct mlx4_eq {
u16 irq;
u16 have_irq;
int nent;
+ int load;
struct mlx4_buf_list *page_list;
struct mlx4_mtt mtt;
};
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index e92d1bf..89ae31d 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -171,6 +171,8 @@ enum {
MLX4_NUM_FEXCH = 64 * 1024,
};
+#define MLX4_LEAST_ATTACHED_VECTOR 0xffffffff
+
static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor)
{
return (major << 32) | (minor << 16) | subminor;
--
1.6.1.3
* [PATCH 3/3] mlx4_en: use MLX4_LEAST_ATTACHED_VECTOR for TX cqs
From: Yevgeny Petrilin @ 2010-05-13 9:25 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Instead of always setting all the TX CQs to use completion vector 0,
let mlx4_core distribute them, to avoid a situation where some EQs
are heavily loaded while others do nothing.
Signed-off-by: Yevgeny Petrilin <yevgenyp-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org>
---
drivers/net/mlx4/en_cq.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/mlx4/en_cq.c b/drivers/net/mlx4/en_cq.c
index 21786ad..f3dc8b7 100644
--- a/drivers/net/mlx4/en_cq.c
+++ b/drivers/net/mlx4/en_cq.c
@@ -56,7 +56,7 @@ int mlx4_en_create_cq(struct mlx4_en_priv *priv,
cq->vector = ring % mdev->dev->caps.num_comp_vectors;
} else {
cq->buf_size = sizeof(struct mlx4_cqe);
- cq->vector = 0;
+ cq->vector = MLX4_LEAST_ATTACHED_VECTOR;
}
cq->ring = ring;
--
1.6.1.3
* [PATCH 0/9] mm: generic adaptive large memory allocation APIs
From: Changli Gao @ 2010-05-13 9:49 UTC (permalink / raw)
To: akpm
Cc: Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier, Sean Hefty,
Hal Rosenstock, Divy Le Ray, James E.J. Bottomley,
Theodore Ts'o, Andreas Dilger, Alexander Viro, Paul Menage,
Li Zefan, linux-rdma, linux-kernel, netdev, linux-scsi,
linux-ext4, linux-fsdevel, linux-mm, containers, Changli Gao
generic adaptive large memory allocation APIs
kv*alloc are used to allocate large contiguous memory and the users don't mind
whether the memory is physically or virtually contiguous. The allocator always
tries its best to allocate physically contiguous memory first.
In this patch set, some APIs are introduced: kvmalloc(), kvzalloc(), kvcalloc(),
kvrealloc(), kvfree() and kvfree_inatomic().
Some code is converted to use the new generic APIs instead.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
drivers/infiniband/hw/ehca/ipz_pt_fn.c | 22 +-----
drivers/net/cxgb3/cxgb3_defs.h | 2
drivers/net/cxgb3/cxgb3_offload.c | 31 ---------
drivers/net/cxgb3/l2t.c | 4 -
drivers/net/cxgb4/cxgb4.h | 3
drivers/net/cxgb4/cxgb4_main.c | 37 +----------
drivers/net/cxgb4/l2t.c | 2
drivers/scsi/cxgb3i/cxgb3i_ddp.c | 12 +--
drivers/scsi/cxgb3i/cxgb3i_ddp.h | 26 -------
drivers/scsi/cxgb3i/cxgb3i_offload.c | 6 -
fs/ext4/super.c | 21 +-----
fs/file.c | 109 ++++-----------------------------
include/linux/mm.h | 31 +++++++++
include/linux/vmalloc.h | 1
kernel/cgroup.c | 47 +-------------
kernel/relay.c | 35 ----------
mm/nommu.c | 6 +
mm/util.c | 104 +++++++++++++++++++++++++++++++
mm/vmalloc.c | 14 ++++
19 files changed, 207 insertions(+), 306 deletions(-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* [PATCH 1/9] mm: add generic adaptive large memory allocation APIs
From: Changli Gao @ 2010-05-13 9:51 UTC (permalink / raw)
To: akpm
Cc: Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier, Sean Hefty,
Hal Rosenstock, Divy Le Ray, James E.J. Bottomley,
Theodore Ts'o, Andreas Dilger, Alexander Viro, Paul Menage,
Li Zefan, linux-rdma, linux-kernel, netdev, linux-scsi,
linux-ext4, linux-fsdevel, linux-mm, containers, Eric Dumazet,
Tetsuo Handa, Peter Zijlstra
generic adaptive large memory allocation APIs
The kv*alloc functions allocate large contiguous memory for users that don't
mind whether the memory is physically or virtually contiguous. The allocator
always tries to allocate physically contiguous memory first.
This patch introduces the following APIs: kvmalloc(), kvzalloc(), kvcalloc(),
kvrealloc(), kvfree() and kvfree_inatomic().
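One detail of the array variant may be worth illustrating. kvcalloc() as posted
in the diff below passes n * size straight to __kvmalloc(), so a very large n
could wrap around and produce an allocation smaller than the caller expects.
The following userspace sketch shows one way to guard the multiplication. It is
an assumption-laden model, not a proposed replacement: checked_array_size() and
model_kvcalloc() are hypothetical names, and calloc() stands in for the kernel
allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: store n * size in *total and return 0, or return -1
 * if the product does not fit in size_t.
 */
static int checked_array_size(size_t n, size_t size, size_t *total)
{
	assert(total != NULL);

	if (size != 0 && n > SIZE_MAX / size)
		return -1;
	*total = n * size;
	return 0;
}

/*
 * Hypothetical model of a zeroing array allocator that refuses requests
 * whose total size would overflow. calloc() replaces the kernel allocator.
 */
static void *model_kvcalloc(size_t n, size_t size)
{
	size_t total;

	if (checked_array_size(n, size, &total) != 0)
		return NULL;
	return calloc(1, total);
}
```

In this model a request such as model_kvcalloc(SIZE_MAX, 2) fails cleanly with
NULL instead of returning a buffer sized by the wrapped product.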
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/linux/mm.h | 31 ++++++++++++++
include/linux/vmalloc.h | 1
mm/nommu.c | 6 ++
mm/util.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++
mm/vmalloc.c | 14 ++++++
5 files changed, 156 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 462acaf..0ece978 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1467,5 +1467,36 @@ extern int soft_offline_page(struct page *page, int flags);
extern void dump_page(struct page *page);
+void *__kvmalloc(size_t size, gfp_t flags);
+
+static inline void *kvmalloc(size_t size)
+{
+ return __kvmalloc(size, 0);
+}
+
+static inline void *kvzalloc(size_t size)
+{
+ return __kvmalloc(size, __GFP_ZERO);
+}
+
+static inline void *kvcalloc(size_t n, size_t size)
+{
+ return __kvmalloc(n * size, __GFP_ZERO);
+}
+
+void __kvfree(void *ptr, bool inatomic);
+
+static inline void kvfree(void *ptr)
+{
+ __kvfree(ptr, false);
+}
+
+static inline void kvfree_inatomic(void *ptr)
+{
+ __kvfree(ptr, true);
+}
+
+void *kvrealloc(void *ptr, size_t newsize);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 227c2a5..33ec828 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -60,6 +60,7 @@ extern void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot);
extern void *__vmalloc_area(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot);
extern void vfree(const void *addr);
+extern unsigned long vsize(const void *addr);
extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
diff --git a/mm/nommu.c b/mm/nommu.c
index 63fa17d..1ddf3fe 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -223,6 +223,12 @@ void vfree(const void *addr)
}
EXPORT_SYMBOL(vfree);
+unsigned long vsize(const void *addr)
+{
+ return ksize(addr);
+}
+EXPORT_SYMBOL(vsize);
+
void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot)
{
/*
diff --git a/mm/util.c b/mm/util.c
index f5712e8..7cc364a 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -5,6 +5,7 @@
#include <linux/err.h>
#include <linux/sched.h>
#include <asm/uaccess.h>
+#include <linux/vmalloc.h>
#define CREATE_TRACE_POINTS
#include <trace/events/kmem.h>
@@ -289,6 +290,109 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start,
}
EXPORT_SYMBOL_GPL(get_user_pages_fast);
+void *__kvmalloc(size_t size, gfp_t flags)
+{
+ void *ptr;
+
+ if (size < PAGE_SIZE)
+ return kmalloc(size, GFP_KERNEL | flags);
+ size = PAGE_ALIGN(size);
+ if (is_power_of_2(size))
+ ptr = (void *)__get_free_pages(GFP_KERNEL | flags |
+ __GFP_NOWARN, get_order(size));
+ else
+ ptr = alloc_pages_exact(size, GFP_KERNEL | flags |
+ __GFP_NOWARN);
+ if (ptr != NULL) {
+ virt_to_head_page(ptr)->private = size;
+ return ptr;
+ }
+
+ ptr = vmalloc(size);
+ if (ptr != NULL && (flags & __GFP_ZERO))
+ memset(ptr, 0, size);
+
+ return ptr;
+}
+EXPORT_SYMBOL(__kvmalloc);
+
+static void kvfree_work(struct work_struct *work)
+{
+ vfree(work);
+}
+
+void __kvfree(void *ptr, bool inatomic)
+{
+ if (unlikely(ZERO_OR_NULL_PTR(ptr)))
+ return;
+ if (is_vmalloc_addr(ptr)) {
+ if (inatomic) {
+ struct work_struct *work;
+
+ work = ptr;
+ BUILD_BUG_ON(sizeof(struct work_struct) > PAGE_SIZE);
+ INIT_WORK(work, kvfree_work);
+ schedule_work(work);
+ } else {
+ vfree(ptr);
+ }
+ } else {
+ struct page *page;
+
+ page = virt_to_head_page(ptr);
+ if (PageSlab(page) || PageCompound(page))
+ kfree(ptr);
+ else if (is_power_of_2(page->private))
+ free_pages((unsigned long)ptr,
+ get_order(page->private));
+ else
+ free_pages_exact(ptr, page->private);
+ }
+}
+EXPORT_SYMBOL(__kvfree);
+
+void *kvrealloc(void *ptr, size_t newsize)
+{
+ void *nptr;
+ size_t oldsize;
+
+ if (unlikely(!newsize)) {
+ kvfree(ptr);
+ return ZERO_SIZE_PTR;
+ }
+
+ if (unlikely(ZERO_OR_NULL_PTR(ptr)))
+ return kvmalloc(newsize);
+
+ if (is_vmalloc_addr(ptr)) {
+ oldsize = vsize(ptr);
+ if (newsize <= oldsize)
+ return ptr;
+ } else {
+ struct page *page;
+
+ page = virt_to_head_page(ptr);
+ if (PageSlab(page) || PageCompound(page)) {
+ if (newsize < PAGE_SIZE)
+ return krealloc(ptr, newsize, GFP_KERNEL);
+ oldsize = ksize(ptr);
+ } else {
+ oldsize = page->private;
+ if (newsize <= oldsize)
+ return ptr;
+ }
+ }
+
+ nptr = kvmalloc(newsize);
+ if (nptr != NULL) {
+ memcpy(nptr, ptr, oldsize);
+ kvfree(ptr);
+ }
+
+ return nptr;
+}
+EXPORT_SYMBOL(kvrealloc);
+
/* Tracepoints definitions. */
EXPORT_TRACEPOINT_SYMBOL(kmalloc);
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ae00746..93552a8 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1413,6 +1413,20 @@ void vfree(const void *addr)
EXPORT_SYMBOL(vfree);
/**
+ * vsize - get the actual amount of memory allocated by vmalloc()
+ * @addr: memory base address
+ */
+unsigned long vsize(const void *addr)
+{
+ struct vmap_area *va;
+
+ va = find_vmap_area((unsigned long)addr);
+
+ return va->va_end - va->va_start - PAGE_SIZE;
+}
+EXPORT_SYMBOL(vsize);
+
+/**
* vunmap - release virtual mapping obtained by vmap()
* @addr: memory base address
*
^ permalink raw reply related
* Re: [PATCHv8 08/11] mlx4: Allow interfaces to correspond to each other
From: Eli Cohen @ 2010-05-13 11:13 UTC (permalink / raw)
To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <adafx1wuald.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Wed, May 12, 2010 at 01:30:22PM -0700, Roland Dreier wrote:
> > +void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
> > +{
> > + return mlx4_find_get_prot_dev(dev, proto, port);
> > +}
> > +EXPORT_SYMBOL(mlx4_get_prot_dev);
>
> Not sure I understand why you have a wrapper to call another function
> with exactly the same parameters? Can't we get rid of this and just
> rename mlx4_find_get_prot_dev() to mlx4_get_prot_dev()?
Sure, let's change that.
^ permalink raw reply