Linux RDMA and InfiniBand development

Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed

* Re: [PATCH 1/9] mm: add generic adaptive large memory allocation APIs
From: Changli Gao @ 2010-05-14  8:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier,
	Sean Hefty, Hal Rosenstock, Divy Le Ray, James E.J. Bottomley,
	Theodore Ts'o, Andreas Dilger, Alexander Viro, Paul Menage,
	Li Zefan, linux-rdma, linux-kernel, netdev, linux-scsi,
	linux-ext4, linux-fsdevel, linux-mm, containers, Eric Dumazet,
	Tetsuo Handa
In-Reply-To: <1273824214.5605.3625.camel@twins>

On Fri, May 14, 2010 at 4:03 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-05-13 at 22:08 +0800, Changli Gao wrote:
>> > NAK, I really utterly dislike that inatomic argument. The alloc side
>> > doesn't function in atomic context either. Please keep the thing
>> > symmetric in that regards.
>> >
>>
>> There are some users, who release memory in atomic context. for
>> example: fs/file.c: fdmem.
>
> urgh, but yeah, aside from not using vmalloc to allocate fd tables one
> needs to deal with this.
>
> But if that is the only one, I'd let them do the workqueue thing that's
> already there. If there really are more people wanting to do this, then
> maybe add: kvfree_atomic().
>

Tetsuo has pointed another one in apparmor.
http://kernel.ubuntu.com/git?p=jj/ubuntu-lucid.git;a=blobdiff;f=security/apparmor/match.c;h=d2cd55419acfcae85cb748c8f837a4384a3a0d29;hp=afc2dd2260edffcf88521ae86458ad03aa8ea12c;hb=f5eba4b0a01cc671affa429ba1512b6de7caeb5b;hpb=abdff9ddaf2644d0f9962490f73e030806ba90d3
, though apparmor hasn't been merged into mainline.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 1/9] mm: add generic adaptive large memory allocation APIs
From: Peter Zijlstra @ 2010-05-14  8:03 UTC (permalink / raw)
  To: Changli Gao
  Cc: akpm, Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier,
	Sean Hefty, Hal Rosenstock, Divy Le Ray, James E.J. Bottomley,
	Theodore Ts'o, Andreas Dilger, Alexander Viro, Paul Menage,
	Li Zefan, linux-rdma, linux-kernel, netdev, linux-scsi,
	linux-ext4, linux-fsdevel, linux-mm, containers, Eric Dumazet,
	Tetsuo Handa
In-Reply-To: <AANLkTinLT5g5SKjqmQlS2kxvvMq1gsi1jPDgOKTnrT-q@mail.gmail.com>

On Thu, 2010-05-13 at 22:08 +0800, Changli Gao wrote:
> > NAK, I really utterly dislike that inatomic argument. The alloc side
> > doesn't function in atomic context either. Please keep the thing
> > symmetric in that regards.
> >
> 
> There are some users, who release memory in atomic context. for
> example: fs/file.c: fdmem. 

urgh, but yeah, aside from not using vmalloc to allocate fd tables one
needs to deal with this.

But if that is the only one, I'd let them do the workqueue thing that's
already there. If there really are more people wanting to do this, then
maybe add: kvfree_atomic().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] libibverbs: Add huge page support to ibv_madvise_range()
From: Pradeep Satyanarayana @ 2010-05-14  0:04 UTC (permalink / raw)
  To: Roland Dreier
  Cc: alexv-smomgflXvOZWk0Htik3J/w, roland,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, alexr-smomgflXvOZWk0Htik3J/w,
	monis-smomgflXvOZWk0Htik3J/w, ogerlitz-smomgflXvOZWk0Htik3J/w
In-Reply-To: <adapr0zsswc.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

Roland Dreier wrote:
>  > Maybe we should use the /proc/pid/pagemap and /proc/kpageflags files?
>  > These files let a userspace process find out which physical frame
>  > each virtual page is mapped to and to get the properties of each page
>  > frame including the page size.
> 
> Sounds interesting... the only problem is that /proc/kpageflags is
> root-only, so we can't use that.  But does /proc/pid/pagemap give enough
> information to be useful?

The most useful information that /proc/pid/pagemap provides the PFN (if the
page is not swapped out). In fact there is very useful tool called 
Documentation/vm/page-types.c which gives a lot of the information that we 
need. Again the caveat is that it uses /proc/kpageflags.

/proc/pid/pagemap also provided "page shift". I am not sure what that is
in the context of huge pages.
> 
>  > The disadvantage of this method is, that it applicable on kernel >=
>  > 2.6.25, since this interface is supported starting from the kernel
>  > 2.6.25.
> 
> I think that's OK

Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/9] mm: generic adaptive large memory allocation APIs
From: Andreas Dilger @ 2010-05-13 21:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Changli Gao, Andrew Morton, Hoang-Nam Nguyen, Christoph Raisch,
	Roland Dreier, Sean Hefty, Hal Rosenstock, Divy Le Ray,
	Theodore Ts'o, Alexander Viro, Paul Menage, Li Zefan,
	linux-rdma, linux-kernel@vger.kernel.org Mailinglist, netdev,
	linux-scsi, linux-ext4@vger.kernel.org development, linux-fsdevel,
	linux-mm, containers
In-Reply-To: <1273763055.4353.136.camel@mulgrave.site>

On 2010-05-13, at 09:04, James Bottomley wrote:
> This isn't necessarily true ... most drivers and filesystems have to
> know what type they're getting.  Often they have to do extra tricks to
> process vmalloc areas.  Conversely, large kmalloc areas are a very
> precious commodity: if a driver or filesystem can handle vmalloc for
> large allocations, it should: it's easier for us to expand the vmalloc
> area than to try to make page reclaim keep large contiguous areas ... I
> notice your proposed API does the exact opposite of this ... tries
> kmalloc first and then does vmalloc.
> 
> Given this policy problem, isn't it easier simply to hand craft the
> vmalloc fall back to kmalloc (or vice versa) in the driver than add this
> whole massive raft of APIs for it?

I know we wouldn't mind using large vmalloc allocations for e.g. per-group arrays in ext4 (allocated once per mount), but I'd always understood that using vmalloc for general purpose uses can have a significant impact because the vmalloc() engine has (had?) serious performance problems.  That means it is better performance-wise to have a wrapper function like this to switch between kmalloc() and vmalloc() based on the allocation size, but it makes the code ugly.  Having the wrapper in the kernel would at least identify the different places that are using this kind of workaround.

If the performance of vmalloc() has been improved in the last few years, then I'd be happy to just use vmalloc() all the time.  That said, vmalloc still isn't suitable for sub-page allocations, so if you have a variable-sized allocation that may be very small or very large the small allocations will waste a whole page and a wrapper is still needed, or vmalloc should be changed to call kmalloc/kfree for the sub-page allocations.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
From: Roland Dreier @ 2010-05-13 19:18 UTC (permalink / raw)
  To: Sean Hefty; +Cc: Linux RDMA list, Eli Cohen, ewg
In-Reply-To: <A2C75A9348C942C3BAB7A2F0FF1B6CAB-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>

 > Basically, what I want to understand is why does this change make sense?
 > 
 > @@ -1139,6 +1139,10 @@ struct ib_device {
 >  						  struct ib_grh *in_grh,
 >  						  struct ib_mad *in_mad,
 >  						  struct ib_mad *out_mad);
 > +	int			   (*get_eth_l2_addr)(struct ib_device *device,
 > u8 port,
 > +						      union ib_gid *dgid, int
 > sgid_idx,
 > +						      u8 *mac, u16 *vlan_id, u8
 > *tagged);
 > +

Yes, that was pretty much my original question.  Why do we have a verb
for userspace to call a device-specific method to do the mapping?  The
layering seems wrong somewhere if we have a generic verb to do this
mapping, but then put the mapping in device-specific code.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply

* RE: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
From: Sean Hefty @ 2010-05-13 19:11 UTC (permalink / raw)
  To: 'Eli Cohen'; +Cc: Roland Dreier, Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <20100513165447.GB19438-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>

>In this case it will not be just mapping remote addresses but creating
>the required AH information which is unique to each device.

I understand that the AH is per device.  What I don't get is why we would want
each device to perform the mapping.  We don't expect the device to map GIDs to
LIDs when creating an AH.  The user must have done the mapping beforehand.  The
mapping does not depend on which device I'm using, it should be the same.

Basically, what I want to understand is why does this change make sense?

@@ -1139,6 +1139,10 @@ struct ib_device {
 						  struct ib_grh *in_grh,
 						  struct ib_mad *in_mad,
 						  struct ib_mad *out_mad);
+	int			   (*get_eth_l2_addr)(struct ib_device *device,
u8 port,
+						      union ib_gid *dgid, int
sgid_idx,
+						      u8 *mac, u16 *vlan_id, u8
*tagged);
+

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv8 05/11] ib_core: IBoE UD packet packing support
From: Roland Dreier @ 2010-05-13 17:36 UTC (permalink / raw)
  To: Eli Cohen; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <20100513082718.GD16073-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>

 > Right, it's already applied.

Doesn't seem to be all there... eg there is nothing for ethernet headers.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
From: Roland Dreier @ 2010-05-13 17:35 UTC (permalink / raw)
  To: Eli Cohen; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <20100513165201.GA19438-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>

 > One reason is that get_src_path_mask() may access ah_lock.

I don't see that:

static u8 get_src_path_mask(struct ib_device *device, u8 port_num)
{
	struct ib_sa_device *sa_dev;
	struct ib_sa_port   *port;
	unsigned long flags;
	u8 src_path_mask;

	sa_dev = ib_get_client_data(device, &sa_client);
	if (!sa_dev)
		return 0x7f;

so if we never add the sa_client structure it just returns.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
From: Eli Cohen @ 2010-05-13 16:54 UTC (permalink / raw)
  To: Sean Hefty; +Cc: Roland Dreier, Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <8172EB57F9DF4F93BBFAD583235E8C3F-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>

On Thu, May 13, 2010 at 08:55:13AM -0700, Sean Hefty wrote:
> >
> >Interesting idea. Not sure what is better here: a seperate ABI
> >call or some additional 'u8 ctx[32]' field in struct
> >ib_uverbs_create_ah_resp that will be interpreted by the hw-specific
> >user-space driver.
> 
> I don't understand why mapping remote addresses should be driver specific.

In this case it will not be just mapping remote addresses but creating
the required AH information which is unique to each device.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
From: Eli Cohen @ 2010-05-13 16:52 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <adatyqbsszt.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Thu, May 13, 2010 at 08:48:06AM -0700, Roland Dreier wrote:
> 
> Right, I'm not saying it shouldn't be in multicast.c.  I'm just
> wondering why you don't have the same thing for sa_query.c.
> 
One reason is that get_src_path_mask() may access ah_lock.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
From: Sean Hefty @ 2010-05-13 15:55 UTC (permalink / raw)
  To: 'Eli Cohen', Roland Dreier; +Cc: Linux RDMA list, Eli Cohen, ewg
In-Reply-To: <20100513135645.GL16073-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>

>Address mapping policies depend on the address type. This patch only
>defines a policy for mapping link-local addresses, and we should
>indeed take care not to change it (if possible).
>Later on, we can add more policies for other address types (e.g.,
>normal IPv6 addresses, mapped IPv4 addresses, etc.)
>>
>> Or maybe it is device-specific, and we could wrap it up into the create
>> AH uverbs call we already have?
>
>Interesting idea. Not sure what is better here: a seperate ABI
>call or some additional 'u8 ctx[32]' field in struct
>ib_uverbs_create_ah_resp that will be interpreted by the hw-specific
>user-space driver.

I don't understand why mapping remote addresses should be driver specific.

^ permalink raw reply

* Re: [PATCH] libibverbs: Add huge page support to ibv_madvise_range()
From: Roland Dreier @ 2010-05-13 15:50 UTC (permalink / raw)
  To: alexv-smomgflXvOZWk0Htik3J/w
  Cc: roland, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	alexr-smomgflXvOZWk0Htik3J/w, monis-smomgflXvOZWk0Htik3J/w,
	ogerlitz-smomgflXvOZWk0Htik3J/w
In-Reply-To: <4BEC06DB.30505-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

 > Maybe we should use the /proc/pid/pagemap and /proc/kpageflags files?
 > These files let a userspace process find out which physical frame
 > each virtual page is mapped to and to get the properties of each page
 > frame including the page size.

Sounds interesting... the only problem is that /proc/kpageflags is
root-only, so we can't use that.  But does /proc/pid/pagemap give enough
information to be useful?

 > The disadvantage of this method is, that it applicable on kernel >=
 > 2.6.25, since this interface is supported starting from the kernel
 > 2.6.25.

I think that's OK
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
From: Roland Dreier @ 2010-05-13 15:48 UTC (permalink / raw)
  To: Eli Cohen; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <20100513065917.GA16073-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>

 > > Also I'm wondering why you did the "count" stuff to ignore IBoE-only
 > > devices in multicast.c but not sa_query.c?

 > It seems to me the right place to put this logic as the mutlicast code
 > registers as an IB client and the add_one implemntation is
 > multicast.c.

Right, I'm not saying it shouldn't be in multicast.c.  I'm just
wondering why you don't have the same thing for sa_query.c.

 - R.
-- 
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/9] mm: generic adaptive large memory allocation APIs
From: James Bottomley @ 2010-05-13 15:04 UTC (permalink / raw)
  To: Changli Gao
  Cc: akpm, Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier,
	Sean Hefty, Hal Rosenstock, Divy Le Ray, Theodore Ts'o,
	Andreas Dilger, Alexander Viro, Paul Menage, Li Zefan, linux-rdma,
	linux-kernel, netdev, linux-scsi, linux-ext4, linux-fsdevel,
	linux-mm, containers
In-Reply-To: <1273744147-7594-1-git-send-email-xiaosuo@gmail.com>

On Thu, 2010-05-13 at 17:49 +0800, Changli Gao wrote:
> generic adaptive large memory allocation APIs
> 
> kv*alloc are used to allocate large contiguous memory and the users don't mind
> whether the memory is physically or virtually contiguous. The allocator always
> try its best to allocate physically contiguous memory first.

This isn't necessarily true ... most drivers and filesystems have to
know what type they're getting.  Often they have to do extra tricks to
process vmalloc areas.  Conversely, large kmalloc areas are a very
precious commodity: if a driver or filesystem can handle vmalloc for
large allocations, it should: it's easier for us to expand the vmalloc
area than to try to make page reclaim keep large contiguous areas ... I
notice your proposed API does the exact opposite of this ... tries
kmalloc first and then does vmalloc.

Given this policy problem, isn't it easier simply to hand craft the
vmalloc fall back to kmalloc (or vice versa) in the driver than add this
whole massive raft of APIs for it?

> In this patch set, some APIs are introduced: kvmalloc(), kvzalloc(), kvcalloc(),
> kvrealloc(), kvfree() and kvfree_inatomic().
> 
> Some code are converted to use the new generic APIs instead.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ----
>  drivers/infiniband/hw/ehca/ipz_pt_fn.c |   22 +-----
>  drivers/net/cxgb3/cxgb3_defs.h         |    2 
>  drivers/net/cxgb3/cxgb3_offload.c      |   31 ---------
>  drivers/net/cxgb3/l2t.c                |    4 -
>  drivers/net/cxgb4/cxgb4.h              |    3 
>  drivers/net/cxgb4/cxgb4_main.c         |   37 +----------
>  drivers/net/cxgb4/l2t.c                |    2 
>  drivers/scsi/cxgb3i/cxgb3i_ddp.c       |   12 +--
>  drivers/scsi/cxgb3i/cxgb3i_ddp.h       |   26 -------
>  drivers/scsi/cxgb3i/cxgb3i_offload.c   |    6 -
>  fs/ext4/super.c                        |   21 +-----
>  fs/file.c                              |  109 ++++-----------------------------
>  include/linux/mm.h                     |   31 +++++++++
>  include/linux/vmalloc.h                |    1 
>  kernel/cgroup.c                        |   47 +-------------
>  kernel/relay.c                         |   35 ----------
>  mm/nommu.c                             |    6 +
>  mm/util.c                              |  104 +++++++++++++++++++++++++++++++
>  mm/vmalloc.c                           |   14 ++++
>  19 files changed, 207 insertions(+), 306 deletions(-)

James


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 1/9] mm: add generic adaptive large memory allocation APIs
From: Changli Gao @ 2010-05-13 14:49 UTC (permalink / raw)
  To: Milton Miller
  Cc: akpm, Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier,
	Sean Hefty, Hal Rosenstock, Divy Le Ray, James E.J. Bottomley,
	Theodore Ts'o, Andreas Dilger, Alexander Viro, Paul Menage,
	Li Zefan, linux-rdma, linux-kernel, netdev, linux-scsi,
	linux-ext4, linux-fsdevel, linux-mm, containers, Eric Dumazet,
	Tetsuo Handa, Peter Zijlstra
In-Reply-To: <1273761576_4060@mail4.comsite.net>

On Thu, May 13, 2010 at 10:39 PM, Milton Miller <miltonm@bga.com> wrote:
> On Thu, 13 May 2010 at 17:51:25 +0800, Changli Gao wrote:
>
>> +static inline void *kvcalloc(size_t n, size_t size)
>> +{
>> +     return __kvmalloc(n * size, __GFP_ZERO);
>>
>
> This needs multiply overflow checking like kcalloc.
>

Thanks.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 1/9] mm: add generic adaptive large memory allocation APIs
From: Milton Miller @ 2010-05-13 14:39 UTC (permalink / raw)
  Cc: akpm, Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier,
	Sean Hefty, Hal Rosenstock, Divy Le Ray, James E.J. Bottomley,
	Theodore Ts'o, Andreas Dilger, Alexander Viro, Paul Menage,
	Li Zefan, linux-rdma, linux-kernel, netdev, linux-scsi,
	linux-ext4, linux-fsdevel, linux-mm, containers, Eric Dumazet,
	Tetsuo Handa, Peter Zijlstra, Changli Gao
In-Reply-To: <1273744285-8128-1-git-send-email-xiaosuo@gmail.com>

On Thu, 13 May 2010 at 17:51:25 +0800, Changli Gao wrote:

> +static inline void *kvcalloc(size_t n, size_t size)
> +{
> +	return __kvmalloc(n * size, __GFP_ZERO);
> 

This needs multiply overflow checking like kcalloc.

milton

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 1/9] mm: add generic adaptive large memory allocation APIs
From: Changli Gao @ 2010-05-13 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm, Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier,
	Sean Hefty, Hal Rosenstock, Divy Le Ray, James E.J. Bottomley,
	Theodore Ts'o, Andreas Dilger, Alexander Viro, Paul Menage,
	Li Zefan, linux-rdma, linux-kernel, netdev, linux-scsi,
	linux-ext4, linux-fsdevel, linux-mm, containers, Eric Dumazet,
	Tetsuo Handa
In-Reply-To: <1273756816.5605.3547.camel@twins>

On Thu, May 13, 2010 at 9:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-05-13 at 17:51 +0800, Changli Gao wrote:
>> +void *__kvmalloc(size_t size, gfp_t flags)
>> +{
>> +       void *ptr;
>> +
>> +       if (size < PAGE_SIZE)
>> +               return kmalloc(size, GFP_KERNEL | flags);
>> +       size = PAGE_ALIGN(size);
>> +       if (is_power_of_2(size))
>> +               ptr = (void *)__get_free_pages(GFP_KERNEL | flags |
>> +                                              __GFP_NOWARN, get_order(size));
>> +       else
>> +               ptr = alloc_pages_exact(size, GFP_KERNEL | flags |
>> +                                             __GFP_NOWARN);
>> +       if (ptr != NULL) {
>> +               virt_to_head_page(ptr)->private = size;
>> +               return ptr;
>> +       }
>> +
>> +       ptr = vmalloc(size);
>> +       if (ptr != NULL && (flags & __GFP_ZERO))
>> +               memset(ptr, 0, size);
>> +
>> +       return ptr;
>> +}
>> +EXPORT_SYMBOL(__kvmalloc);
>
> So if I do kvmalloc(size, GFP_ATOMIC) I get GFP_KERNEL|GFP_ATOMIC, which
> is not a recommended variation because one should not mix __GFP_WAIT and
> __GFP_HIGH.

__kvmalloc() is only for internal use(kvmalloc, kvcalloc, and
kvzalloc), and the only value of flags is __GFP_ZERO. How about
replacing flags with a bool variable zero?

void *__kvmalloc(size_t size, bool zero);

 Or check the value of flags in the front of __kvmalloc().

BUG_ON((flags & (~__GFP_ZERO)) != 0);

>
> So I would simply drop the gfp argument to avoid confusion.
>
>> +void __kvfree(void *ptr, bool inatomic)
>> +{
>> +       if (unlikely(ZERO_OR_NULL_PTR(ptr)))
>> +               return;
>> +       if (is_vmalloc_addr(ptr)) {
>> +               if (inatomic) {
>> +                       struct work_struct *work;
>> +
>> +                       work = ptr;
>> +                       BUILD_BUG_ON(sizeof(struct work_struct) > PAGE_SIZE);
>> +                       INIT_WORK(work, kvfree_work);
>> +                       schedule_work(work);
>> +               } else {
>> +                       vfree(ptr);
>> +               }
>> +       } else {
>> +               struct page *page;
>> +
>> +               page = virt_to_head_page(ptr);
>> +               if (PageSlab(page) || PageCompound(page))
>> +                       kfree(ptr);
>> +               else if (is_power_of_2(page->private))
>> +                       free_pages((unsigned long)ptr,
>> +                                  get_order(page->private));
>> +               else
>> +                       free_pages_exact(ptr, page->private);
>> +       }
>> +}
>> +EXPORT_SYMBOL(__kvfree);
>
> NAK, I really utterly dislike that inatomic argument. The alloc side
> doesn't function in atomic context either. Please keep the thing
> symmetric in that regards.
>

There are some users, who release memory in atomic context. for
example: fs/file.c: fdmem.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH] libibverbs: Add huge page support to ibv_madvise_range()
From: Alex Vainman @ 2010-05-13 14:04 UTC (permalink / raw)
  To: Roland Dreier
  Cc: alexv-smomgflXvOZWk0Htik3J/w, roland,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, alexr-smomgflXvOZWk0Htik3J/w,
	monis-smomgflXvOZWk0Htik3J/w, ogerlitz-smomgflXvOZWk0Htik3J/w
In-Reply-To: <adazl0c92kp.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

Roland Dreier Wrote:
>  > In order to avoid adding additional dependency to libibverbs, maybe
>  > we should just to enhance the get_huge_page_size() so it will support
>  > multiple huge page sizes?
> 
> I think that does make sense.  However see my reply to Alexander
> Schmidt -- maybe there is something clever we can do with /proc/*/maps
> to figure out when we need to use this?
> 
>  - R.

Maybe we should use the /proc/pid/pagemap and /proc/kpageflags files? These files let a userspace process find out which physical frame each virtual page is mapped to and to get the properties of each page frame including the page size.
The disadvantage of this method is, that it applicable on kernel >= 2.6.25, since this interface is supported starting from the kernel 2.6.25.

-Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv8 07/11] ib_core: Add API to support IBoE from userspace
From: Eli Cohen @ 2010-05-13 13:56 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <adak4r8uaog.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Wed, May 12, 2010 at 01:28:31PM -0700, Roland Dreier wrote:
>  > Add ib_uverbs_get_eth_l2_addr() to allow ibv_create_ah() to resolve <sgid,
>  > dgid> to <vlan, dmac> for any gid type.  Although user-space might bypass this
>  > call for link-local gids, it is better not to replicate the kernel resolution
>  > policy.  Port link layer is also returned by ibv_query_port().
> 
> A high-level comment/question, followed by some notes about the specific patch.
> 
> At the highest level, is having this very low-level command exposed as
> part of the kernel uverbs <-> userspace API the right place to split
> things?  Making the Ethernet address resolution part of the low-level
> driver implies that it's not really a generic part of the verbs interface.

Correct - Ethernet address resolution really isn't part of the verbs
interface accoring to the RoCE specification; it should happen below
the verbs.

> 
> Maybe it is generic, and we should have a generic function instead of
> calling into the low-level driver.  I see the argument that we should
> keep the policy in the kernel, although I'm not sure how strong that
> argument is -- once we start shipping a kernel with a certain policy
> (and I guess OFED has in effect already done that), how could we ever
> change that policy?  We'll have interoperability issues anyway, so it
> seems having userspace and kernel use different policies doesn't cause
> much further problems anyway.

Address mapping policies depend on the address type. This patch only
defines a policy for mapping link-local addresses, and we should
indeed take care not to change it (if possible).
Later on, we can add more policies for other address types (e.g.,
normal IPv6 addresses, mapped IPv4 addresses, etc.)
> 
> Or maybe it is device-specific, and we could wrap it up into the create
> AH uverbs call we already have?

Interesting idea. Not sure what is better here: a seperate ABI
call or some additional 'u8 ctx[32]' field in struct
ib_uverbs_create_ah_resp that will be interpreted by the hw-specific
user-space driver.

> 
> Low-level comments:
> 
>  > +ssize_t ib_uverbs_get_eth_l2_addr(struct ib_uverbs_file *file, const char __user *buf,
>  > +				  int in_len, int out_len)
>  > +{
>  > +	struct ib_uverbs_get_eth_l2_addr       cmd;
>  > +	struct ib_uverbs_get_eth_l2_addr_resp  resp;
>  > +	int              ret;
>  > +	struct ib_pd    *pd;
>  > +
>  > +	if (out_len < sizeof resp)
>  > +		return -ENOSPC;
>  > +
>  > +	if (copy_from_user(&cmd, buf, sizeof cmd))
>  > +		return -EFAULT;
>  > +
>  > +	pd = idr_read_pd(cmd.pd_handle, file->ucontext);
>  > +	if (!pd)
>  > +		return -EINVAL;
>  > +
>  > +	ret = ib_get_eth_l2_addr(pd->device, cmd.port, (union ib_gid *)cmd.gid,
>  > +				 cmd.sgid_idx, resp.mac, &resp.vlan_id, &resp.tagged);
>  > +	put_pd_read(pd);
>  > +	if (!ret) {
>  > +		if (copy_to_user((void __user *) (unsigned long) cmd.response,
>  > +				 &resp, sizeof resp))
>  > +			return -EFAULT;
> 
> This leaks kernel memory contents to userspace since the stack variable
> resp is never cleared.  Also will cause problems if we ever need to use
> the reserved fields for anything.

I see, I missed that subtle detail.


> 
> Also I'm not sure I understand why you pass the PD into this method?  It
> seems you just use it to get a pointer to the device, but you already
> have that from the uverbs_file structure that's passed into all commands.

True, missed that too.

> 
>  > +int ib_get_eth_l2_addr(struct ib_device *device, u8 port, union ib_gid *gid,
>  > +		       int sgid_idx, u8 *mac, __u16 *vlan_id, u8 *tagged)
>  > +{
>  > +	if (!device->get_eth_l2_addr)
>  > +		return -ENOSYS;
>  > +
>  > +	return device->get_eth_l2_addr(device, port, gid, sgid_idx, mac, vlan_id, tagged);
>  > +}
>  > +EXPORT_SYMBOL(ib_get_eth_l2_addr);
> 
> I don't think we need this wrapper, since uverbs can just call the
> get_eth_l2_addr method directly (we already do that for quite a few
> other methods, eg alloc_ucontext is a uverbs-only method that has no
> in-kernel wrapper).  Also the -ENOSYS test isn't necessary, since a
> device-driver shouldn't allow this method unless it actually implements
> it.
> 
I agree.

By the way, do you prefer that I re-create the patch with the fixes or
would you make the necessary changes.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/9] mm: add generic adaptive large memory allocationAPIs
From: Tetsuo Handa @ 2010-05-13 13:36 UTC (permalink / raw)
  To: peterz, xiaosuo
  Cc: akpm, hnguyen, raisch, rolandd, sean.hefty, hal.rosenstock, divy,
	James.Bottomley, tytso, adilger, viro, menage, lizf, linux-rdma,
	linux-kernel, netdev, linux-scsi, linux-ext4, linux-fsdevel,
	linux-mm, containers, eric.dumazet
In-Reply-To: <1273756816.5605.3547.camel@twins>

Peter Zijlstra wrote:
> NAK, I really utterly dislike that inatomic argument. The alloc side
> doesn't function in atomic context either. Please keep the thing
> symmetric in that regards.

Excuse me. kmalloc(GFP_KERNEL) may sleep (and therefore cannot be used in
atomic context). However, kfree() for memory allocated with kmalloc(GFP_KERNEL)
never sleep (and therefore can be used in atomic context).
Why kmalloc() and kfree() are NOT kept symmetric?

^ permalink raw reply

* Re: [PATCH 1/9] mm: add generic adaptive large memory allocation APIs
From: Peter Zijlstra @ 2010-05-13 13:20 UTC (permalink / raw)
  To: Changli Gao
  Cc: akpm, Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier,
	Sean Hefty, Hal Rosenstock, Divy Le Ray, James E.J. Bottomley,
	Theodore Ts'o, Andreas Dilger, Alexander Viro, Paul Menage,
	Li Zefan, linux-rdma, linux-kernel, netdev, linux-scsi,
	linux-ext4, linux-fsdevel, linux-mm, containers, Eric Dumazet,
	Tetsuo Handa
In-Reply-To: <1273744285-8128-1-git-send-email-xiaosuo@gmail.com>

On Thu, 2010-05-13 at 17:51 +0800, Changli Gao wrote:
> +void *__kvmalloc(size_t size, gfp_t flags)
> +{
> +       void *ptr;
> +
> +       if (size < PAGE_SIZE)
> +               return kmalloc(size, GFP_KERNEL | flags);
> +       size = PAGE_ALIGN(size);
> +       if (is_power_of_2(size))
> +               ptr = (void *)__get_free_pages(GFP_KERNEL | flags |
> +                                              __GFP_NOWARN, get_order(size));
> +       else
> +               ptr = alloc_pages_exact(size, GFP_KERNEL | flags |
> +                                             __GFP_NOWARN);
> +       if (ptr != NULL) {
> +               virt_to_head_page(ptr)->private = size;
> +               return ptr;
> +       }
> +
> +       ptr = vmalloc(size);
> +       if (ptr != NULL && (flags & __GFP_ZERO))
> +               memset(ptr, 0, size);
> +
> +       return ptr;
> +}
> +EXPORT_SYMBOL(__kvmalloc);

So if I do kvmalloc(size, GFP_ATOMIC) I get GFP_KERNEL|GFP_ATOMIC, which
is not a recommended variation because one should not mix __GFP_WAIT and
__GFP_HIGH.

So I would simply drop the gfp argument to avoid confusion.

> +void __kvfree(void *ptr, bool inatomic)
> +{
> +       if (unlikely(ZERO_OR_NULL_PTR(ptr)))
> +               return;
> +       if (is_vmalloc_addr(ptr)) {
> +               if (inatomic) {
> +                       struct work_struct *work;
> +
> +                       work = ptr;
> +                       BUILD_BUG_ON(sizeof(struct work_struct) > PAGE_SIZE);
> +                       INIT_WORK(work, kvfree_work);
> +                       schedule_work(work);
> +               } else {
> +                       vfree(ptr);
> +               }
> +       } else {
> +               struct page *page;
> +
> +               page = virt_to_head_page(ptr);
> +               if (PageSlab(page) || PageCompound(page))
> +                       kfree(ptr);
> +               else if (is_power_of_2(page->private))
> +                       free_pages((unsigned long)ptr,
> +                                  get_order(page->private));
> +               else
> +                       free_pages_exact(ptr, page->private);
> +       }
> +}
> +EXPORT_SYMBOL(__kvfree); 

NAK, I really utterly dislike that inatomic argument. The alloc side
doesn't function in atomic context either. Please keep the thing
symmetric in that regards.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCHv8 04/11] ib_core: IBoE CMA device binding
From: Eli Cohen @ 2010-05-13 13:07 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Linux RDMA list, Eli Cohen, ewg
In-Reply-To: <adaocgkubbq.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Wed, May 12, 2010 at 01:14:33PM -0700, Roland Dreier wrote:
>  > Multicast GIDs are always mapped to multicast MACs
>  > as is done in IPv6. Some helper functions are added to ib_addr.h.  IPv4
>  > multicast is enabled by translating IPv4 multicast addresses to IPv6 multicast
>  > as described in
>  > http://www.mail-archive.com/ipng-Va8kgSFw1KyzQ7sTrsxvMkEOCMrvLtNR@public.gmane.org/msg02134.html.
> 
> I guess it's a bit unfortunate that the RoCE annex completely ignored
> how to map multicast GIDs to ethernet addresses (I suppose as part of
> the larger decision to ignore address resolution entirely).  Anyway,
> looking at the email message you reference, it seems to be someone
> asking what the right way to map IPv4 multicast addresses to IPv6
> addresses is.  Is there a more definitive document you can point to?

I am not aware of any definitive document that addresses this issue.

> 
> It seems that unfortunately the way the layering of addresses is done,
> there's no way to just use the standard mapping of IPv4 multicast
> addresses to Ethernet addresses (since IBoE is does addressing via
> the CMA mapping to GIDs followed by an unspecified mapping from GIDs to
> Ethernet addresses).
> 

It is natural to treat gids as IPv6 addresses, so using the standard
mapping from IPv6 multicast addresses to mac addresses seems reasonable.

^ permalink raw reply

* Re: [PATCHv8 09/11] mlx4: Add support for IBoE - address resolution
From: Eli Cohen @ 2010-05-13 11:53 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Linux RDMA list, Eli Cohen, ewg
In-Reply-To: <adabpckuaim.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Wed, May 12, 2010 at 01:32:01PM -0700, Roland Dreier wrote:
>  > +	u8		mac_0_1[2];
>  > +	u8		mac_2_5[4];
> 
> This looks a bit odd.  Any reason why you don't just have "u8 mac[6];"
> in this structure?

We can get rid of this crap. I think the origin of this is a
limitation of ib_pack() to deal with fields that cross a dword
boundary which should not avoid us from using mac[6] here.

^ permalink raw reply

* [PATCH 9/9] ehca: use kvcalloc and kvfree
From: Changli Gao @ 2010-05-13 11:18 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Hoang-Nam Nguyen, Christoph Raisch, Roland Dreier, Sean Hefty,
	Hal Rosenstock, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Changli Gao

use kvcalloc and kvfree

use kvcalloc and kvfree, but I haven't test it, as I don't have that machine.

Signed-off-by: Changli Gao <xiaosuo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
----
 drivers/infiniband/hw/ehca/ipz_pt_fn.c |   22 +++++-----------------
 1 file changed, 5 insertions(+), 17 deletions(-)
diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
index 1596e30..f0e7bd4 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.c
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
@@ -38,8 +38,6 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include <linux/slab.h>
-
 #include "ehca_tools.h"
 #include "ipz_pt_fn.h"
 #include "ehca_classes.h"
@@ -222,15 +220,11 @@ int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue,
 	queue->small_page = NULL;
 
 	/* allocate queue page pointers */
-	queue->queue_pages = kmalloc(nr_of_pages * sizeof(void *), GFP_KERNEL);
+	queue->queue_pages = kvcalloc(nr_of_pages, sizeof(void *));
 	if (!queue->queue_pages) {
-		queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *));
-		if (!queue->queue_pages) {
-			ehca_gen_err("Couldn't allocate queue page list");
-			return 0;
-		}
+		ehca_gen_err("Couldn't allocate queue page list");
+		return 0;
 	}
-	memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *));
 
 	/* allocate actual queue pages */
 	if (is_small) {
@@ -245,10 +239,7 @@ int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue,
 ipz_queue_ctor_exit0:
 	ehca_gen_err("Couldn't alloc pages queue=%p "
 		 "nr_of_pages=%x",  queue, nr_of_pages);
-	if (is_vmalloc_addr(queue->queue_pages))
-		vfree(queue->queue_pages);
-	else
-		kfree(queue->queue_pages);
+	kvfree(queue->queue_pages);
 
 	return 0;
 }
@@ -270,10 +261,7 @@ int ipz_queue_dtor(struct ehca_pd *pd, struct ipz_queue *queue)
 			free_page((unsigned long)queue->queue_pages[i]);
 	}
 
-	if (is_vmalloc_addr(queue->queue_pages))
-		vfree(queue->queue_pages);
-	else
-		kfree(queue->queue_pages);
+	kvfree(queue->queue_pages);
 
 	return 1;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCHv8 08/11] mlx4: Allow interfaces to correspond to each other
From: Eli Cohen @ 2010-05-13 11:13 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <adafx1wuald.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>

On Wed, May 12, 2010 at 01:30:22PM -0700, Roland Dreier wrote:
>  > +void *mlx4_get_prot_dev(struct mlx4_dev *dev, enum mlx4_prot proto, int port)
>  > +{
>  > +	return mlx4_find_get_prot_dev(dev, proto, port);
>  > +}
>  > +EXPORT_SYMBOL(mlx4_get_prot_dev);
> 
> Not sure I understand why you have a wrapper to call another function
> with exactly the same parameters?  Can't we get rid of this and just
> rename mlx4_find_get_prot_dev() to mlx4_get_prot_dev()?

Sure, let's change that.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox