All of lore.kernel.org
 help / color / mirror / Atom feed
* RE: [patch] hugetlb strict commit accounting
From: Chen, Kenneth W @ 2006-03-09 12:02 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: wli, 'Andrew Morton', linux-mm, linux-kernel
In-Reply-To: <20060309112635.GB9479@localhost.localdomain>

David Gibson wrote on Thursday, March 09, 2006 3:27 AM
> Again, there are no changes to the fault handler.  Including the
> promised changes which would mean my instantiation serialization path
> isn't necessary ;-).

This is the major portion that I omitted in the first patch and is the
real kicker that fulfills the promise of guaranteed available hugetlb
page for shared mapping.

You can shower me all over on the lock protection :-) yes, this is not
perfect and was the reason I did not post it earlier, but I want to give
you the concept on how I envision this route would work.

Again PRIVATE mapping is busted, you can't count them from inode.  You
would have to count them via mm_struct (I think).

- Ken

Note: definition of "reservation" in earlier patch is total hugetlb pages
needed for that file, including the one that is already faulted in.  Maybe
that throw you off a bit because I'm guessing your definition is "needed
in the future" and probably you are looking for a decrement of the counter
in the fault path?


--- ./mm/hugetlb.c.orig	2006-03-09 04:46:38.965547435 -0800
+++ ./mm/hugetlb.c	2006-03-09 04:48:20.804413375 -0800
@@ -196,6 +196,8 @@ static unsigned long set_max_huge_pages(
 		enqueue_huge_page(page);
 		spin_unlock(&hugetlb_lock);
 	}
+	if (count < atomic_read(&resv_huge_pages))
+		count = atomic_read(&resv_huge_pages);
 	if (count >= nr_huge_pages)
 		return nr_huge_pages;
 


^ permalink raw reply

* Re: [RFC] Badness in __mutex_unlock_slowpath with XFS stress tests
From: Christoph Hellwig @ 2006-03-09 12:03 UTC (permalink / raw)
  To: Suzuki; +Cc: linux-fsdevel, linux-aio kvack.org, lkml, suparna, akpm,
	linux-xfs
In-Reply-To: <440FDF3E.8060400@in.ibm.com>

On Thu, Mar 09, 2006 at 01:24:38PM +0530, Suzuki wrote:
> 
> Missed out linux-aio & linux-fs-devel lists. Forwarding.
> 
> Comments ?

I've seen this too.  The problem is that __generic_file_aio_read can return
with or without the i_mutex locked in the direct I/O case for filesystems
that set DIO_OWN_LOCKING.  It's a nasty one and I haven't found a better solution
than copying lots of code from filemap.c into xfs.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply

* Re: [PATCH: 004/017](RFC) Memory hotplug for new nodes v.3. (generic alloc pgdat)
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308212719.002A.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> For node hotplug, basically we have to allocate new pgdat.
> But, there are several types of implementations of pgdat.
> 
> 1. Allocate only pgdat.
>    This style allocate only pgdat area.
>    And its address is recorded in node_data[].
>    It is most popular style.
> 
> 2. Static array of pgdat
>    In this case, all of pgdats are static array.
>    Some archs use this style.
> 
> 3. Allocate not only pgdat, but also per node data.
>    To increase performance, each node has copy of some data as
>    a per node data. So, this area must be allocated too.
> 
>    Ia64 is this style. Ia64 has the copies of node_data[] array
>    on each per node data to increase performance.
> 
> In this series of patches, treat (1) as generic arch.
> 
> generic archs can use generic function. (2) and (3) should have
> its own if necessary. 
> 
> This patch defines pgdat allocator.
> Updating NODE_DATA() macro function is in other patch.
> 
> ( I'll post another patch for (3).
>   I don't know (2) which can use memory hotplug.
>   So, there is not patch for (2). )
> 
> ...
>
> +#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
> +/*
> + * For supporint node-hotadd, we have to allocate new pgdat.
> + *
> + * If an arch have generic style NODE_DATA(),
> + * node_data[nid] = kzalloc() works well . But it depends on each arch.
> + *
> + * In general, generic_alloc_nodedata() is used.
> + * generic...is a local function in mm/memory_hotplug.c
> + *
> + * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
> + *
> + */
> +extern struct pglist_data * arch_alloc_nodedata(int nid);
> +extern void arch_free_nodedata(pg_data_t *pgdat);
> +
> +#else /* !CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> +#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
> +#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
> +
> +#ifdef CONFIG_NUMA
> +/*
> + * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
> + */
> +static inline struct pglist_data *generic_alloc_nodedata(int nid)
> +{
> +	return kzalloc(sizeof(struct pglist_data), GFP_ATOMIC);
> +}

>From an interface design point of view it's usually best to pass the
gfp_flags ito a function which performs memory allocation, rather than
assuming the worst-case like this.

If it's known that callers of generic_alloc_nodedata() can just never ever
be permitted to sleep then OK.  But GFP_KERNEL allocations are always
preferable.

> +/*
> + * This definition is just for error path in node hotadd.
> + * For node hotremove, we have to replace this.
> + */
> +static inline void generic_free_nodedata(struct pglist_data *pgdat)
> +{
> +	kfree(pgdat);
> +}
> +
> +#else /* !CONFIG_NUMA */
> +/* never called */
> +static inline struct pglist_data *generic_alloc_nodedata(int nid)
> +{
> +	BUG();
> +	return NULL;
> +}
> +static inline void generic_free_nodedata(struct pglist_data *pgdat)
> +{
> +}
> +#endif /* CONFIG_NUMA */
> +#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> +

Should the patch provide stubs for generic_alloc_nodedata() and
generic_alloc_nodedata() if !CONFIG_HAVE_ARCH_NODEDATA_EXTENSION?

(If all callers are also inside #ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
then the answer would be "no").


^ permalink raw reply

* Re: [PATCH: 003/017](RFC) Memory hotplug for new nodes v.3.(get node id at probe memory)
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308212646.0028.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> When CONFIG_NUMA && CONFIG_ARCH_MEMORY_PROBE, nid should be defined
>  before calling add_memory_node(nid, start, size).
> 
>  Each arch , which supports CONFIG_NUMA && ARCH_MEMORY_PROBE, should
>  define arch_nid_probe(paddr);
> 
>  Powerpc has nice function. X86_64 has not.....

This patch uses an odd mixture of __devinit and <nothing-at-all> in
arch/x86_64/mm/init.c.  I guess it should be using __meminit
throughout.

^ permalink raw reply

* Re: [PATCH: 000/017] (RFC)Memory hotplug for new nodes v.3.
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-kernel, linux-mm,
	linux-ia64
In-Reply-To: <20060308212316.0022.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> I'll post newest patches for memory hotadd with pgdat allocation as V3.
>  There are many changes to make more common code.

General comments:

- Thanks for working against -mm.  It can be a bit of a pain, but it
  eases staging and integration later on.

- Please review all the code to check that all those functions which can
  be made static are indeed made static.  I see quite a few global
  functions there.

- Make sure that all functions which can be tagged __meminit are so tagged.

- It would be useful to build a CONFIG_MEMORY_HOTPLUG=n kernel both with
  and without the patchsets and to publish and maintain the increase in
  code size.  Ideally that increase will be zero.  Probably it won't be,
  and it'd be nice to understand why, and to minimise it.

- Arch issues:

  - Which architectures is this patchset aimed at and tested on?

  - Which other architectures might be able to use this code in the
    future?  Because we should ask the maintainers of those other
    architectures to take a look at the changes.

- What locking does node hot-add use?  There are quite a few places in
  the kernel which cheerfully iterate across node lists while assuming that
  they won't change.  The usage of stop_machine_run() is supposed to cover
  all that, I assume?


^ permalink raw reply

* RE: [patch] hugetlb strict commit accounting
From: Chen, Kenneth W @ 2006-03-09 12:02 UTC (permalink / raw)
  To: 'David Gibson'
  Cc: wli, 'Andrew Morton', linux-mm, linux-kernel
In-Reply-To: <20060309112635.GB9479@localhost.localdomain>

David Gibson wrote on Thursday, March 09, 2006 3:27 AM
> Again, there are no changes to the fault handler.  Including the
> promised changes which would mean my instantiation serialization path
> isn't necessary ;-).

This is the major portion that I omitted in the first patch and is the
real kicker that fulfills the promise of guaranteed available hugetlb
page for shared mapping.

You can shower me all over on the lock protection :-) yes, this is not
perfect and was the reason I did not post it earlier, but I want to give
you the concept on how I envision this route would work.

Again PRIVATE mapping is busted, you can't count them from inode.  You
would have to count them via mm_struct (I think).

- Ken

Note: definition of "reservation" in earlier patch is total hugetlb pages
needed for that file, including the one that is already faulted in.  Maybe
that throw you off a bit because I'm guessing your definition is "needed
in the future" and probably you are looking for a decrement of the counter
in the fault path?


--- ./mm/hugetlb.c.orig	2006-03-09 04:46:38.965547435 -0800
+++ ./mm/hugetlb.c	2006-03-09 04:48:20.804413375 -0800
@@ -196,6 +196,8 @@ static unsigned long set_max_huge_pages(
 		enqueue_huge_page(page);
 		spin_unlock(&hugetlb_lock);
 	}
+	if (count < atomic_read(&resv_huge_pages))
+		count = atomic_read(&resv_huge_pages);
 	if (count >= nr_huge_pages)
 		return nr_huge_pages;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH: 013/017](RFC) Memory hotplug for new nodes v.3. (changes from __init to __meminit)
From: Andrew Morton @ 2006-03-09 12:01 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308213446.003C.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> Index: pgdat6/include/linux/bootmem.h
>  ===================================================================
>  --- pgdat6.orig/include/linux/bootmem.h	2006-03-06 18:25:37.000000000 +0900
>  +++ pgdat6/include/linux/bootmem.h	2006-03-06 21:08:05.000000000 +0900
>  @@ -88,8 +88,8 @@ static inline void *alloc_remap(int nid,
>   }
>   #endif
>   
>  -extern unsigned long __initdata nr_kernel_pages;
>  -extern unsigned long __initdata nr_all_pages;
>  +extern unsigned long __meminitdata nr_kernel_pages;
>  +extern unsigned long __meminitdata nr_all_pages;

Declaring the section for externs like this isn't very useful really.  I
don't think there's any way in which the compiler can check it and the
linker will look at the definition, not at the declarations.  And if we add
these, we just need to keep the declarations updated for cosmetic reasons
as you've discovered.

So I'd recommend you simply remove the __initdata tags here and leave it at
that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH: 013/017](RFC) Memory hotplug for new nodes v.3.
From: Andrew Morton @ 2006-03-09 12:01 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308213446.003C.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> Index: pgdat6/include/linux/bootmem.h
>  =================================>  --- pgdat6.orig/include/linux/bootmem.h	2006-03-06 18:25:37.000000000 +0900
>  +++ pgdat6/include/linux/bootmem.h	2006-03-06 21:08:05.000000000 +0900
>  @@ -88,8 +88,8 @@ static inline void *alloc_remap(int nid,
>   }
>   #endif
>   
>  -extern unsigned long __initdata nr_kernel_pages;
>  -extern unsigned long __initdata nr_all_pages;
>  +extern unsigned long __meminitdata nr_kernel_pages;
>  +extern unsigned long __meminitdata nr_all_pages;

Declaring the section for externs like this isn't very useful really.  I
don't think there's any way in which the compiler can check it and the
linker will look at the definition, not at the declarations.  And if we add
these, we just need to keep the declarations updated for cosmetic reasons
as you've discovered.

So I'd recommend you simply remove the __initdata tags here and leave it at
that.


^ permalink raw reply

* Re: [PATCH: 012/017](RFC) Memory hotplug for new nodes v.3. (rebuild zonelists after online pages)
From: Andrew Morton @ 2006-03-09 12:01 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308213410.003A.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> In current code, zonelist is considered to be build once, no modification.
>  But MemoryHotplug can add new zone/pgdat. It must be updated.
> 
>  This patch modifies build_all_zonelists(). 
>  By this, build_all_zonelist() can reconfig pgdat's zonelists.
> 
>  To update them safety, this patch use stop_machine_run().

Yup, that's a good way of doing it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH: 012/017](RFC) Memory hotplug for new nodes v.3.
From: Andrew Morton @ 2006-03-09 12:01 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308213410.003A.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> In current code, zonelist is considered to be build once, no modification.
>  But MemoryHotplug can add new zone/pgdat. It must be updated.
> 
>  This patch modifies build_all_zonelists(). 
>  By this, build_all_zonelist() can reconfig pgdat's zonelists.
> 
>  To update them safety, this patch use stop_machine_run().

Yup, that's a good way of doing it.

^ permalink raw reply

* Re: [PATCH: 011/017](RFC) Memory hotplug for new nodes v.3. (start kswapd)
From: Andrew Morton @ 2006-03-09 12:01 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308213333.0038.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
>  +EXPORT_SYMBOL(kswapd_run);

Does this need to be exported to modules?

If so, EXPORT_SYMBOL_GPL would be preferred, please.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH: 011/017](RFC) Memory hotplug for new nodes v.3. (start
From: Andrew Morton @ 2006-03-09 12:01 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308213333.0038.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
>  +EXPORT_SYMBOL(kswapd_run);

Does this need to be exported to modules?

If so, EXPORT_SYMBOL_GPL would be preferred, please.

^ permalink raw reply

* Re: [PATCH: 010/017](RFC) Memory hotplug for new nodes v.3. (allocate wait table)
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308213301.0036.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
>  +		/* we can use kmalloc() in run time */
>  +		do {
>  +			table_size = zone->wait_table_size
>  +					* sizeof(wait_queue_head_t);
>  +			zone->wait_table = kmalloc(table_size, GFP_ATOMIC);

Again, GFP_KERNEL would be better is possible.

Won't this place the node's wait_table into a different node's memory?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH: 010/017](RFC) Memory hotplug for new nodes v.3.
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308213301.0036.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
>  +		/* we can use kmalloc() in run time */
>  +		do {
>  +			table_size = zone->wait_table_size
>  +					* sizeof(wait_queue_head_t);
>  +			zone->wait_table = kmalloc(table_size, GFP_ATOMIC);

Again, GFP_KERNEL would be better is possible.

Won't this place the node's wait_table into a different node's memory?

^ permalink raw reply

* Re: [PATCH: 004/017](RFC) Memory hotplug for new nodes v.3. (generic alloc pgdat)
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308212719.002A.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> For node hotplug, basically we have to allocate new pgdat.
> But, there are several types of implementations of pgdat.
> 
> 1. Allocate only pgdat.
>    This style allocate only pgdat area.
>    And its address is recorded in node_data[].
>    It is most popular style.
> 
> 2. Static array of pgdat
>    In this case, all of pgdats are static array.
>    Some archs use this style.
> 
> 3. Allocate not only pgdat, but also per node data.
>    To increase performance, each node has copy of some data as
>    a per node data. So, this area must be allocated too.
> 
>    Ia64 is this style. Ia64 has the copies of node_data[] array
>    on each per node data to increase performance.
> 
> In this series of patches, treat (1) as generic arch.
> 
> generic archs can use generic function. (2) and (3) should have
> its own if necessary. 
> 
> This patch defines pgdat allocator.
> Updating NODE_DATA() macro function is in other patch.
> 
> ( I'll post another patch for (3).
>   I don't know (2) which can use memory hotplug.
>   So, there is not patch for (2). )
> 
> ...
>
> +#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
> +/*
> + * For supporint node-hotadd, we have to allocate new pgdat.
> + *
> + * If an arch have generic style NODE_DATA(),
> + * node_data[nid] = kzalloc() works well . But it depends on each arch.
> + *
> + * In general, generic_alloc_nodedata() is used.
> + * generic...is a local function in mm/memory_hotplug.c
> + *
> + * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
> + *
> + */
> +extern struct pglist_data * arch_alloc_nodedata(int nid);
> +extern void arch_free_nodedata(pg_data_t *pgdat);
> +
> +#else /* !CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> +#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
> +#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
> +
> +#ifdef CONFIG_NUMA
> +/*
> + * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
> + */
> +static inline struct pglist_data *generic_alloc_nodedata(int nid)
> +{
> +	return kzalloc(sizeof(struct pglist_data), GFP_ATOMIC);
> +}

>From an interface design point of view it's usually best to pass the
gfp_flags ito a function which performs memory allocation, rather than
assuming the worst-case like this.

If it's known that callers of generic_alloc_nodedata() can just never ever
be permitted to sleep then OK.  But GFP_KERNEL allocations are always
preferable.

> +/*
> + * This definition is just for error path in node hotadd.
> + * For node hotremove, we have to replace this.
> + */
> +static inline void generic_free_nodedata(struct pglist_data *pgdat)
> +{
> +	kfree(pgdat);
> +}
> +
> +#else /* !CONFIG_NUMA */
> +/* never called */
> +static inline struct pglist_data *generic_alloc_nodedata(int nid)
> +{
> +	BUG();
> +	return NULL;
> +}
> +static inline void generic_free_nodedata(struct pglist_data *pgdat)
> +{
> +}
> +#endif /* CONFIG_NUMA */
> +#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> +

Should the patch provide stubs for generic_alloc_nodedata() and
generic_alloc_nodedata() if !CONFIG_HAVE_ARCH_NODEDATA_EXTENSION?

(If all callers are also inside #ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
then the answer would be "no").

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH: 004/017](RFC) Memory hotplug for new nodes v.3.
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308212719.002A.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> For node hotplug, basically we have to allocate new pgdat.
> But, there are several types of implementations of pgdat.
> 
> 1. Allocate only pgdat.
>    This style allocate only pgdat area.
>    And its address is recorded in node_data[].
>    It is most popular style.
> 
> 2. Static array of pgdat
>    In this case, all of pgdats are static array.
>    Some archs use this style.
> 
> 3. Allocate not only pgdat, but also per node data.
>    To increase performance, each node has copy of some data as
>    a per node data. So, this area must be allocated too.
> 
>    Ia64 is this style. Ia64 has the copies of node_data[] array
>    on each per node data to increase performance.
> 
> In this series of patches, treat (1) as generic arch.
> 
> generic archs can use generic function. (2) and (3) should have
> its own if necessary. 
> 
> This patch defines pgdat allocator.
> Updating NODE_DATA() macro function is in other patch.
> 
> ( I'll post another patch for (3).
>   I don't know (2) which can use memory hotplug.
>   So, there is not patch for (2). )
> 
> ...
>
> +#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
> +/*
> + * For supporint node-hotadd, we have to allocate new pgdat.
> + *
> + * If an arch have generic style NODE_DATA(),
> + * node_data[nid] = kzalloc() works well . But it depends on each arch.
> + *
> + * In general, generic_alloc_nodedata() is used.
> + * generic...is a local function in mm/memory_hotplug.c
> + *
> + * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
> + *
> + */
> +extern struct pglist_data * arch_alloc_nodedata(int nid);
> +extern void arch_free_nodedata(pg_data_t *pgdat);
> +
> +#else /* !CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> +#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
> +#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
> +
> +#ifdef CONFIG_NUMA
> +/*
> + * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
> + */
> +static inline struct pglist_data *generic_alloc_nodedata(int nid)
> +{
> +	return kzalloc(sizeof(struct pglist_data), GFP_ATOMIC);
> +}

From an interface design point of view it's usually best to pass the
gfp_flags ito a function which performs memory allocation, rather than
assuming the worst-case like this.

If it's known that callers of generic_alloc_nodedata() can just never ever
be permitted to sleep then OK.  But GFP_KERNEL allocations are always
preferable.

> +/*
> + * This definition is just for error path in node hotadd.
> + * For node hotremove, we have to replace this.
> + */
> +static inline void generic_free_nodedata(struct pglist_data *pgdat)
> +{
> +	kfree(pgdat);
> +}
> +
> +#else /* !CONFIG_NUMA */
> +/* never called */
> +static inline struct pglist_data *generic_alloc_nodedata(int nid)
> +{
> +	BUG();
> +	return NULL;
> +}
> +static inline void generic_free_nodedata(struct pglist_data *pgdat)
> +{
> +}
> +#endif /* CONFIG_NUMA */
> +#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
> +

Should the patch provide stubs for generic_alloc_nodedata() and
generic_alloc_nodedata() if !CONFIG_HAVE_ARCH_NODEDATA_EXTENSION?

(If all callers are also inside #ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
then the answer would be "no").


^ permalink raw reply

* Re: [LARTC] Dual ISP routing and NAT problem
From: Markus Schulz @ 2006-03-09 12:00 UTC (permalink / raw)
  To: lartc
In-Reply-To: <44071C23.7040206@chello.at>

Am Donnerstag, 2. März 2006 17:24 schrieb Mart Frauenlob:
> Hello newsgroup,
>
> I hope somebody with more routing experience then me can help me with
> the problem I have.
>
> The setup is as described below. A dual internet provider routing,
> multiple local area networks, and a dmz network with one public and
> one private ip range.
> I followed the instructions at lartc.org, and so far everything is
> working. The default route is via 'PROV_STATIC', only packets comming
> from LAN 192.168.111.0/24 are routed via 'PROV_DSL'.
> Now if I want to do network address translation via iptables for
> certain traffic coming into the dsl interface ppp0,
> packets never reach their destination.
> DNAT into DMZ or any of the LANs over the eth0 interface works as
> expected. So for example applying a DNAT rule like:
> 'iptables -t nat -A PREROUTING -i ppp0 -d 217.92.8.242 -p tcp --dport
> 80 -j DNAT --to-destination 62.155.170.254'
> fails.
>
> Same for NAT attempts into the LANs 192.168.112.0/24 and
> 192.168.113.0/24. While DNAT into LAN 192.168.111.0/24 works
> perfectly.
>
> So I think the problem is that traffic from the DMZ and those two
> LANs have the ip rules applied to end up in the the table
> 'PROV_STATIC'. Which usually is what I want, but not in this case,
> where I want port or protocol specific traffic to be routed
> differntly.
> Is there a way to 'override' the default routing behaviour for i.e.
> http traffic?

yes, mark the traffic with iptables and route them with a higher prio 
routing rule differently. 
for example:
iptables -t mangle -I PREROUTING ... -j MARK --set-mark 0x01

#insert rule for all marked packets to look at table 100 for routing 
entries.
ip rule add prio $PRIO fwmark 0x01 table 100
#insert your routing entries for alle marked packets into table 100
ip route add <...> table 100
...

$IP must changed according your setup.
the $PRIO must be changed to take at the right place. If i understand 
your problem correctly the prio must be below 32759 and $IP=all. But 
i'm not sure if i understand it right.

-- 
Markus Schulz

modprobe windows
modprobe: This module will TAINT the kernel
_______________________________________________
LARTC mailing list
LARTC@mailman.ds9a.nl
http://mailman.ds9a.nl/cgi-bin/mailman/listinfo/lartc

^ permalink raw reply

* Re: [PATCH: 003/017](RFC) Memory hotplug for new nodes v.3.(get node id at probe memory)
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308212646.0028.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> When CONFIG_NUMA && CONFIG_ARCH_MEMORY_PROBE, nid should be defined
>  before calling add_memory_node(nid, start, size).
> 
>  Each arch , which supports CONFIG_NUMA && ARCH_MEMORY_PROBE, should
>  define arch_nid_probe(paddr);
> 
>  Powerpc has nice function. X86_64 has not.....

This patch uses an odd mixture of __devinit and <nothing-at-all> in
arch/x86_64/mm/init.c.  I guess it should be using __meminit
throughout.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH: 003/017](RFC) Memory hotplug for new nodes v.3.(get
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-ia64, linux-kernel,
	linux-mm
In-Reply-To: <20060308212646.0028.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> When CONFIG_NUMA && CONFIG_ARCH_MEMORY_PROBE, nid should be defined
>  before calling add_memory_node(nid, start, size).
> 
>  Each arch , which supports CONFIG_NUMA && ARCH_MEMORY_PROBE, should
>  define arch_nid_probe(paddr);
> 
>  Powerpc has nice function. X86_64 has not.....

This patch uses an odd mixture of __devinit and <nothing-at-all> in
arch/x86_64/mm/init.c.  I guess it should be using __meminit
throughout.

^ permalink raw reply

* Re: [PATCH: 000/017] (RFC)Memory hotplug for new nodes v.3.
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-kernel, linux-mm,
	linux-ia64
In-Reply-To: <20060308212316.0022.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> I'll post newest patches for memory hotadd with pgdat allocation as V3.
>  There are many changes to make more common code.

General comments:

- Thanks for working against -mm.  It can be a bit of a pain, but it
  eases staging and integration later on.

- Please review all the code to check that all those functions which can
  be made static are indeed made static.  I see quite a few global
  functions there.

- Make sure that all functions which can be tagged __meminit are so tagged.

- It would be useful to build a CONFIG_MEMORY_HOTPLUG=n kernel both with
  and without the patchsets and to publish and maintain the increase in
  code size.  Ideally that increase will be zero.  Probably it won't be,
  and it'd be nice to understand why, and to minimise it.

- Arch issues:

  - Which architectures is this patchset aimed at and tested on?

  - Which other architectures might be able to use this code in the
    future?  Because we should ask the maintainers of those other
    architectures to take a look at the changes.

- What locking does node hot-add use?  There are quite a few places in
  the kernel which cheerfully iterate across node lists while assuming that
  they won't change.  The usage of stop_machine_run() is supposed to cover
  all that, I assume?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH: 000/017] (RFC)Memory hotplug for new nodes v.3.
From: Andrew Morton @ 2006-03-09 12:00 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: tony.luck, ak, jschopp, haveblue, linux-kernel, linux-mm,
	linux-ia64
In-Reply-To: <20060308212316.0022.Y-GOTO@jp.fujitsu.com>

Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
> I'll post newest patches for memory hotadd with pgdat allocation as V3.
>  There are many changes to make more common code.

General comments:

- Thanks for working against -mm.  It can be a bit of a pain, but it
  eases staging and integration later on.

- Please review all the code to check that all those functions which can
  be made static are indeed made static.  I see quite a few global
  functions there.

- Make sure that all functions which can be tagged __meminit are so tagged.

- It would be useful to build a CONFIG_MEMORY_HOTPLUG=n kernel both with
  and without the patchsets and to publish and maintain the increase in
  code size.  Ideally that increase will be zero.  Probably it won't be,
  and it'd be nice to understand why, and to minimise it.

- Arch issues:

  - Which architectures is this patchset aimed at and tested on?

  - Which other architectures might be able to use this code in the
    future?  Because we should ask the maintainers of those other
    architectures to take a look at the changes.

- What locking does node hot-add use?  There are quite a few places in
  the kernel which cheerfully iterate across node lists while assuming that
  they won't change.  The usage of stop_machine_run() is supposed to cover
  all that, I assume?


^ permalink raw reply

* Re: [PATCH] Busy inodes after unmount, be more verbose in generic_shutdown_super
From: Kirill Korotaev @ 2006-03-09 12:03 UTC (permalink / raw)
  To: Neil Brown
  Cc: Kirill Korotaev, linux-kernel, Andrew Morton, Olaf Hering,
	Jan Blunck, Al Viro
In-Reply-To: <17422.5456.875119.579068@cse.unsw.edu.au>

[-- Attachment #1: Type: text/plain, Size: 969 bytes --]

>>>>In general your patch is still does what mine do, so I will be happy if 
>>>>any of this is commited mainstream. In future, please, keep the 
>>>>reference to original authors, this will also make sure that I'm on CC 
>>>>if something goes wrong.
>>>
>>>
>>>Sorry: which 'original author' did I miss ?
>>
>>I mean, it is better to mention original author 
>>(http://marc.theaimsgroup.com/?l=linux-kernel&m=114123870406751&w=2) in 
>>patch description, as it makes sure that he will be on CC if this patch 
>>will be discussed later again. My patch fixing this race was in -mm tree 
>>for half a year already.
> 
> 
> Which patch is that?  The race still seems to be present in -mm.
I attached you my latest patch for 2.6.16-rc5

in -mm tree these patch was until the discussion arose again recently:
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14/2.6.14-mm2/broken-out/fix-of-dcache-race-leading-to-busy-inodes-on-umount.patch

Thanks,
Kirill

[-- Attachment #2: diff-ms-dcache-race-20060303 --]
[-- Type: text/plain, Size: 10441 bytes --]

--- linux-2.6.15.orig/fs/dcache.c	2006-01-03 06:21:10.000000000 +0300
+++ linux-2.6.15-025stab015-dcache-race/fs/dcache.c	2006-03-03 16:22:38.862706448 +0300
@@ -114,6 +114,75 @@ static inline void dentry_iput(struct de
 	}
 }
 
+struct dcache_shrinker {
+	struct list_head list;
+	struct dentry *dentry;
+};
+
+DECLARE_WAIT_QUEUE_HEAD(dcache_shrinker_wq);
+
+/* called under dcache_lock */
+static void dcache_shrinker_add(struct dcache_shrinker *ds,
+		struct dentry *parent, struct dentry *dentry)
+{
+	struct super_block *sb;
+
+	sb = parent->d_sb;
+	ds->dentry = parent;
+	list_add(&ds->list, &sb->s_dshrinkers);
+}
+
+/* called under dcache_lock */
+static void dcache_shrinker_del(struct dcache_shrinker *ds)
+{
+	if (ds == NULL || list_empty(&ds->list))
+		return;
+
+	list_del_init(&ds->list);
+	wake_up_all(&dcache_shrinker_wq);
+}
+
+/* called under dcache_lock, drops inside */
+static void dcache_shrinker_wait(struct super_block *sb)
+{
+	DECLARE_WAITQUEUE(wq, current);
+
+	__set_current_state(TASK_UNINTERRUPTIBLE);
+	add_wait_queue(&dcache_shrinker_wq, &wq);
+	spin_unlock(&dcache_lock);
+
+	schedule();
+	remove_wait_queue(&dcache_shrinker_wq, &wq);
+	__set_current_state(TASK_RUNNING);
+}
+
+void dcache_shrinker_wait_sb(struct super_block *sb)
+{
+	/* the root dentry can be held in dput_recursive */
+	spin_lock(&dcache_lock);
+	while (!list_empty(&sb->s_dshrinkers)) {
+		dcache_shrinker_wait(sb);
+		spin_lock(&dcache_lock);
+	}
+	spin_unlock(&dcache_lock);
+}
+
+/* dcache_lock protects shrinker's list */
+static void shrink_dcache_racecheck(struct dentry *parent, int *racecheck)
+{
+	struct super_block *sb;
+	struct dcache_shrinker *ds;
+
+	sb = parent->d_sb;
+	list_for_each_entry(ds, &sb->s_dshrinkers, list) {
+		/* is one of dcache shrinkers working on the dentry? */
+		if (ds->dentry == parent) {
+			*racecheck = 1;
+			break;
+		}
+	}
+}
+
 /* 
  * This is dput
  *
@@ -132,8 +201,9 @@ static inline void dentry_iput(struct de
  */
 
 /*
- * dput - release a dentry
- * @dentry: dentry to release 
+ * dput_recursive - go upward through the dentry tree and release dentries
+ * @dentry: starting dentry
+ * @ds: shrinker to be added to active list (see shrink_dcache_parent)
  *
  * Release a dentry. This will drop the usage count and if appropriate
  * call the dentry unlink method as well as removing it from the queues and
@@ -142,18 +212,15 @@ static inline void dentry_iput(struct de
  *
  * no dcache lock, please.
  */
-
-void dput(struct dentry *dentry)
+static void dput_recursive(struct dentry *dentry, struct dcache_shrinker *ds)
 {
-	if (!dentry)
-		return;
-
-repeat:
 	if (atomic_read(&dentry->d_count) == 1)
 		might_sleep();
 	if (!atomic_dec_and_lock(&dentry->d_count, &dcache_lock))
 		return;
+	dcache_shrinker_del(ds);
 
+repeat:
 	spin_lock(&dentry->d_lock);
 	if (atomic_read(&dentry->d_count)) {
 		spin_unlock(&dentry->d_lock);
@@ -185,6 +252,7 @@ unhash_it:
 
 kill_it: {
 		struct dentry *parent;
+		struct dcache_shrinker lds;
 
 		/* If dentry was on d_lru list
 		 * delete it from there
@@ -194,18 +262,47 @@ kill_it: {
   			dentry_stat.nr_unused--;
   		}
   		list_del(&dentry->d_u.d_child);
+		parent = dentry->d_parent;
+		dcache_shrinker_add(&lds, parent, dentry);
 		dentry_stat.nr_dentry--;	/* For d_free, below */
 		/*drops the locks, at that point nobody can reach this dentry */
 		dentry_iput(dentry);
-		parent = dentry->d_parent;
 		d_free(dentry);
-		if (dentry == parent)
+		if (unlikely(dentry == parent)) {
+			spin_lock(&dcache_lock);
+			dcache_shrinker_del(&lds);
+			spin_unlock(&dcache_lock);
 			return;
+		}
 		dentry = parent;
-		goto repeat;
+		spin_lock(&dcache_lock);
+		dcache_shrinker_del(&lds);
+		if (atomic_dec_and_test(&dentry->d_count))
+			goto repeat;
+		spin_unlock(&dcache_lock);
 	}
 }
 
+/*
+ * dput - release a dentry
+ * @dentry: dentry to release 
+ *
+ * Release a dentry. This will drop the usage count and if appropriate
+ * call the dentry unlink method as well as removing it from the queues and
+ * releasing its resources. If the parent dentries were scheduled for release
+ * they too may now get deleted.
+ *
+ * no dcache lock, please.
+ */
+
+void dput(struct dentry *dentry)
+{
+	if (!dentry)
+		return;
+
+	dput_recursive(dentry, NULL);
+}
+
 /**
  * d_invalidate - invalidate a dentry
  * @dentry: dentry to invalidate
@@ -362,19 +459,23 @@ restart:
  * removed.
  * Called with dcache_lock, drops it and then regains.
  */
-static inline void prune_one_dentry(struct dentry * dentry)
+static void prune_one_dentry(struct dentry * dentry)
 {
 	struct dentry * parent;
+	struct dcache_shrinker ds;
 
 	__d_drop(dentry);
 	list_del(&dentry->d_u.d_child);
+	parent = dentry->d_parent;
+	dcache_shrinker_add(&ds, parent, dentry);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
 	dentry_iput(dentry);
 	parent = dentry->d_parent;
 	d_free(dentry);
 	if (parent != dentry)
-		dput(parent);
+		dput_recursive(parent, &ds);
 	spin_lock(&dcache_lock);
+	dcache_shrinker_del(&ds);
 }
 
 /**
@@ -557,13 +658,12 @@ positive:
  * drop the lock and return early due to latency
  * constraints.
  */
-static int select_parent(struct dentry * parent)
+static int select_parent(struct dentry * parent, int * racecheck)
 {
 	struct dentry *this_parent = parent;
 	struct list_head *next;
 	int found = 0;
 
-	spin_lock(&dcache_lock);
 repeat:
 	next = this_parent->d_subdirs.next;
 resume:
@@ -605,6 +705,9 @@ dentry->d_parent->d_name.name, dentry->d
 #endif
 			goto repeat;
 		}
+
+		if (!found && racecheck != NULL)
+			shrink_dcache_racecheck(dentry, racecheck);
 	}
 	/*
 	 * All done at this level ... ascend and resume the search.
@@ -619,7 +722,6 @@ this_parent->d_parent->d_name.name, this
 		goto resume;
 	}
 out:
-	spin_unlock(&dcache_lock);
 	return found;
 }
 
@@ -632,10 +734,66 @@ out:
  
 void shrink_dcache_parent(struct dentry * parent)
 {
-	int found;
+	int found, r;
 
-	while ((found = select_parent(parent)) != 0)
+	while (1) {
+		spin_lock(&dcache_lock);
+		found = select_parent(parent, NULL);
+		if (found)
+			goto found;
+
+		/*
+		 * try again with a dput_recursive() race check.
+		 * it returns quickly if everything was really shrinked
+		 */
+		r = 0;
+		found = select_parent(parent, &r);
+		if (found)
+			goto found;
+		if (!r)
+			break;
+
+		/* drops the lock inside */
+		dcache_shrinker_wait(parent->d_sb);
+		continue;
+
+found:
+		spin_unlock(&dcache_lock);
 		prune_dcache(found);
+	}
+	spin_unlock(&dcache_lock);
+}
+
+/*
+ * Move any unused anon dentries to the end of the unused list.
+ * called under dcache_lock
+ */
+static int select_anon(struct hlist_head *head, int *racecheck)
+{
+	struct hlist_node *lp;
+	int found = 0;
+
+	hlist_for_each(lp, head) {
+		struct dentry *this = hlist_entry(lp, struct dentry, d_hash);
+		if (!list_empty(&this->d_lru)) {
+			dentry_stat.nr_unused--;
+			list_del_init(&this->d_lru);
+		}
+
+		/* 
+		 * move only zero ref count dentries to the end 
+		 * of the unused list for prune_dcache
+		 */
+		if (!atomic_read(&this->d_count)) {
+			list_add_tail(&this->d_lru, &dentry_unused);
+			dentry_stat.nr_unused++;
+			found++;
+		}
+
+		if (!found && racecheck != NULL)
+			shrink_dcache_racecheck(this, racecheck);
+	}
+	return found;
 }
 
 /**
@@ -648,33 +806,36 @@ void shrink_dcache_parent(struct dentry 
  * done under dcache_lock.
  *
  */
-void shrink_dcache_anon(struct hlist_head *head)
+void shrink_dcache_anon(struct super_block *sb)
 {
-	struct hlist_node *lp;
-	int found;
-	do {
-		found = 0;
+	int found, r;
+
+	while (1) {
 		spin_lock(&dcache_lock);
-		hlist_for_each(lp, head) {
-			struct dentry *this = hlist_entry(lp, struct dentry, d_hash);
-			if (!list_empty(&this->d_lru)) {
-				dentry_stat.nr_unused--;
-				list_del_init(&this->d_lru);
-			}
+		found = select_anon(&sb->s_anon, NULL);
+		if (found)
+			goto found;
 
-			/* 
-			 * move only zero ref count dentries to the end 
-			 * of the unused list for prune_dcache
-			 */
-			if (!atomic_read(&this->d_count)) {
-				list_add_tail(&this->d_lru, &dentry_unused);
-				dentry_stat.nr_unused++;
-				found++;
-			}
-		}
+		/*
+		 * try again with a dput_recursive() race check.
+		 * it returns quickly if everything was really shrinked
+		 */
+		r = 0;
+		found = select_anon(&sb->s_anon, &r);
+		if (found)
+			goto found;
+		if (!r)
+			break;
+
+		/* drops the lock inside */
+		dcache_shrinker_wait(sb);
+		continue;
+
+found:
 		spin_unlock(&dcache_lock);
 		prune_dcache(found);
-	} while(found);
+	}
+	spin_unlock(&dcache_lock);
 }
 
 /*
--- linux-2.6.15.orig/fs/super.c	2006-01-03 06:21:10.000000000 +0300
+++ linux-2.6.15-025stab015-dcache-race/fs/super.c	2006-03-03 16:22:38.841709640 +0300
@@ -69,6 +69,7 @@ static struct super_block *alloc_super(v
 		INIT_LIST_HEAD(&s->s_io);
 		INIT_LIST_HEAD(&s->s_files);
 		INIT_LIST_HEAD(&s->s_instances);
+		INIT_LIST_HEAD(&s->s_dshrinkers);
 		INIT_HLIST_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 		init_rwsem(&s->s_umount);
@@ -231,8 +232,9 @@ void generic_shutdown_super(struct super
 	if (root) {
 		sb->s_root = NULL;
 		shrink_dcache_parent(root);
-		shrink_dcache_anon(&sb->s_anon);
+		shrink_dcache_anon(sb);
 		dput(root);
+		dcache_shrinker_wait_sb(sb);
 		fsync_super(sb);
 		lock_super(sb);
 		sb->s_flags &= ~MS_ACTIVE;
--- linux-2.6.15.orig/include/linux/dcache.h	2006-01-03 06:21:10.000000000 +0300
+++ linux-2.6.15-025stab015-dcache-race/include/linux/dcache.h	2006-03-03 16:22:38.843709336 +0300
@@ -209,7 +209,8 @@ extern struct dentry * d_alloc_anon(stru
 extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
 extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
-extern void shrink_dcache_anon(struct hlist_head *);
+extern void shrink_dcache_anon(struct super_block *);
+extern void dcache_shrinker_wait_sb(struct super_block *sb);
 extern int d_invalidate(struct dentry *);
 
 /* only used at mount-time */
--- linux-2.6.15.orig/include/linux/fs.h	2006-01-03 06:21:10.000000000 +0300
+++ linux-2.6.15-025stab015-dcache-race/include/linux/fs.h	2006-03-03 16:22:38.821712680 +0300
@@ -803,6 +803,7 @@ struct super_block {
 	struct list_head	s_io;		/* parked for writeback */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_files;
+	struct list_head	s_dshrinkers;	/* active dcache shrinkers */
 
 	struct block_device	*s_bdev;
 	struct list_head	s_instances;

^ permalink raw reply

* Re: Bug fixes in -mm that should go into 2.6.16
From: Jan Kara @ 2006-03-09 11:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Adrian Bunk, linux-kernel
In-Reply-To: <20060308151845.30b8d672.akpm@osdl.org>

> Adrian Bunk <bunk@stusta.de> wrote:
> >
> > Hi Andrew,
> > 
> > the following two patches in -mm should IMHO go into 2.6.16:
> >   fix-oops-in-invalidate_dquots.patch
> 
> Maybe.  I worry about the intrusiveness versus probability-of-oops.
  Yes, I guess the oops is not very probable - at least the bug was
there unnoticed for several months...
  BTW Recently I found out in discussion with Neil Brown that probably
there is a similar problem with umount. The problem is that an inode in
both generic_delete_inode() and generic_forget_inode() is removed from
i_sb_list and i_list. Then I_FREEING is set and inode_lock released. Now
if umount is called, I did not find anything that protects
invalidate_inodes() from missing those pending inodes. So it could
happen that we succeed with unmounting the filesystem but there are
still some live inodes... So we should either leave those inodes in some
superblock list where invalidate_inodes() can reach them or we should
implement some other measure that blocks umount from proceeding before
all those pending inodes are really processed (one idea was some
active_inode counter in the superblock). And if we solve this problem
for umount, then quota can possibly use similar approach for handling
the problem with invalidate_dquots().
  Any ideas?

								Honza

-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply

* Re: [PATCH] Fix shrink_dcache_parent() against shrink_dcache_memory() race (updated patch)
From: Jan Blunck @ 2006-03-09 11:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: balbir, viro, olh, neilb, dev, bsingharora, linux-kernel
In-Reply-To: <20060309032157.0592153e.akpm@osdl.org>

On Thu, Mar 09, Andrew Morton wrote:

> >  Hmm, right. Andrew, if you want a rediff against -mm just tell me. I'm
> >  actually diff'ing against lates linux-2.6.git.
> 
> I'll work it out.
> 

Ok, so I'll just send an updated patch against linux-2.6.git adressing your
issues later.

> Cosmetically, I don't think wait_on_prunes() should be concerned about
> whether or not it "slept".  That action is not significant and preemptible
> kernels can "sleep" at just about any stage.  So I think the concept of
> "slept" in there should be replaced with, say, "prunes_remaining" or
> something like that.  Consequently the all-important comment over
> wait_on_prunes() should be updated to provide a bit more information about
> the significance of its return value, please.

Ok, will do.

> Also I think there should be some explanation somewhere which describes why
> we can continue to assume that there aren't any prunes left to do after
> wait_on_prunes() has dropped dcache_lock.  I mean, once you've dropped the
> lock it's usually the case that anything which you examined while holding
> that lock now becomes out-of-date and invalid.  I assume the thinking is
> that because there's an unmount in progress, nothing can come in and add
> new dentries?
> 
> IOW: why isn't there a race between wait_on_prunes() and prune_one_dentry()?

Because outside dcache_lock, either the refcount on the parent dentry is wrong
(therefore select_parent() might return 0) and sb_prunes != 0 or the refcount
is correct and the sb_prunes == 0. Since sb_root == NULL when unmounting the
filesystem there are no new dentries coming in.

> Are we all happy with this patch now?

I'll update the comments and come back to you with an updated patch. Then I'm
fine with the patch.

Regards,
	Jan

-- 
Jan Blunck                                               jblunck@suse.de
SuSE LINUX AG - A Novell company
Maxfeldstr. 5                                          +49-911-74053-608
D-90409 Nürnberg                                      http://www.suse.de

^ permalink raw reply

* [PATCH] clone: add refs/remotes/* to Pull: targets when cloning
From: Eric Wong @ 2006-03-09 11:55 UTC (permalink / raw)
  To: git, Junio C Hamano
In-Reply-To: <20060309115452.GA13369@localdomain>

Signed-off-by: Eric Wong <normalperson@yhbt.net>

---

 git-clone.sh |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

be2db2344099b3713c1136e84bab7390b6198895
diff --git a/git-clone.sh b/git-clone.sh
index 4ed861d..a8ab7fd 100755
--- a/git-clone.sh
+++ b/git-clone.sh
@@ -249,6 +249,12 @@ Pull: $head_points_at:$origin" &&
 			test "$origin" = "$head" ||
 			echo "Pull: ${head}:${head}"
 		done >>"$GIT_DIR/remotes/origin"
+		(test -d "$GIT_DIR"/refs/remotes && cd "$GIT_DIR" &&
+		 find "refs/remotes" -type f -print | sed -e 's|^refs/||') |
+		while read ref
+		do
+			echo "Pull: $ref:$ref"
+		done >>"$GIT_DIR/remotes/origin"
 	esac
 
 	case "$no_checkout" in
-- 
1.2.4.ga2910

^ permalink raw reply related


This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.