* Re: [PATCH] vlan: allow VLAN ID 0 to be used
From: David Miller @ 2009-10-27 1:41 UTC (permalink / raw)
To: eric.dumazet; +Cc: benny+usenet, gertjan_hofman, mcarlson, netdev, kaber
In-Reply-To: <4AE64E41.5030607@gmail.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 27 Oct 2009 02:34:57 +0100
> I hope I did not miss another important thing in this second version.
>
> Again, tested on tg3 only, and on net-next-2.6 tree
>
> [PATCH] vlan: allow null VLAN ID to be used
Looks good, applied to net-next-2.6
Someone now needs to convert IXGBE to use VLAN_PRIO_MASK and
VLAN_PRIO_SHIFT instead of it's internal macros.
^ permalink raw reply
* Re: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
From: KOSAKI Motohiro @ 2009-10-27 2:42 UTC (permalink / raw)
To: David Rientjes
Cc: kosaki.motohiro, Mel Gorman, Frans Pop, Jiri Kosina, Sven Geggus,
Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
David Miller, Reinette Chatre, Kalle Valo, Mohamed Abbas,
Jens Axboe, John W. Linville, Pekka Enberg,
Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
Stephan von Krawczynski, Kernel Testers List, netdev, li
In-Reply-To: <alpine.DEB.2.00.0910260005500.15361@chino.kir.corp.google.com>
> On Mon, 26 Oct 2009, KOSAKI Motohiro wrote:
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bf72055..5a27896 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1899,6 +1899,12 @@ rebalance:
> > if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> > /* Wait for some write requests to complete then retry */
> > congestion_wait(BLK_RW_ASYNC, HZ/50);
> > +
> > + /*
> > + * While we wait congestion wait, Amount of free memory can
> > + * be changed dramatically. Thus, we kick kswapd again.
> > + */
> > + wake_all_kswapd(order, zonelist, high_zoneidx);
> > goto rebalance;
> > }
> >
>
> We're blocking to finish writeback of the directly reclaimed memory, why
> do we need to wake kswapd afterwards?
the same reason of "goto restart" case. that's my intention.
if following scenario occur, it is equivalent that we didn't call wake_all_kswapd().
1. call congestion_wait()
2. kswapd reclaimed lots memory and sleep
3. another task consume lots memory
4. wakeup from congestion_wait()
IOW, if we falled into __alloc_pages_slowpath(), we naturally expect
next page_alloc() don't fall into slowpath. however if kswapd end to
its work too early, this assumption isn't true.
Is this too pessimistic assumption?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 3/5] vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
From: KOSAKI Motohiro @ 2009-10-27 2:42 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Frans Pop, Jiri Kosina, Sven Geggus,
Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
Mohamed Abbas, Jens Axboe, John W. Linville, Pekka Enberg,
Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
Stephan von Krawczynski, Kernel Testers List, netdev
In-Reply-To: <1256221356-26049-4-git-send-email-mel@csn.ul.ie>
> When a high-order allocation fails, kswapd is kicked so that it reclaims
> at a higher-order to avoid direct reclaimers stall and to help GFP_ATOMIC
> allocations. Something has changed in recent kernels that affect the timing
> where high-order GFP_ATOMIC allocations are now failing with more frequency,
> particularly under pressure. This patch forces kswapd to notice sooner that
> high-order allocations are occuring.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> mm/vmscan.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64e4388..cd68109 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2016,6 +2016,15 @@ loop_again:
> priority != DEF_PRIORITY)
> continue;
>
> + /*
> + * Exit quickly to restart if it has been indicated
> + * that higher orders are required
> + */
> + if (pgdat->kswapd_max_order > order) {
> + all_zones_ok = 1;
> + goto out;
> + }
> +
> if (!zone_watermark_ok(zone, order,
> high_wmark_pages(zone), end_zone, 0))
> all_zones_ok = 0;
this is simplest patch and seems reasonable.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro>
btw, now balance_pgdat() have too complex flow. at least Vincent was
confused it.
Then, I think kswap_max_order handling should move into balance_pgdat()
at later release.
the following patch addressed my proposed concept.
From 2c5be772f6db25a5ef82975960d0b5788736ec2b Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Mon, 26 Oct 2009 23:25:29 +0900
Subject: [PATCH] kswapd_max_order handling move into balance_pgdat()
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 45 +++++++++++++++++++++------------------------
1 files changed, 21 insertions(+), 24 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64e4388..49001d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1915,7 +1915,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
* interoperates with the page allocator fallback scheme to ensure that aging
* of pages is balanced across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat)
{
int all_zones_ok;
int priority;
@@ -1928,7 +1928,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
.swappiness = vm_swappiness,
- .order = order,
+ .order = 0,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
};
@@ -1938,6 +1938,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
* free_pages == high_wmark_pages(zone).
*/
int temp_priority[MAX_NR_ZONES];
+ int order = 0;
+ int new_order;
loop_again:
total_scanned = 0;
@@ -1945,6 +1947,11 @@ loop_again:
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);
+ new_order = pgdat->kswapd_max_order;
+ pgdat->kswapd_max_order = 0;
+ if (order < new_order)
+ order = sc.order = new_order;
+
for (i = 0; i < pgdat->nr_zones; i++)
temp_priority[i] = DEF_PRIORITY;
@@ -2087,11 +2094,17 @@ out:
zone->prev_priority = temp_priority[i];
}
- if (!all_zones_ok) {
- cond_resched();
- try_to_freeze();
+ cond_resched();
+ try_to_freeze();
+ /*
+ * restart if someone wants a larger 'order' allocation
+ */
+ if (order < pgdat->kswapd_max_order)
+ goto loop_again;
+
+ if (!all_zones_ok) {
/*
* Fragmentation may mean that the system cannot be
* rebalanced for high-order allocations in all zones.
@@ -2130,7 +2143,6 @@ out:
*/
static int kswapd(void *p)
{
- unsigned long order;
pg_data_t *pgdat = (pg_data_t*)p;
struct task_struct *tsk = current;
DEFINE_WAIT(wait);
@@ -2160,32 +2172,17 @@ static int kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
set_freezable();
- order = 0;
for ( ; ; ) {
- unsigned long new_order;
-
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
- new_order = pgdat->kswapd_max_order;
- pgdat->kswapd_max_order = 0;
- if (order < new_order) {
- /*
- * Don't sleep if someone wants a larger 'order'
- * allocation
- */
- order = new_order;
- } else {
- if (!freezing(current))
- schedule();
-
- order = pgdat->kswapd_max_order;
- }
+ if (!freezing(current))
+ schedule();
finish_wait(&pgdat->kswapd_wait, &wait);
if (!try_to_freeze()) {
/* We can speed up thawing tasks if we don't call
* balance_pgdat after returning from the refrigerator
*/
- balance_pgdat(pgdat, order);
+ balance_pgdat(pgdat);
}
}
return 0;
--
1.6.2.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related
* Re: [PATCH 4/5] page allocator: Pre-emptively wake kswapd when high-order watermarks are hit
From: KOSAKI Motohiro @ 2009-10-27 2:42 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Frans Pop, Jiri Kosina, Sven Geggus,
Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
Mohamed Abbas, Jens Axboe, John W. Linville, Pekka Enberg,
Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
Stephan von Krawczynski, Kernel Testers List, netdev
In-Reply-To: <1256221356-26049-5-git-send-email-mel@csn.ul.ie>
> When a high-order allocation fails, kswapd is kicked so that it reclaims
> at a higher-order to avoid direct reclaimers stall and to help GFP_ATOMIC
> allocations. Something has changed in recent kernels that affect the timing
> where high-order GFP_ATOMIC allocations are now failing with more frequency,
> particularly under pressure.
>
> This patch pre-emptively checks if watermarks have been hit after a
> high-order allocation completes successfully. If the watermarks have been
> reached, kswapd is woken in the hope it fixes the watermarks before the
> next GFP_ATOMIC allocation fails.
>
> Warning, this patch is somewhat of a band-aid. If this makes a difference,
> it still implies that something has changed that is either causing more
> GFP_ATOMIC allocations to occur (such as the case with iwlagn wireless
> driver) or make them more likely to fail.
hmm, I'm confused. this description addressed generic high order allocation.
but,
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> mm/page_alloc.c | 33 ++++++++++++++++++++++-----------
> 1 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7f2aa3e..851df40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1596,6 +1596,17 @@ try_next_zone:
> return page;
> }
>
> +static inline
> +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> + enum zone_type high_zoneidx)
> +{
> + struct zoneref *z;
> + struct zone *zone;
> +
> + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> + wakeup_kswapd(zone, order);
> +}
> +
> static inline int
> should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> @@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
__alloc_pages_high_priority() is only called if ALLOC_NO_WATERMARKS.
ALLOC_NO_WATERMARKS mean PF_MEMALLOC or TIF_MEMDIE and GFP_ATOMIC don't make
nested alloc_pages() (= don't make PF_MEMALLOC case).
Then, I haven't understand why this patch improve iwlagn GFP_ATOMIC case.
hmm, maybe I missed something. I see the code again tommorow.
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> } while (!page && (gfp_mask & __GFP_NOFAIL));
>
> - return page;
> -}
> -
> -static inline
> -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> - enum zone_type high_zoneidx)
> -{
> - struct zoneref *z;
> - struct zone *zone;
> + /*
> + * If after a high-order allocation we are now below watermarks,
> + * pre-emptively kick kswapd rather than having the next allocation
> + * fail and have to wake up kswapd, potentially failing GFP_ATOMIC
> + * allocations or entering direct reclaim
> + */
> + if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
> + preferred_zone->watermark[ALLOC_WMARK_LOW],
> + zone_idx(preferred_zone), ALLOC_WMARK_LOW))
> + wake_all_kswapd(order, zonelist, high_zoneidx);
>
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - wakeup_kswapd(zone, order);
> + return page;
> }
>
> static inline int
> --
> 1.6.3.3
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 5/5] ONLY-APPLY-IF-STILL-FAILING Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
From: KOSAKI Motohiro @ 2009-10-27 2:42 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Frans Pop, Jiri Kosina, Sven Geggus,
Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
David Miller, Reinette Chatre, Kalle Valo, David Rientjes,
Mohamed Abbas, Jens Axboe, John W. Linville, Pekka Enberg,
Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
Stephan von Krawczynski, Kernel Testers List, netdev
In-Reply-To: <1256221356-26049-6-git-send-email-mel@csn.ul.ie>
> Testing by Frans Pop indicates that in the 2.6.30..2.6.31 window at
> least that the commits 373c0a7e 8aa7e847 dramatically increased the
> number of GFP_ATOMIC failures that were occuring within a wireless
> driver. It was never isolated which of the changes was the exact problem
> and it's possible it has been fixed since. If problems are still
> occuring with GFP_ATOMIC in 2.6.31-rc5, then this patch should be
> applied to determine if the congestion_wait() callers are still broken.
Oops. no, please no.
8aa7e847 is regression fixing commit. this revert indicate the regression
occur again.
if we really need to revert it, we need to revert 1faa16d2287 too.
however, I doubt this commit really cause regression to iwlan. IOW,
I agree Jens.
I hope to try reproduce this problem on my test environment. Can anyone
please explain reproduce way?
Is special hardware necessary?
----------------------------------------------------
commit 8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0
Author: Jens Axboe <jens.axboe@oracle.com>
Date: Thu Jul 9 14:52:32 2009 +0200
Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Eric Dumazet @ 2009-10-27 2:45 UTC (permalink / raw)
To: Stephen Hemminger, Al Viro
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel
In-Reply-To: <20091026153656.25be4369@nehalam>
Stephen Hemminger <shemminger@vyatta.com>, Al Viro a écrit :
> --- a/include/linux/dcache.h 2009-10-26 14:58:45.220347300 -0700
> +++ b/include/linux/dcache.h 2009-10-26 15:12:15.004160122 -0700
> @@ -45,15 +45,28 @@ struct dentry_stat_t {
> };
> extern struct dentry_stat_t dentry_stat;
>
> -/* Name hashing routines. Initial hash value */
> -/* Hash courtesy of the R5 hash in reiserfs modulo sign bits */
> -#define init_name_hash() 0
> +/*
> + * Fowler / Noll / Vo (FNV) Hash
> + * see: http://www.isthe.com/chongo/tech/comp/fnv/
> + */
> +#ifdef CONFIG_64BIT
> +#define FNV_PRIME 1099511628211ull
> +#define FNV1_INIT 14695981039346656037ull
> +#else
> +#define FNV_PRIME 16777619u
> +#define FNV1_INIT 2166136261u
> +#endif
> +
> +#define init_name_hash() FNV1_INIT
>
> -/* partial hash update function. Assume roughly 4 bits per character */
> +/* partial hash update function. */
> static inline unsigned long
> -partial_name_hash(unsigned long c, unsigned long prevhash)
> +partial_name_hash(unsigned char c, unsigned long prevhash)
> {
> - return (prevhash + (c << 4) + (c >> 4)) * 11;
> + prevhash ^= c;
> + prevhash *= FNV_PRIME;
> +
> + return prevhash;
> }
>
> /*
OK, but thats strlen(name) X (long,long) multiplies.
I suspect you tested on recent x86_64 cpu.
Some arches might have slow multiplies, no ?
jhash() (and others) are optimized by compiler to use basic and fast operations.
jhash operates on blocs of 12 chars per round, so it might be a pretty good choice once
out-of-line (because its pretty large and full_name_hash() is now used by
a lot of call sites)
Please provide your test program source, so that other can test on various arches.
Thanks
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Stephen Hemminger @ 2009-10-27 3:53 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel
In-Reply-To: <4AE65EDE.8080605@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2064 bytes --]
On Tue, 27 Oct 2009 03:45:50 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Stephen Hemminger <shemminger@vyatta.com>, Al Viro a écrit :
> > --- a/include/linux/dcache.h 2009-10-26 14:58:45.220347300 -0700
> > +++ b/include/linux/dcache.h 2009-10-26 15:12:15.004160122 -0700
> > @@ -45,15 +45,28 @@ struct dentry_stat_t {
> > };
> > extern struct dentry_stat_t dentry_stat;
> >
> > -/* Name hashing routines. Initial hash value */
> > -/* Hash courtesy of the R5 hash in reiserfs modulo sign bits */
> > -#define init_name_hash() 0
> > +/*
> > + * Fowler / Noll / Vo (FNV) Hash
> > + * see: http://www.isthe.com/chongo/tech/comp/fnv/
> > + */
> > +#ifdef CONFIG_64BIT
> > +#define FNV_PRIME 1099511628211ull
> > +#define FNV1_INIT 14695981039346656037ull
> > +#else
> > +#define FNV_PRIME 16777619u
> > +#define FNV1_INIT 2166136261u
> > +#endif
> > +
> > +#define init_name_hash() FNV1_INIT
> >
> > -/* partial hash update function. Assume roughly 4 bits per character */
> > +/* partial hash update function. */
> > static inline unsigned long
> > -partial_name_hash(unsigned long c, unsigned long prevhash)
> > +partial_name_hash(unsigned char c, unsigned long prevhash)
> > {
> > - return (prevhash + (c << 4) + (c >> 4)) * 11;
> > + prevhash ^= c;
> > + prevhash *= FNV_PRIME;
> > +
> > + return prevhash;
> > }
> >
> > /*
>
> OK, but thats strlen(name) X (long,long) multiplies.
>
> I suspect you tested on recent x86_64 cpu.
>
> Some arches might have slow multiplies, no ?
>
> jhash() (and others) are optimized by compiler to use basic and fast operations.
> jhash operates on blocs of 12 chars per round, so it might be a pretty good choice once
> out-of-line (because its pretty large and full_name_hash() is now used by
> a lot of call sites)
>
> Please provide your test program source, so that other can test on various arches.
>
> Thanks
long on i386 is 32 bits so it is 32 bit multiply. There is also an optimization
that uses shift and adds.
--
[-- Attachment #2: hashtest.tar.bz2 --]
[-- Type: application/x-bzip, Size: 7585 bytes --]
^ permalink raw reply
* RE: [PATCH] TI DaVinci EMAC: Minor macro related updates
From: Chaithrika U S @ 2009-10-27 3:57 UTC (permalink / raw)
To: 'Jean-Christophe PLAGNIOL-VILLARD'
Cc: netdev, davem, davinci-linux-open-source
In-Reply-To: <20091026220722.GA23415@game.jcrosoft.org>
On Tue, Oct 27, 2009 at 03:37:22, Jean-Christophe PLAGNIOL-VILLARD wrote:
> On 16:25 Thu 01 Oct , Chaithrika U S wrote:
> > Use BIT for macro definitions wherever possible, remove
> > unused and redundant macros.
> >
> > Signed-off-by: Chaithrika U S <chaithrika@ti.com>
> > ---
> > Applies to Linus' kernel tree
> do you plan to send a new version soon?
>
> as the current DaVinci EMAC does not build on the v2.6.32-rc5
>
> Best Regards,
> J.
>
DaVinci EMAC builds and work fine on Linus' tree and DaVinci GIT tree.
Can you please provide more info regarding the errors you are seeing?
Regards,
Chaithrika
^ permalink raw reply
* Re: [PATCH v2 8/8] Document future removal of sysctl_tcp_* options
From: Bill Fink @ 2009-10-27 5:09 UTC (permalink / raw)
To: Gilad Ben-Yossef
Cc: Eric Dumazet, William Allen Simpson, netdev, Ilpo Järvinen
In-Reply-To: <4AE5C579.2070104@codefidence.com>
On Mon, 26 Oct 2009, Gilad Ben-Yossef wrote:
> Bill Fink wrote:
>
> >> OK. It really sounds like we should go with my first suggestion: global
> >> sysctl based kill switches, just as we have now and in addition, the
> >> ability to kill TCP options per route. The TCP option will be used if
> >> and only if both kill switches (global and per route) are not set.
> >>
> >
> > This wording is confusing. The global kill switch not being set
> > really means that the sysctl is set. And this assumes the per-route
> > default is not set. Correct?
> >
> Now it is my turn to get confused, because I didn't understand your
> question :-)
My question was just if I understood you correctly, which apparently
I did.
> What I suggest is to leave the sysctl exactly as they are now:
>
> - You leave them be (value of 1), the respective TCP option is
> supported. This is the default.
> - You turn them off (write 0), the respective TCP option is not supported.
>
> What I suggest to *add* is the following ability:
>
> - If you have the TCP option support turned on (default, value of one),
> you can turn support for the option for a specific route using a ip
> route option.
Should be "turn" -> "turn off" above.
> Hope that made it clearer.
Yes.
> >> What we achieve is:
> >>
> >> 1. Global kill switches work exactly as they do now, whether you use the
> >> new per route options or not, so backwards compatible.
> >>
> >> 2. In addition, if the global kill switch is not in effect, you can also
> >> kill the options on a per route basis.
> >>
> >> I'm going to send third version of the patch to this effect, minus the
> >> new remote DoS possibility that Ilpo pointed out and leaving the global
> >> sysctl kill switches be.
> >>
> >> If you like it, please ACK ;-)
> >>
> >
> > IIUC this doesn't seem right to me. I believe the global setting
> > should be a default and the route specific an override. Your scheme
> > would mean that if I set a global option to disable timestamps, then
> > I couldn't enable timestamps on specific routes using the per route
> > setting.
> >
> Yes. You understand my intention perfectly.
>
> Let me try to explain why I believe this is the correct behavior to
> implement:
>
> 1. This is the closest thing to what we have now. Today you write 0 to
> the sysctl and that TCP option is turned off globally. Period. My
> suggestion leaves this behavior as is now regardless if you've used per
> route settings. The other way make a very subtle change in the meaning
> of writing 0 to the sysctl.
Continuing to support existing behavior is the most important
consideration for me, so your intention for the per route settings
doesn't cause me any major grief. I initially thought the per route
setting as an override to a global default setting was more intuitive
than your method of it being a disable switch only, but my thinking
has since evolved (see below).
> I believe very subtle changes to meaning of long established interfaces
> is bad way to go. It's better to change interfaces on users, but it is
> even worse to maek something that they have long used do something just
> slightly different.
I didn't see how my suggestion changed existing behavior. If you
didn't use the per route settings everything remained as it was
(or so I initially thought). And if you were using a new feature
then you needed to be aware of its semantics.
> 2. If the per route options needs to be "default, of or off" instead of
> "on or off", we'd need to move from 1 bit to store the option to, well
> 2s bit in theory, but probably 32 bits in practice, since we can't use
> RTAX_FEATURES any longer.
>
> Yes, we can invent RTAX_FEATURES_TWO_BITS or some such, but I'd say that
> is ugly :-)
I only intended a single bit for the global default and per route
settings. The global default would just be enabled (1) or disabled(0).
The per route setting would be inherited from the global default setting,
and also would be just enabled(1) or disabled(0).
Explicitly setting a per route setting to enabled(1) or disabled(0)
would override the initial default setting.
However I didn't fully work through all the ramifications of this
idea, and I now realize this would entail some changes to the existing
behavior of the global setting, which I agree is unacceptable.
I therefore now believe your latest proposal is a better approach
for maintaining backward compatibility and avoiding any nasty
unexpected surprises to existing working configurations.
> 3. I believe that the scenario of needing to set the support of a TCP
> option globally off and just turn it on for a specific route is not very
> likely to be needed and losing it is a small price to pay for 1 + 2.
I agree this is probably an uncommon scenario.
> > And it also doesn't seem to address Eric's scenario. If I understand
> > his concern correctly, what seems to be needed is a third global
> > reset value (not calling it a setting since the actual global setting
> > wouldn't be changed), which would reset any per-route override settings
> > to the global default setting.
> >
> >
> Well, I do not believe this is what Eric meant (Eric?) but if it is then
> I fail to see why
> to require from the per route TCP options switches what is not required
> of any other
> route specific option already existing, since AFAIK we don't have a
> "reset to default values" to the other options already supported.
I won't try and speak for Eric anymore.
> Having said all that, I have no issue with re-spinning the patch with
> your suggestion.
> I don't feel all that much which is the correct way- I just want to get
> as much feedback as possible
> since I'm suggesting to add a new user interface options and we all know
> it is very hard to back peddle
> on those, so I'm trying to make sure to get enough feedback to do it
> right the firs time.
>
> So any feedback from anyone regarding favorite interface? it seems each
> person fancy a different one :-)
I haven't checked the details of your patch, but I am now fine with
the concepts of the patch as you most recently presented them.
-Bill
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Stephen Hemminger @ 2009-10-27 5:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
In-Reply-To: <9986527.24561256620662709.JavaMail.root@tahiti.vyatta.com>
One of the root causes of slowness in network usage
was my original choice of power of 2 for hash size, to avoid
a mod operation. It turns out if size is not a power of 2
the original algorithm works fairly well.
On slow cpu; with 10million entries and 211 hash size
Algorithm Time Ratio Max StdDev
string10 1.271871 1.00 47397 0.01
djb2 1.406322 1.00 47452 0.12
SuperFastHash 1.422348 1.00 48400 1.99
string_hash31 1.424079 1.00 47437 0.08
jhash_string 1.459232 1.00 47954 1.01
sdbm 1.499209 1.00 47499 0.22
fnv32 1.539341 1.00 47728 0.75
full_name_hash 1.556792 1.00 47412 0.04
string_hash17 1.719039 1.00 47413 0.05
pjw 1.827365 1.00 47441 0.09
elf 2.033545 1.00 47441 0.09
fnv64 2.199533 1.00 47666 0.53
crc 5.705784 1.00 47913 0.95
md5_string 10.308376 1.00 47946 1.00
fletcher 1.418866 1.01 53189 18.65
adler32 2.842117 1.01 53255 18.79
kr_hash 1.175678 6.43 468517 507.44
xor 1.114692 11.02 583189 688.96
lastchar 0.795316 21.10 1000000 976.02
How important is saving the one division, versus getting better
distribution.
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: David Miller @ 2009-10-27 5:24 UTC (permalink / raw)
To: stephen.hemminger
Cc: eric.dumazet, akpm, torvalds, opurdila, netdev, linux-kernel,
viro
In-Reply-To: <19864844.24581256620784317.JavaMail.root@tahiti.vyatta.com>
From: Stephen Hemminger <stephen.hemminger@vyatta.com>
Date: Mon, 26 Oct 2009 22:19:44 -0700 (PDT)
> How important is saving the one division, versus getting better
> distribution.
80 cpu cycles or more on some processors. Cheaper to use
jenkins with a power-of-2 sized hash.
^ permalink raw reply
* Re: [patch next 3/4] netxen: fix bonding support
From: Eric W. Biederman @ 2009-10-27 5:50 UTC (permalink / raw)
To: Dhananjay Phadke; +Cc: netdev@vger.kernel.org
In-Reply-To: <7608421F3572AB4292BB2532AE89D5658B0B6E4F74@AVEXMB1.qlogic.org>
Dhananjay Phadke <dhananjay.phadke@qlogic.com> writes:
>> Yes. That should prevent the null pointer deference. Will it also
>> allow setting the mac address when the NIC is down?
>>
>> Eric
>
> Yes, we do save new address in netdev->dev_addr.
> This is later on programmed in hardware when interface is brought up.
Yep. Thanks. I just tested it and confirmed the crash is no longer there.
Eric
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Eric Dumazet @ 2009-10-27 6:07 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
In-Reply-To: <19864844.24581256620784317.JavaMail.root@tahiti.vyatta.com>
Stephen Hemminger a écrit :
> One of the root causes of slowness in network usage
> was my original choice of power of 2 for hash size, to avoid
> a mod operation. It turns out if size is not a power of 2
> the original algorithm works fairly well.
Interesting, but I suspect all users have power of 2 tables :(
>
> On slow cpu; with 10million entries and 211 hash size
>
>
>
> How important is saving the one division, versus getting better
> distribution.
unsigned int fold1(unsigned hash)
{
return hash % 211;
}
Compiler uses a reciprocal divide because of 211 being a constant.
And you also could try following that contains one multiply only,
and check if hash distribution properties are still OK
unsigned int fold2(unsigned hash)
{
return ((unsigned long long)hash * 211) >> 32;
}
fold1:
movl 4(%esp), %ecx
movl $-1689489505, %edx
movl %ecx, %eax
mull %edx
shrl $7, %edx
imull $211, %edx, %edx
subl %edx, %ecx
movl %ecx, %eax
ret
fold2:
movl $211, %eax
mull 4(%esp)
movl %edx, %eax
ret
^ permalink raw reply
* RE: [patch next 3/4] netxen: fix bonding support
From: Dhananjay Phadke @ 2009-10-27 6:06 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: netdev@vger.kernel.org
In-Reply-To: <m1aazdl713.fsf@fess.ebiederm.org>
Eric,
Thanks for reporting and verifying.
-Dhananjay
________________________________________
From: Eric W. Biederman [ebiederm@xmission.com]
Sent: Monday, October 26, 2009 10:50 PM
To: Dhananjay Phadke
Cc: netdev@vger.kernel.org
Subject: Re: [patch next 3/4] netxen: fix bonding support
Dhananjay Phadke <dhananjay.phadke@qlogic.com> writes:
>> Yes. That should prevent the null pointer deference. Will it also
>> allow setting the mac address when the NIC is down?
>>
>> Eric
>
> Yes, we do save new address in netdev->dev_addr.
> This is later on programmed in hardware when interface is brought up.
Yep. Thanks. I just tested it and confirmed the crash is no longer there.
Eric
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Eric Dumazet @ 2009-10-27 6:50 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
In-Reply-To: <4AE68E23.20205@gmail.com>
Eric Dumazet a écrit :
> unsigned int fold2(unsigned hash)
> {
> return ((unsigned long long)hash * 211) >> 32;
> }
>
I tried this reciprocal thing with 511 and 1023 values and got on a PIII 550 MHz, gcc-3.3.2 :
# ./hashtest 100000 511
jhash_string 0.033123 1.01 234 1.06
fnv32 0.033911 1.02 254 1.38
# ./hashtest 1000000 511
jhash_string 0.331155 1.00 2109 1.10
fnv32 0.359346 1.00 2151 1.65
# ./hashtest 10000000 511
jhash_string 3.383340 1.00 19985 1.03
fnv32 3.849359 1.00 20198 1.53
# ./hashtest 100000 1023
jhash_string 0.033123 1.03 134 1.01
fnv32 0.034260 1.03 142 1.32
# ./hashtest 1000000 1023
jhash_string 0.332329 1.00 1075 1.06
fnv32 0.422035 1.00 1121 1.59
# ./hashtest 10000000 1023
jhash_string 3.417559 1.00 10107 1.01
fnv32 3.747563 1.00 10223 1.35
511 value on 64bit, and 1023 on 32bit arches are nice because
hashsz * sizeof(pointer) <= 4096, wasting space for one pointer only.
Conclusion : jhash and 511/1023 hashsize for netdevices,
no divides, only one multiply for the fold.
^ permalink raw reply
* Re: [PATCH] net: Adjust softirq raising in __napi_schedule
From: Johannes Berg @ 2009-10-27 7:01 UTC (permalink / raw)
To: Tilman Schmidt
Cc: Jarek Poplawski, David Miller,
hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linux-ppp-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
paulus-eUNUBHrolfbYtjvyW6yDsg, Michael Buesch, Oliver Hartkopp
In-Reply-To: <4AE64441.7060008-ZTO5kqT2PaM@public.gmane.org>
[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]
On Tue, 2009-10-27 at 01:52 +0100, Tilman Schmidt wrote:
> > Any code (say ISDN code) that calls netif_rx() is clearly assuming to
> > always be running in (soft)irq context, otherwise it couldn't call
> > netif_rx() unconditionally. Agree so far?
>
> Well, in fact I'm not sure. :-) All I know is that in the ISDN case, no
> such assumption is explicitly stated anywhere. (The code in question is
> called from the rcvcallb_skb() callback method which the hardware driver
> calls when data has been received, and the description of that method in
> Documentation/isdn/INTERFACE does not say anything about the context in
> which it may be called.) The relevant code in drivers/isdn/i4l/isdn_ppp.c
> is rather old, perhaps even older than softirqs and the netif_rx() /
> netif_rx_ni() split. (Bear in mind that we are talking about the old
> ISDN4Linux subsystem which initially didn't even make it into the 2.6
> series because it was considered obsolete.) It seems quite possible to me
> that just no one ever thought about that question.
Heh :)
> > So now if you change the ISDN code to call netif_rx_ni(), you've changed
> > the assumption that the ISDN code makes -- that it is running in
> > (soft)irq context. Therefore, you need to verify that this is actually a
> > correct change, which is what I tried to say.
>
> Understood. However, the fact that the local_softirq_pending message is
> appearing would seem to indicate that this assumption was wrong to
> begin with, wouldn't it?
I thought it only recently started appearing with a new driver or
something, but I may have misunderstood you. Anyway, I think that sums
up the issue from my POV.
johannes
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Eric Dumazet @ 2009-10-27 7:29 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
In-Reply-To: <4AE69829.9070207@gmail.com>
Eric Dumazet a écrit :
>
>
> 511 value on 64bit, and 1023 on 32bit arches are nice because
> hashsz * sizeof(pointer) <= 4096, wasting space for one pointer only.
>
> Conclusion : jhash and 511/1023 hashsize for netdevices,
> no divides, only one multiply for the fold.
Just forget about 511 & 1023, as power of two works too.
-> 512 & 1024 + jhash
Guess what, David already said this :)
^ permalink raw reply
* Re: Increasing TCP initial cwnd
From: Ilpo Järvinen @ 2009-10-27 7:38 UTC (permalink / raw)
To: Yair Gottdenker; +Cc: Netdev
In-Reply-To: <fa3cddca0910260807q714a24fdp1ffae601bb6f67ba@mail.gmail.com>
On Mon, 26 Oct 2009, Yair Gottdenker wrote:
> I already had a look in it. Their changes that are related to changing
> the initial cwnd (snd_cwnd) are in several places in the patch:
Right. I know what I've modified in that patch... ;-)
> 1. tcp_ipv4.c in function tcp_init_cwnd. since currently I am working
> with no metric this is irrelevant.
You managed to mix up this filename twice in a row in a different way :-).
...That function resides in tcp_input.c instead. And changes in
tcp_init_metrics are not completely irrelevant, even with no metrics it
is executed as "no metrics" works a bit differenctly than you seem to
think.
> 2. tcp_ipv4.c: in function tcp_v4_init_sock, which I applied
> 3. tcp_minisocks.c - which I applied
> . They patched kernel 2.6.18.8 which is very old and there were a lot
> of changes since that in the kernel tcp layer. I didn't test their
> patch but applying the same logic to 2.6.31.3 doesn't seems to change
> initial window size.
>
> It is clear that in order to correctly make the change to the initial
> cwnd, one should apply changes to many more scenarios like setting the
> initial cwnd after idle time and more. I want to proceed step by step,
> first the naive changes to make it work.
Hmm, it now rings a bell... You cannot directly set initial window to a
large value (or you can but that doesn't do anything on the wire) because
you hit first the limit of receiver advertised window that applies
auto-tuning (with a starting value was four IIRC).
--
i.
^ permalink raw reply
* next tree for October 27: znet build failure
From: Sachin Sant @ 2009-10-27 9:35 UTC (permalink / raw)
To: linux-next; +Cc: Stephen Rothwell, netdev, David Miller
In-Reply-To: <20091027190850.9694fb39.sfr@canb.auug.org.au>
Fails to build on x86_32 with lots of errors related to znet
drivers/net/znet.c:107:29: error: wireless/i82593.h: No such file or directory
drivers/net/znet.c:133: error: field 'i593_init' has incomplete type
drivers/net/znet.c: In function 'znet_set_multicast_list':
drivers/net/znet.c:235: error: invalid application of 'sizeof' to incomplete type 'struct i82593_conf_block'
drivers/net/znet.c:247: error: dereferencing pointer to incomplete type
.....
Commit 879e9304... moved wireless/i82593.h to drivers/staging/wavelan/i82593.h
Thanks
-Sachin
--
---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------
^ permalink raw reply
* Re: performance regression in virtio-net in 2.6.32-rc4
From: Avi Kivity @ 2009-10-27 9:41 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Rusty Russell, virtualization, kvm, netdev
In-Reply-To: <20091026184835.GB26473@redhat.com>
On 10/26/2009 08:48 PM, Michael S. Tsirkin wrote:
> Hi!
> I noticed a performance regression in virtio net: going from
> 2.6.31 to 2.6.32-rc4 I see this, for guest to host communication:
>
> Any tips on debugging this?
>
Lacking better advice, a bisect can help as a last resort. 'git bisect
start -- drivers/net drivers/virtio' will probably find it fastest.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: [PATCH] vlan: allow VLAN ID 0 to be used
From: Benny Amorsen @ 2009-10-27 9:52 UTC (permalink / raw)
To: Eric Dumazet
In-Reply-To: <4AE5CAC6.4000604@gmail.com>
Eric Dumazet <eric.dumazet@gmail.com> writes:
> Here is the patch I cooked that permitted VLAN 0 to be used with tg3
> (and other HW accelerated vlan nics I suppose)
>
> [PATCH] vlan: allow VLAN ID 0 to be used
>
> We currently use a 16 bit field (vlan_tci) to store VLAN ID on a skb.
>
> 0 value is used a special value, meaning VLAN ID not set.
> This forbids use of VLAN ID 0
Are you sure you actually want to do this?
VLAN 0 IS special. Frames received on VLAN 0 should be treated just as
if they had no VLAN tag at all, except that they have an 802.1p value.
Sending frames with VLAN 0 should have something to do with whether
the sender wants to use 802.1p, which doesn't really have much to do
with VLAN's at all...
It would be nice if the unsuspecting user was at least warned that their
use of VLAN 0 is non-standard and may cause surprising results like
leakage into the "native" VLAN. That could be done in /sbin/ip or
/sbin/vconfig, of course.
/Benny
^ permalink raw reply
* Re: [PATCH] vlan: allow VLAN ID 0 to be used
From: Eric Dumazet @ 2009-10-27 10:02 UTC (permalink / raw)
To: Benny Amorsen
Cc: Gertjan Hofman, Matt Carlson, netdev@vger.kernel.org,
Patrick McHardy, David S. Miller
In-Reply-To: <m3aazdb1ue.fsf@ursa.amorsen.dk>
Benny Amorsen a écrit :
> Eric Dumazet <eric.dumazet@gmail.com> writes:
>
>> Here is the patch I cooked that permitted VLAN 0 to be used with tg3
>> (and other HW accelerated vlan nics I suppose)
>>
>> [PATCH] vlan: allow VLAN ID 0 to be used
>>
>> We currently use a 16 bit field (vlan_tci) to store VLAN ID on a skb.
>>
>> 0 value is used a special value, meaning VLAN ID not set.
>> This forbids use of VLAN ID 0
>
> Are you sure you actually want to do this?
>
> VLAN 0 IS special. Frames received on VLAN 0 should be treated just as
> if they had no VLAN tag at all, except that they have an 802.1p value.
>
> Sending frames with VLAN 0 should have something to do with whether
> the sender wants to use 802.1p, which doesn't really have much to do
> with VLAN's at all...
>
> It would be nice if the unsuspecting user was at least warned that their
> use of VLAN 0 is non-standard and may cause surprising results like
> leakage into the "native" VLAN. That could be done in /sbin/ip or
> /sbin/vconfig, of course.
>
Quotting http://en.wikipedia.org/wiki/IEEE_802.1Q
VLAN Identifier (VID): a 12-bit field specifying the VLAN to which the frame belongs.
A value of 0 means that the frame doesn't belong to any VLAN; in this case the 802.1Q
tag specifies only a priority and is referred to as a priority tag.
A value of hex FFF is reserved for implementation use.
All other values may be used as VLAN identifiers, allowing up to 4094 VLANs
So we expect to generate a 802.1Q frame, even with a VID=0 field.
Before patch, device sends a non 802.1Q frame, which is not what was wanted by user.
(Maybe he wants to check its device/network is able to transport 1522 bytes frames, who knows...)
To use non tagged frames, user selects eth0 device, and to send tagged frames, he selects eth0.0
Now, maybe eth0 and eth0.0 should share same IP addresses, because incoming frame
with ID=0 tag should be received by eth0 device, but I am not sure standard requires this.
^ permalink raw reply
* Re: [PATCH 6/9] ser_gigaset: checkpatch cleanup
From: Karsten Keil @ 2009-10-27 10:20 UTC (permalink / raw)
To: isdn4linux
Cc: Tilman Schmidt, Joe Perches, netdev, linux-kernel, i4ldeveloper,
Hansjoerg Lipp, David Miller
In-Reply-To: <4AE637D8.60809@imap.cc>
On Dienstag, 27. Oktober 2009 00:59:20 Tilman Schmidt wrote:
> Am 26.10.2009 01:54 schrieb Joe Perches:
> > On Sun, 2009-10-25 at 20:30 +0100, Tilman Schmidt wrote:
> >> Duly uglified as demanded by checkpatch.pl.
> >> diff --git a/drivers/isdn/gigaset/ser-gigaset.c
> >> b/drivers/isdn/gigaset/ser-gigaset.c index 3071a52..ac3409e 100644
> >> --- a/drivers/isdn/gigaset/ser-gigaset.c
> >> +++ b/drivers/isdn/gigaset/ser-gigaset.c
> >> @@ -164,9 +164,15 @@ static void gigaset_modem_fill(unsigned long data)
> >> {
> >> struct cardstate *cs = (struct cardstate *) data;
> >> struct bc_state *bcs;
> >> + struct sk_buff *nextskb;
> >> int sent = 0;
> >>
> >> - if (!cs || !(bcs = cs->bcs)) {
> >> + if (!cs) {
> >> + gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
> >> + return;
> >> + }
> >> + bcs = cs->bcs;
> >> + if (!bcs) {
> >> gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
> >> return;
> >> }
> >
> > perhaps:
> > if (!cs || !cs->bcs) {
> > gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
> > return;
> > }
> > bcs = cs->bcs;
>
> That would evaluate cs->bcs twice, and is also, in my experience,
gcc should handle this subsequent double evaluation well enough.
> significantly more prone to easily overlooked typos which result in
> checking a different pointer in the if statement than the one that's
> actually used in the subsequent assignment.
>
Yes this may happen, but more often a = in if statements should be a ==.
The kernel code style says only one statement per line, which implies no
assignments in if statements, so we should follow this.
The checkpatch.pl script complain about these issues.
Yes sometimes in the past (while preparing some old code for kernel submit)
I was not very happy about all these rules, it take lot time to reach zero
reports from checkpatch.
But it make lot of sense in the long term, currently I have to debug code
written without any code style, it is really, really painful and I need 5-10
times more time for simple understanding the basic function of the code.
Karsten
^ permalink raw reply
* Re: [PATCH 5/5] ONLY-APPLY-IF-STILL-FAILING Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
From: Frans Pop @ 2009-10-27 10:29 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Mel Gorman, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Rafael J. Wysocki, David Miller, Reinette Chatre,
Kalle Valo, David Rientjes, Mohamed Abbas, Jens Axboe,
John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
netdev, linux-kernel, linux-mm@kvack.org
In-Reply-To: <20091026235628.2F7B.A69D9226@jp.fujitsu.com>
On Tuesday 27 October 2009, KOSAKI Motohiro wrote:
> Oops. no, please no.
> 8aa7e847 is regression fixing commit. this revert indicate the
> regression occur again.
> if we really need to revert it, we need to revert 1faa16d2287 too.
> however, I doubt this commit really cause regression to iwlan. IOW,
> I agree Jens.
This is not intended as a patch for mainline, but just as a test to see if
it improves things. It may be a regression fix, but it also creates a
significant change in behavior during swapping in my test case.
If a fix is needed, it will probably by different from this revert.
Please read: http://lkml.org/lkml/2009/10/26/510.
This mail has some data: http://lkml.org/lkml/2009/10/26/455.
> I hope to try reproduce this problem on my test environment. Can anyone
> please explain reproduce way?
Please see my mails in this thread for bug #14141:
http://thread.gmane.org/gmane.linux.kernel/896714
You will probably need to read some of them to understand the context of
the two mails linked above.
The most relevant ones are (all from the same thread; not sure why gmane
gives such weird links):
http://article.gmane.org/gmane.linux.kernel.mm/39909
http://article.gmane.org/gmane.linux.kernel.kernel-testers/7228
http://article.gmane.org/gmane.linux.kernel.kernel-testers/7165
> Is special hardware necessary?
Not special hardware, but you may need an encrypted partition and NFS; the
test may need to be modified according to the amount of memory you have.
I think it should be possible to reproduce the freezes I see while ignoring
the SKB allocation errors as IMO those are just a symptom, not the cause.
So you should not need wireless.
The severity of the freezes during my test often increases if the test is
repeated (without rebooting).
Cheers,
FJP
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Mel Gorman @ 2009-10-27 10:40 UTC (permalink / raw)
To: reinette chatre
Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Rafael J. Wysocki, David Miller, Kalle Valo,
David Rientjes, KOSAKI Motohiro, Abbas, Mohamed, Jens Axboe,
John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
In-Reply-To: <1256226219.21134.1493.camel@rc-desk>
On Thu, Oct 22, 2009 at 08:43:38AM -0700, reinette chatre wrote:
> On Thu, 2009-10-22 at 07:22 -0700, Mel Gorman wrote:
> > [Bug #14141] order 2 page allocation failures in iwlagn
> > Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
> > allocations within the wireless driver. This has caused large numbers
> > of failure reports to occur as reported by Frans Pop. Fixing this
> > requires changes to the driver if it wants to use GFP_ATOMIC which
> > is in the hands of Mohamed Abbas and Reinette Chatre. However,
> > it is very likely that it has being compounded by core mm changes
> > that this series is aimed at.
>
> Driver has been changed to allocate paged skb for its receive buffers.
> This reduces amount of memory needed from order-2 to order-1. This work
> is significant and will thus be in 2.6.33.
>
What do you want to do for -stable in 2.6.31?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox