linux-mm.kvack.org archive mirror
* Re: slow performance on disk/network i/o full speed after drop_caches
       [not found] <4E5494D4.1050605@profihost.ag>
@ 2011-08-24  6:20 ` Pekka Enberg
  2011-08-24  9:01   ` Stefan Priebe - Profihost AG
  2011-08-24  9:32   ` Wu Fengguang
  0 siblings, 2 replies; 16+ messages in thread
From: Pekka Enberg @ 2011-08-24  6:20 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: LKML, linux-mm, Andrew Morton, Mel Gorman, Jens Axboe,
	Wu Fengguang, Linux Netdev List

On Wed, Aug 24, 2011 at 9:06 AM, Stefan Priebe - Profihost AG
<s.priebe@profihost.ag> wrote:
> I hope this is the correct list to write to; if not, it would be nice to get
> a hint where I can ask.
>
> Kernel: 2.6.38
>
> I'm seeing some strange problems on some of our servers after upgrading to
> 2.6.38.
>
> I'm copying a 1GB file via scp from Machine A to Machine B. When B is
> freshly booted, the file transfer runs at about 80 to 85 MB/s. I can
> repeat that various times without performance decrease.
>
> Then, after some days, copying only runs at about 900 kB/s up to 3 MB/s,
> going up and down while transferring the file.
>
> When I then do drop_caches, it works again at 80 MB/s.
>
> sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2 && echo 0 > /proc/sys/vm/drop_caches
>
> Attached is also an output of meminfo before and after drop_caches.
>
> What's going on here? MemFree is pretty high.
>
> Please CC me, I'm not on the list.

Interesting. I can imagine one or more of the following to be
involved: networking, vmscan, block, and writeback. Let's CC all of
them!

> # before drop_caches
>
> # cat /proc/meminfo
> MemTotal:        8185544 kB
> MemFree:         6670292 kB
> Buffers:          105164 kB
> Cached:           166672 kB
> SwapCached:            0 kB
> Active:           728308 kB
> Inactive:         567428 kB
> Active(anon):     639204 kB
> Inactive(anon):   394932 kB
> Active(file):      89104 kB
> Inactive(file):   172496 kB
> Unevictable:        2976 kB
> Mlocked:            2992 kB
> SwapTotal:       1464316 kB
> SwapFree:        1464316 kB
> Dirty:                52 kB
> Writeback:             0 kB
> AnonPages:       1026920 kB
> Mapped:            54208 kB
> Shmem:              8380 kB
> Slab:              80724 kB
> SReclaimable:      22844 kB
> SUnreclaim:        57880 kB
> KernelStack:        2872 kB
> PageTables:        35448 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:     5557088 kB
> Committed_AS:    6187972 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      292360 kB
> VmallocChunk:   34359425327 kB
> HardwareCorrupted:     0 kB
> DirectMap4k:        5632 kB
> DirectMap2M:     2082816 kB
> DirectMap1G:     6291456 kB
>
> # cat /proc/meminfo
> MemTotal:        8185544 kB
> MemFree:         6888060 kB
> Buffers:             372 kB
> Cached:            61492 kB
> SwapCached:            0 kB
> Active:           659156 kB
> Inactive:         426664 kB
> Active(anon):     638892 kB
> Inactive(anon):   395200 kB
> Active(file):      20264 kB
> Inactive(file):    31464 kB
> Unevictable:        2976 kB
> Mlocked:            2992 kB
> SwapTotal:       1464316 kB
> SwapFree:        1464316 kB
> Dirty:                 0 kB
> Writeback:             0 kB
> AnonPages:       1026952 kB
> Mapped:            54236 kB
> Shmem:              8316 kB
> Slab:              70616 kB
> SReclaimable:      12264 kB
> SUnreclaim:        58352 kB
> KernelStack:        2864 kB
> PageTables:        35448 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:     5557088 kB
> Committed_AS:    6187932 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      292360 kB
> VmallocChunk:   34359425327 kB
> HardwareCorrupted:     0 kB
> DirectMap4k:        5632 kB
> DirectMap2M:     2082816 kB
> DirectMap1G:     6291456 kB


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-24  6:20 ` slow performance on disk/network i/o full speed after drop_caches Pekka Enberg
@ 2011-08-24  9:01   ` Stefan Priebe - Profihost AG
  2011-08-24  9:33     ` Wu Fengguang
  2011-08-24  9:32   ` Wu Fengguang
  1 sibling, 1 reply; 16+ messages in thread
From: Stefan Priebe - Profihost AG @ 2011-08-24  9:01 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: LKML, linux-mm, Andrew Morton, Mel Gorman, Jens Axboe,
	Wu Fengguang, Linux Netdev List


>> sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2 && echo 0 > /proc/sys/vm/drop_caches

Another way to get it working again is to stop some processes. It could be
mysql, apache, or php fcgi; it doesn't matter, just free some memory.
Although there are already 5GB free.

Stefan


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-24  6:20 ` slow performance on disk/network i/o full speed after drop_caches Pekka Enberg
  2011-08-24  9:01   ` Stefan Priebe - Profihost AG
@ 2011-08-24  9:32   ` Wu Fengguang
  1 sibling, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2011-08-24  9:32 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Stefan Priebe - Profihost AG, LKML, linux-mm@kvack.org,
	Andrew Morton, Mel Gorman, Jens Axboe, Linux Netdev List

On Wed, Aug 24, 2011 at 02:20:07PM +0800, Pekka Enberg wrote:
> On Wed, Aug 24, 2011 at 9:06 AM, Stefan Priebe - Profihost AG
> <s.priebe@profihost.ag> wrote:
> > I hope this is the correct list to write to; if not, it would be nice to get
> > a hint where I can ask.
> >
> > Kernel: 2.6.38
> >
> > I'm seeing some strange problems on some of our servers after upgrading to
> > 2.6.38.
> >
> > I'm copying a 1GB file via scp from Machine A to Machine B. When B is
> > freshly booted, the file transfer runs at about 80 to 85 MB/s. I can
> > repeat that various times without performance decrease.
> >
> > Then, after some days, copying only runs at about 900 kB/s up to 3 MB/s,
> > going up and down while transferring the file.
> >
> > When I then do drop_caches, it works again at 80 MB/s.
> >
> > sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2 && echo 0 > /proc/sys/vm/drop_caches
> >
> > Attached is also an output of meminfo before and after drop_caches.
> >
> > What's going on here? MemFree is pretty high.
> >
> > Please CC me, I'm not on the list.
> 
> Interesting. I can imagine one or more of the following to be
> involved: networking, vmscan, block, and writeback. Let's CC all of
> them!
> 
> > # before drop_caches
> >
> > # cat /proc/meminfo
> > MemTotal:        8185544 kB
> > MemFree:         6670292 kB
> > Buffers:          105164 kB
> > Cached:           166672 kB
> > SwapCached:            0 kB
> > Active:           728308 kB
> > Inactive:         567428 kB
> > Active(anon):     639204 kB
> > Inactive(anon):   394932 kB
> > Active(file):      89104 kB
> > Inactive(file):   172496 kB
> > Unevictable:        2976 kB
> > Mlocked:            2992 kB
> > SwapTotal:       1464316 kB
> > SwapFree:        1464316 kB
> > Dirty:                52 kB
> > Writeback:             0 kB

Since dirty/writeback pages are low, the transfer does not seem to be
throttled by balance_dirty_pages().
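
To double-check that while the copy is running, a simple loop like this
(just an illustration) should show whether the Dirty/Writeback counters
ever climb:

        while true; do grep -E 'Dirty|Writeback' /proc/meminfo; sleep 1; done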

Stefan, would you please run this several times on the server?

ps -eo user,pid,tid,class,rtprio,ni,pri,psr,pcpu,vsz,rss,pmem,stat,wchan:28,cmd | grep scp

It will show where the scp task is blocked (the wchan field). Hope it helps.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-24  9:01   ` Stefan Priebe - Profihost AG
@ 2011-08-24  9:33     ` Wu Fengguang
  2011-08-25  9:00       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 16+ messages in thread
From: Wu Fengguang @ 2011-08-24  9:33 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List

On Wed, Aug 24, 2011 at 05:01:03PM +0800, Stefan Priebe - Profihost AG wrote:
> 
> >> sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2 && echo 0 > /proc/sys/vm/drop_caches
> 
> Another way to get it working again is to stop some processes. It could be
> mysql, apache, or php fcgi; it doesn't matter, just free some memory.
> Although there are already 5GB free.

Is it a NUMA machine, and does _every_ node have enough free pages?

        grep . /sys/devices/system/node/node*/vmstat
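
If numactl is installed, this should also give a quick per-node summary,
including free memory per node and the distance table:

        numactl --hardware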

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-24  9:33     ` Wu Fengguang
@ 2011-08-25  9:00       ` Stefan Priebe - Profihost AG
  2011-08-26  2:16         ` Wu Fengguang
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Priebe - Profihost AG @ 2011-08-25  9:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List

Am 24.08.2011 11:33, schrieb Wu Fengguang:
> On Wed, Aug 24, 2011 at 05:01:03PM +0800, Stefan Priebe - Profihost AG wrote:
>>
>>>> sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2 && echo 0 > /proc/sys/vm/drop_caches
>>
>> Another way to get it working again is to stop some processes. It could be
>> mysql, apache, or php fcgi; it doesn't matter, just free some memory.
>> Although there are already 5GB free.
>
> Is it a NUMA machine, and does _every_ node have enough free pages?
>
>          grep . /sys/devices/system/node/node*/vmstat
>
> Thanks,
> Fengguang
Hi Fengguang,

thanks for your fast reply.

Here is the data you requested:

root@server1015-han:~# grep . /sys/devices/system/node/node*/vmstat
/sys/devices/system/node/node0/vmstat:nr_written 5546561
/sys/devices/system/node/node0/vmstat:nr_dirtied 5572497
/sys/devices/system/node/node1/vmstat:nr_written 3936
/sys/devices/system/node/node1/vmstat:nr_dirtied 4190

modified it a little bit:
~# while [ true ]; do ps -eo 
user,pid,tid,class,rtprio,ni,pri,psr,pcpu,vsz,rss,pmem,stat,wchan:28,cmd 
| grep scp | grep -v grep; sleep 1; done

root     12409 12409 TS       -   0  19   0 59.8  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 64.0  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 67.7  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 70.6  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 73.5  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 76.0  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 78.2  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 80.0  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 80.9  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   2 76.7  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 75.6  42136  1724  0.0 Ds 
pipe_read                    scp -t /tmp/
root     12409 12409 TS       -   0  19   0 76.0  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 75.2  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 76.6  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 77.9  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 79.0  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 72.8  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 73.0  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 73.8  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 74.3  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.4  42136  1724  0.0 Ss 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 71.3  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 71.9  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 72.7  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   3 73.5  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   3 74.4  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   3 75.2  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 76.0  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 76.6  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 74.8  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.2  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.9  42136  1724  0.0 Rs 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 72.4  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.0  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.5  42136  1724  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.9  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 73.5  42136  1724  0.0 Rs 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1  0.0  42136  1728  0.0 Rs 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 23.0  42136  1728  0.0 Rs 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 49.5  42136  1728  0.0 Rs 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   2 63.3  42136  1728  0.0 Rs 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 71.5  42136  1728  0.0 Rs 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 77.4  42136  1728  0.0 Rs 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 70.3  42136  1728  0.0 Rs 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 73.1  42136  1728  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12566 12566 TS       -   0  19   0 65.7  42136  1728  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/
root     12566 12566 TS       -   0  19   1 61.2  42136  1728  0.0 Ss 
-                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 63.7  42136  1728  0.0 Rs 
-                            scp -t /tmp/
root     12636 12636 TS       -   0  19   8  0.0  42136  1728  0.0 Ss 
poll_schedule_timeout        scp -t /tmp/


Stefan


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-25  9:00       ` Stefan Priebe - Profihost AG
@ 2011-08-26  2:16         ` Wu Fengguang
  2011-08-26  2:54           ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 16+ messages in thread
From: Wu Fengguang @ 2011-08-26  2:16 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List

Hi Stefan,

> Here is the data you requested:
> 
> root@server1015-han:~# grep . /sys/devices/system/node/node*/vmstat
> /sys/devices/system/node/node0/vmstat:nr_written 5546561
> /sys/devices/system/node/node0/vmstat:nr_dirtied 5572497
> /sys/devices/system/node/node1/vmstat:nr_written 3936
> /sys/devices/system/node/node1/vmstat:nr_dirtied 4190

Ah, you are running an older kernel that doesn't show all the vmstat
numbers. But it's still revealing that node 0 is used heavily while node
1 is almost idle. So I wouldn't be surprised to see most free pages lie
in node 1.
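
The per-node meminfo files should confirm that directly:

        grep MemFree /sys/devices/system/node/node*/meminfo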

> modified it a little bit:
> ~# while [ true ]; do ps -eo 
> user,pid,tid,class,rtprio,ni,pri,psr,pcpu,vsz,rss,pmem,stat,wchan:28,cmd 
> | grep scp | grep -v grep; sleep 1; done
> 
> root     12409 12409 TS       -   0  19   0 59.8  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/

It's mostly doing poll() waits. There must be some dependency on
something else to make progress. Would you post the full ps output
for all tasks, and even better, run

        echo t > /proc/sysrq-trigger

to dump the kernel stacks?
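
The task dump can be large; if dmesg truncates it, booting with a bigger
log buffer (e.g. log_buf_len=4M) or reading with something like

        dmesg -s 16777216 > sysrq-t.txt

should capture the whole thing.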

Thanks,
Fengguang


> root     12409 12409 TS       -   0  19   0 64.0  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   0 67.7  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 70.6  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 73.5  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 76.0  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 78.2  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 80.0  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 80.9  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   2 76.7  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 75.6  42136  1724  0.0 Ds 
> pipe_read                    scp -t /tmp/
> root     12409 12409 TS       -   0  19   0 76.0  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 75.2  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 76.6  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 77.9  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 79.0  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 72.8  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   0 73.0  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   0 73.8  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 74.3  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 73.4  42136  1724  0.0 Ss 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 71.3  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 71.9  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   0 72.7  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   3 73.5  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   3 74.4  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   3 75.2  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   0 76.0  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 76.6  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 74.8  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 73.2  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   1 73.9  42136  1724  0.0 Rs 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   0 72.4  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 72.0  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 72.5  42136  1724  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 72.9  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12409 12409 TS       -   0  19   8 73.5  42136  1724  0.0 Rs 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   1  0.0  42136  1728  0.0 Rs 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   1 23.0  42136  1728  0.0 Rs 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   1 49.5  42136  1728  0.0 Rs 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   2 63.3  42136  1728  0.0 Rs 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   1 71.5  42136  1728  0.0 Rs 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   1 77.4  42136  1728  0.0 Rs 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   1 70.3  42136  1728  0.0 Rs 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   1 73.1  42136  1728  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12566 12566 TS       -   0  19   0 65.7  42136  1728  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> root     12566 12566 TS       -   0  19   1 61.2  42136  1728  0.0 Ss 
> -                            scp -t /tmp/
> root     12566 12566 TS       -   0  19   1 63.7  42136  1728  0.0 Rs 
> -                            scp -t /tmp/
> root     12636 12636 TS       -   0  19   8  0.0  42136  1728  0.0 Ss 
> poll_schedule_timeout        scp -t /tmp/
> 
> 
> Stefan


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-26  2:16         ` Wu Fengguang
@ 2011-08-26  2:54           ` Stefan Priebe - Profihost AG
  2011-08-26  3:03             ` Wu Fengguang
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Priebe - Profihost AG @ 2011-08-26  2:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List

Hi Wu,

> Ah, you are running an older kernel that doesn't show all the vmstat
> numbers. But it's still revealing that node 0 is used heavily while node
> 1 is almost idle. So I wouldn't be surprised to see most free pages lie
> in node 1.
I'm running a 2.6.38 kernel.

There is at least a numastat proc file.
grep . /sys/devices/system/node/node*/numastat
/sys/devices/system/node/node0/numastat:numa_hit 5958586
/sys/devices/system/node/node0/numastat:numa_miss 0
/sys/devices/system/node/node0/numastat:numa_foreign 0
/sys/devices/system/node/node0/numastat:interleave_hit 4191
/sys/devices/system/node/node0/numastat:local_node 5885189
/sys/devices/system/node/node0/numastat:other_node 73397
/sys/devices/system/node/node1/numastat:numa_hit 488922
/sys/devices/system/node/node1/numastat:numa_miss 0
/sys/devices/system/node/node1/numastat:numa_foreign 0
/sys/devices/system/node/node1/numastat:interleave_hit 4187
/sys/devices/system/node/node1/numastat:local_node 386741
/sys/devices/system/node/node1/numastat:other_node 102181

>> modified it a little bit:
>> ~# while [ true ]; do ps -eo
>> user,pid,tid,class,rtprio,ni,pri,psr,pcpu,vsz,rss,pmem,stat,wchan:28,cmd
>> | grep scp | grep -v grep; sleep 1; done
>>
>> root     12409 12409 TS       -   0  19   0 59.8  42136  1724  0.0 Ss
>> poll_schedule_timeout        scp -t /tmp/
>
> It's mostly doing poll() waits. There must be some dependency on
> something else to make progress. Would you post the full ps output
> for all tasks, and even better, run
complete ps output:
http://pastebin.com/raw.php?i=b948svzN

>          echo t > /proc/sysrq-trigger
sadly I was only able to grab the output in this crazy format:
http://pastebin.com/raw.php?i=MBXvvyH1

Hope that still helps.

Thanks, Stefan


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-26  2:54           ` Stefan Priebe - Profihost AG
@ 2011-08-26  3:03             ` Wu Fengguang
  2011-08-26  3:13               ` Stefan Priebe
  0 siblings, 1 reply; 16+ messages in thread
From: Wu Fengguang @ 2011-08-26  3:03 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List

On Fri, Aug 26, 2011 at 10:54:35AM +0800, Stefan Priebe - Profihost AG wrote:
> Hi Wu,
> 
> > Ah, you are running an older kernel that doesn't show all the vmstat
> > numbers. But it's still revealing that node 0 is used heavily while node
> > 1 is almost idle. So I wouldn't be surprised to see most free pages lie
> > in node 1.
> I'm running a 2.6.38 kernel.
> 
> There is at least a numastat proc file.

Thanks. This shows that node0 is accessed 10x more than node1.

> grep . /sys/devices/system/node/node*/numastat
> /sys/devices/system/node/node0/numastat:numa_hit 5958586
> /sys/devices/system/node/node0/numastat:numa_miss 0
> /sys/devices/system/node/node0/numastat:numa_foreign 0
> /sys/devices/system/node/node0/numastat:interleave_hit 4191
> /sys/devices/system/node/node0/numastat:local_node 5885189
> /sys/devices/system/node/node0/numastat:other_node 73397
> /sys/devices/system/node/node1/numastat:numa_hit 488922
> /sys/devices/system/node/node1/numastat:numa_miss 0
> /sys/devices/system/node/node1/numastat:numa_foreign 0
> /sys/devices/system/node/node1/numastat:interleave_hit 4187
> /sys/devices/system/node/node1/numastat:local_node 386741
> /sys/devices/system/node/node1/numastat:other_node 102181
> 
> >> modified it a little bit:
> >> ~# while [ true ]; do ps -eo
> >> user,pid,tid,class,rtprio,ni,pri,psr,pcpu,vsz,rss,pmem,stat,wchan:28,cmd
> >> | grep scp | grep -v grep; sleep 1; done
> >>
> >> root     12409 12409 TS       -   0  19   0 59.8  42136  1724  0.0 Ss
> >> poll_schedule_timeout        scp -t /tmp/
> >
> > It's mostly doing poll() waits. There must be some dependency on
> > something else to make progress. Would you post the full ps output
> > for all tasks, and even better, run
> complete ps output:
> http://pastebin.com/raw.php?i=b948svzN

In that log, scp happens to be in the R state, and no other tasks are in
the D state. Would you retry, in the hope of catching some stuck state?
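
A small loop like this (only a sketch) might make it easier to catch tasks
the moment they enter D state:

        while true; do ps -eo state,pid,wchan:30,cmd | awk '$1 == "D"'; sleep 1; done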

> >          echo t > /proc/sysrq-trigger
> sadly I was only able to grab the output in this crazy format:
> http://pastebin.com/raw.php?i=MBXvvyH1

It's pretty readable dmesg output, except that the data is incomplete and
there is nothing valuable in the uploaded portion...

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-26  3:03             ` Wu Fengguang
@ 2011-08-26  3:13               ` Stefan Priebe
  2011-08-26  3:26                 ` Wu Fengguang
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Priebe @ 2011-08-26  3:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List


>> There is at least a numastat proc file.
> 
> Thanks. This shows that node0 is accessed 10x more than node1.

What can I do to prevent this? Or isn't this normal when a machine mostly idles, so processes are mostly handled by cpu0?

> 
>> complete ps output:
>> http://pastebin.com/raw.php?i=b948svzN
> 
> In that log, scp happens to be in the R state, and no other tasks are in
> the D state. Would you retry, in the hope of catching some stuck state?
Sadly not, as the sysrq trigger rebooted the machine, and it will now run fine for 1 or 2 days.

> 
>>>         echo t > /proc/sysrq-trigger
>> sadly I was only able to grab the output in this crazy format:
>> http://pastebin.com/raw.php?i=MBXvvyH1
> 
> It's pretty readable dmesg output, except that the data is incomplete and
> there is nothing valuable in the uploaded portion...
That was everything I could grab through netconsole. Is there a better way?

Stefan
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-26  3:13               ` Stefan Priebe
@ 2011-08-26  3:26                 ` Wu Fengguang
  2011-08-26  3:30                   ` Zhu Yanhai
  0 siblings, 1 reply; 16+ messages in thread
From: Wu Fengguang @ 2011-08-26  3:26 UTC (permalink / raw)
  To: Stefan Priebe
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List

On Fri, Aug 26, 2011 at 11:13:07AM +0800, Stefan Priebe wrote:
> 
> >> There is at least a numastat proc file.
> > 
> > Thanks. This shows that node0 is accessed 10x more than node1.
> 
> What can I do to prevent this? Or isn't this normal when a machine mostly idles, so processes are mostly handled by cpu0?

Yes, that's normal. However, it should explain why it's slow even when
there are lots of free pages _globally_.

> > 
> >> complete ps output:
> >> http://pastebin.com/raw.php?i=b948svzN
> > 
> > In that log, scp happens to be in the R state, and no other tasks are in
> > the D state. Would you retry, in the hope of catching some stuck state?
> Sadly not, as the sysrq trigger rebooted the machine, and it will now run fine for 1 or 2 days.

Oops, sorry! It might be possible to reproduce the issue by manually
eating all of the memory with sparse file data:

        truncate -s 1T 1T
        cp 1T /dev/null
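
If numactl is available, binding the reader's memory to node 0 should make
sure the page cache fills up the busy node (an untested sketch):

        numactl --membind=0 cp 1T /dev/null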

> > 
> >>>         echo t > /proc/sysrq-trigger
> >> sadly I was only able to grab the output in this crazy format:
> >> http://pastebin.com/raw.php?i=MBXvvyH1
> > 
> > It's pretty readable dmesg output, except that the data is incomplete and
> > there is nothing valuable in the uploaded portion...
> That was everything I could grab through netconsole. Is there a better way?

netconsole is enough.  The partial output should be due to the reboot...

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-26  3:26                 ` Wu Fengguang
@ 2011-08-26  3:30                   ` Zhu Yanhai
  2011-08-26  6:18                     ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 16+ messages in thread
From: Zhu Yanhai @ 2011-08-26  3:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Stefan Priebe, Pekka Enberg, LKML, linux-mm@kvack.org,
	Andrew Morton, Mel Gorman, Jens Axboe, Linux Netdev List

Fengguang,
Maybe it's because of zone_reclaim_mode? We have often received reports
that scp or something similar is slow for no apparent reason, and mostly
it's because someone enabled zone_reclaim_mode by mistake.

Stefan, is your zone_reclaim_mode enabled? Try
'cat /proc/sys/vm/zone_reclaim_mode', and echo 0 to it to disable it.
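
To make that survive reboots, something like this in sysctl.conf should work:

        echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
        sysctl -p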

Thanks,
Zhu Yanhai

2011/8/26 Wu Fengguang <fengguang.wu@intel.com>:
> On Fri, Aug 26, 2011 at 11:13:07AM +0800, Stefan Priebe wrote:
>>
>> >> There is at least a numastat proc file.
>> >
>> > Thanks. This shows that node0 is accessed 10x more than node1.
>>
>> What can I do to prevent this? Or isn't this normal when a machine mostly idles, so processes are mostly handled by cpu0?
>
> Yes, that's normal. However, it should explain why it's slow even when
> there are lots of free pages _globally_.
>
>> >
>> >> complete ps output:
>> >> http://pastebin.com/raw.php?i=b948svzN
>> >
>> > In that log, scp happens to be in the R state, and no other tasks are in
>> > the D state. Would you retry, in the hope of catching some stuck state?
>> Sadly not, as the sysrq trigger rebooted the machine, and it will now run fine for 1 or 2 days.
>
> Oops, sorry! It might be possible to reproduce the issue by manually
> eating all of the memory with sparse file data:
>
>        truncate -s 1T 1T
>        cp 1T /dev/null
>
>> >
>> >>>         echo t > /proc/sysrq-trigger
>> >> sadly I was only able to grab the output in this crazy format:
>> >> http://pastebin.com/raw.php?i=MBXvvyH1
>> >
>> > It's pretty readable dmesg output, except that the data is incomplete and
>> > there is nothing valuable in the uploaded portion...
>> That was everything I could grab through netconsole. Is there a better way?
>
> netconsole is enough.  The partial output should be due to the reboot...
>
> Thanks,
> Fengguang
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-26  3:30                   ` Zhu Yanhai
@ 2011-08-26  6:18                     ` Stefan Priebe - Profihost AG
  2011-08-31  7:11                       ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Priebe - Profihost AG @ 2011-08-26  6:18 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: Wu Fengguang, Pekka Enberg, LKML, linux-mm@kvack.org,
	Andrew Morton, Mel Gorman, Jens Axboe, Linux Netdev List

Yanhai,

> Stefan, is your zone_reclaim_mode enabled? try 'cat
> /proc/sys/vm/zone_reclaim_mode',
> and echo 0 to it to disable.

you're absolutely correct, zone_reclaim_mode is on - but why?
There must be some Linux software which switches it on.

~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
~#

tells us nothing.

I've then read this:

"zone_reclaim_mode is set during bootup to 1 if it is determined that 
pages from remote zones will cause a measurable performance reduction. 
The page allocator will then reclaim easily reusable pages (those page 
cache pages that are currently not used) before allocating off node pages."

Why does the kernel do that here in our case on these machines?

Stefan


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-26  6:18                     ` Stefan Priebe - Profihost AG
@ 2011-08-31  7:11                       ` Stefan Priebe - Profihost AG
  2011-09-01  4:14                         ` Wu Fengguang
  0 siblings, 1 reply; 16+ messages in thread
From: Stefan Priebe - Profihost AG @ 2011-08-31  7:11 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: Wu Fengguang, Pekka Enberg, LKML, linux-mm@kvack.org,
	Andrew Morton, Mel Gorman, Jens Axboe, Linux Netdev List

Hi Fengguang,
Hi Yanhai,

> you're absolutely correct, zone_reclaim_mode is on - but why?
> There must be some Linux software which switches it on.
>
> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> ~#
>
> tells us nothing.
>
> I've then read this:
>
> "zone_reclaim_mode is set during bootup to 1 if it is determined that
> pages from remote zones will cause a measurable performance reduction.
> The page allocator will then reclaim easily reusable pages (those page
> cache pages that are currently not used) before allocating off node pages."
>
> Why does the kernel do that here in our case on these machines?

Can nobody explain why the kernel set it to 1 in this case?

Stefan


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-08-31  7:11                       ` Stefan Priebe - Profihost AG
@ 2011-09-01  4:14                         ` Wu Fengguang
  2011-09-01  5:41                           ` Stefan Priebe - Profihost AG
  2011-09-01 12:57                           ` Mel Gorman
  0 siblings, 2 replies; 16+ messages in thread
From: Wu Fengguang @ 2011-09-01  4:14 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Zhu Yanhai, Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton,
	Mel Gorman, Jens Axboe, Linux Netdev List, KOSAKI Motohiro

Hi Stefan,

On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
> Hi Fengguang,
> Hi Yanhai,
> 
> > you're absolutely correct, zone_reclaim_mode is on - but why?
> > There must be some Linux software which switches it on.
> >
> > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > ~#
> >
> > tells us nothing.
> >
> > I've then read this:
> >
> > "zone_reclaim_mode is set during bootup to 1 if it is determined that
> > pages from remote zones will cause a measurable performance reduction.
> > The page allocator will then reclaim easily reusable pages (those page
> > cache pages that are currently not used) before allocating off node pages."
> >
> > Why does the kernel do that here in our case on these machines?
> 
> Can nobody explain why the kernel set it to 1 in this case?

It's determined by RECLAIM_DISTANCE.

build_zonelists():

                /*
                 * If another node is sufficiently far away then it is better
                 * to reclaim pages in a zone before going off node.
                 */
                if (distance > RECLAIM_DISTANCE)
                        zone_reclaim_mode = 1;
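
You can check the distances the kernel sees on your box with, for example:

        cat /sys/devices/system/node/node*/distance

Any off-node distance above RECLAIM_DISTANCE (20 in 2.6.38) turns zone
reclaim on at boot.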

Since Linux v3.0, RECLAIM_DISTANCE has been increased from 20 to 30 by the
commit below. It may well help your case, too.

commit 32e45ff43eaf5c17f5a82c9ad358d515622c2562
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date:   Wed Jun 15 15:08:20 2011 -0700

    mm: increase RECLAIM_DISTANCE to 30
    
    Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
    that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
    Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
    a very traditional single-process model.
    
      * a master process which reads config files and manages the other
        process
      * multiple imapd processes, one per connection
      * multiple pop3d processes, one per connection
      * multiple lmtpd processes, one per connection
      * periodical "cleanup" processes.
    
    There are thousands of independent processes.  The problem is, recent
    Intel motherboard turn on zone_reclaim_mode by default and traditional
    prefork model software don't work well on it.  Unfortunatelly, such models
    are still typical even in the 21st century.  We can't ignore them.
    
    This patch raises the zone_reclaim_mode threshold to 30.  30 doesn't have
    any specific meaning.  but 20 means that one-hop QPI/Hypertransport and
    such relatively cheap 2-4 socket machine are often used for traditional
    servers as above.  The intention is that these machines don't use
    zone_reclaim_mode.
    
    Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
    This patch doesn't change such high-end NUMA machine behavior.
    
    Dave Hansen said:
    
    : I know specifically of pieces of x86 hardware that set the information
    : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
    : behavior which that implies.
    :
    : They've done performance testing and run very large and scary benchmarks
    : to make sure that they _want_ this turned on.  What this means for them
    : is that they'll probably be de-optimized, at least on newer versions of
    : the kernel.
    :
    : If you want to do this for particular systems, maybe _that_'s what we
    : should do.  Have a list of specific configurations that need the
    : defaults overridden either because they're buggy, or they have an
    : unusual hardware configuration not really reflected in the distance
    : table.

    And later said:
    
    : The original change in the hardware tables was for the benefit of a
    : benchmark.  Said benchmark isn't going to get run on mainline until the
    : next batch of enterprise distros drops, at which point the hardware where
    : this was done will be irrelevant for the benchmark.  I'm sure any new
    : hardware will just set this distance to another yet arbitrary value to
    : make the kernel do what it wants.  :)
    :
    : Also, when the hardware got _set_ to this initially, I complained.  So, I
    : guess I'm getting my way now, with this patch.  I'm cool with it.

diff --git a/include/linux/topology.h b/include/linux/topology.h
index b91a40e..fc839bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
  * (in whatever arch specific measurement units returned by node_distance())
  * then switch on zone reclaim on boot.
  */
-#define RECLAIM_DISTANCE 20
+#define RECLAIM_DISTANCE 30
 #endif
 #ifndef PENALTY_FOR_NODE_WITH_CPUS
 #define PENALTY_FOR_NODE_WITH_CPUS     (1)

Thanks,
Fengguang


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-09-01  4:14                         ` Wu Fengguang
@ 2011-09-01  5:41                           ` Stefan Priebe - Profihost AG
  2011-09-01 12:57                           ` Mel Gorman
  1 sibling, 0 replies; 16+ messages in thread
From: Stefan Priebe - Profihost AG @ 2011-09-01  5:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Zhu Yanhai, Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton,
	Mel Gorman, Jens Axboe, Linux Netdev List, KOSAKI Motohiro

Thanks!

Am 01.09.2011 06:14, schrieb Wu Fengguang:
> Hi Stefan,
>
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
>> Hi Fengguang,
>> Hi Yanhai,
>>
>>> you're absolutely correct, zone_reclaim_mode is on - but why?
>>> There must be some Linux software which switches it on.
>>>
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> tells us nothing.
>>>
>>> I've then read this:
>>>
>>> "zone_reclaim_mode is set during bootup to 1 if it is determined that
>>> pages from remote zones will cause a measurable performance reduction.
>>> The page allocator will then reclaim easily reusable pages (those page
>>> cache pages that are currently not used) before allocating off node pages."
>>>
>>> Why does the kernel do that here in our case on these machines?
>>
>> Can nobody explain why the kernel set it to 1 in this case?
>
> It's determined by RECLAIM_DISTANCE.
>
> build_zonelists():
>
>                  /*
>                   * If another node is sufficiently far away then it is better
>                   * to reclaim pages in a zone before going off node.
>                   */
>                  if (distance > RECLAIM_DISTANCE)
>                          zone_reclaim_mode = 1;
>
> Since Linux v3.0, RECLAIM_DISTANCE has been increased from 20 to 30 by the
> commit below. It may well help your case, too.
>
> commit 32e45ff43eaf5c17f5a82c9ad358d515622c2562
> Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date:   Wed Jun 15 15:08:20 2011 -0700
>
>      mm: increase RECLAIM_DISTANCE to 30
>
>      Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
>      that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
>      Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
>      a very traditional single-process model.
>
>        * a master process which reads config files and manages the other
>          process
>        * multiple imapd processes, one per connection
>        * multiple pop3d processes, one per connection
>        * multiple lmtpd processes, one per connection
>        * periodical "cleanup" processes.
>
>      There are thousands of independent processes.  The problem is, recent
>      Intel motherboard turn on zone_reclaim_mode by default and traditional
>      prefork model software don't work well on it.  Unfortunatelly, such models
>      are still typical even in the 21st century.  We can't ignore them.
>
>      This patch raises the zone_reclaim_mode threshold to 30.  30 doesn't have
>      any specific meaning.  but 20 means that one-hop QPI/Hypertransport and
>      such relatively cheap 2-4 socket machine are often used for traditional
>      servers as above.  The intention is that these machines don't use
>      zone_reclaim_mode.
>
>      Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
>      This patch doesn't change such high-end NUMA machine behavior.
>
>      Dave Hansen said:
>
>      : I know specifically of pieces of x86 hardware that set the information
>      : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
>      : behavior which that implies.
>      :
>      : They've done performance testing and run very large and scary benchmarks
>      : to make sure that they _want_ this turned on.  What this means for them
>      : is that they'll probably be de-optimized, at least on newer versions of
>      : the kernel.
>      :
>      : If you want to do this for particular systems, maybe _that_'s what we
>      : should do.  Have a list of specific configurations that need the
>      : defaults overridden either because they're buggy, or they have an
>      : unusual hardware configuration not really reflected in the distance
>      : table.
>
>      And later said:
>
>      : The original change in the hardware tables was for the benefit of a
>      : benchmark.  Said benchmark isn't going to get run on mainline until the
>      : next batch of enterprise distros drops, at which point the hardware where
>      : this was done will be irrelevant for the benchmark.  I'm sure any new
>      : hardware will just set this distance to another yet arbitrary value to
>      : make the kernel do what it wants.  :)
>      :
>      : Also, when the hardware got _set_ to this initially, I complained.  So, I
>      : guess I'm getting my way now, with this patch.  I'm cool with it.
>
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index b91a40e..fc839bf 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -60,7 +60,7 @@ int arch_update_cpu_topology(void);
>    * (in whatever arch specific measurement units returned by node_distance())
>    * then switch on zone reclaim on boot.
>    */
> -#define RECLAIM_DISTANCE 20
> +#define RECLAIM_DISTANCE 30
>   #endif
>   #ifndef PENALTY_FOR_NODE_WITH_CPUS
>   #define PENALTY_FOR_NODE_WITH_CPUS     (1)
>
> Thanks,
> Fengguang
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: slow performance on disk/network i/o full speed after drop_caches
  2011-09-01  4:14                         ` Wu Fengguang
  2011-09-01  5:41                           ` Stefan Priebe - Profihost AG
@ 2011-09-01 12:57                           ` Mel Gorman
  1 sibling, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2011-09-01 12:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Stefan Priebe - Profihost AG, Zhu Yanhai, Pekka Enberg, LKML,
	linux-mm@kvack.org, Andrew Morton, Jens Axboe, Linux Netdev List,
	KOSAKI Motohiro

On Thu, Sep 01, 2011 at 12:14:58PM +0800, Wu Fengguang wrote:
> Hi Stefan,
> 
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
> > Hi Fengguang,
> > Hi Yanhai,
> > 
> > > you're absolutely correct, zone_reclaim_mode is on - but why?
> > > There must be some Linux software which switches it on.
> > >
> > > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > > ~#
> > >
> > > tells us nothing.
> > >
> > > I've then read this:
> > >
> > > "zone_reclaim_mode is set during bootup to 1 if it is determined that
> > > pages from remote zones will cause a measurable performance reduction.
> > > The page allocator will then reclaim easily reusable pages (those page
> > > cache pages that are currently not used) before allocating off node pages."
> > >
> > > Why does the kernel do that here in our case on these machines?
> > 
> > Can nobody explain why the kernel set it to 1 in this case?
> 
> It's determined by RECLAIM_DISTANCE.
> 
> build_zonelists():
> 
>                 /*
>                  * If another node is sufficiently far away then it is better
>                  * to reclaim pages in a zone before going off node.
>                  */
>                 if (distance > RECLAIM_DISTANCE)
>                         zone_reclaim_mode = 1;
> 
> Since Linux v3.0, RECLAIM_DISTANCE has been increased from 20 to 30 by the
> commit below. It may well help your case, too.
> 

Even with that, it's known that zone_reclaim() can be a disaster when
it runs into problems. This should be fixed in 3.1 by the following
commits:

[cd38b115 mm: page allocator: initialise ZLC for first zone eligible for zone_reclaim]
[76d3fbf8 mm: page allocator: reconsider zones for allocation after direct reclaim]

The description in cd38b115 has the interesting details.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread


Thread overview: 16+ messages
     [not found] <4E5494D4.1050605@profihost.ag>
2011-08-24  6:20 ` slow performance on disk/network i/o full speed after drop_caches Pekka Enberg
2011-08-24  9:01   ` Stefan Priebe - Profihost AG
2011-08-24  9:33     ` Wu Fengguang
2011-08-25  9:00       ` Stefan Priebe - Profihost AG
2011-08-26  2:16         ` Wu Fengguang
2011-08-26  2:54           ` Stefan Priebe - Profihost AG
2011-08-26  3:03             ` Wu Fengguang
2011-08-26  3:13               ` Stefan Priebe
2011-08-26  3:26                 ` Wu Fengguang
2011-08-26  3:30                   ` Zhu Yanhai
2011-08-26  6:18                     ` Stefan Priebe - Profihost AG
2011-08-31  7:11                       ` Stefan Priebe - Profihost AG
2011-09-01  4:14                         ` Wu Fengguang
2011-09-01  5:41                           ` Stefan Priebe - Profihost AG
2011-09-01 12:57                           ` Mel Gorman
2011-08-24  9:32   ` Wu Fengguang
