* [Resend] Puzzling behaviour with multiple swap targets
From: Christian Ehrhardt @ 2014-01-17 12:39 UTC
To: linux-mm, Shaohua Li
Cc: Christian Borntraeger, Heiko Carstens, Martin Schwidefsky, Eberhard Pasch
Hi,
/*
 * RESEND - due to the vacation time we all hopefully shared, this might
 * have slipped through mail filters and mass deletes - so I wanted to
 * give the question another chance.
 */
I've been analyzing swapping for a while now and have made some progress
tuning my system for better, faster and more efficient swapping. However,
one thing still eludes me.
I think by asking here we can only win: either it is trivial to you and
I get a better understanding, or you can take it as a brain teaser over
Christmas time :-)
Long Story Short - the Issue:
The more swap targets I use, the slower swapping becomes.
Details - Issue:
As mentioned before, I have already done a lot of analysis, including
simplifications of the testcase.
Therefore I only describe the most simplified setup and scenario.
I run a testcase (see below) that accesses overcommitted (1.25:1) memory in
4k chunks, selecting the offset randomly.
When swapping to a single disk I achieve about 20% more throughput
compared to taking that same disk, partitioning it into 4 equal pieces
and activating those as swap (see the sketch at the end of this section).
The workload only reads from that overcommitted memory.
According to my understanding, for read-only access the exact location
shouldn't matter.
The fault will find a page that was swapped out and discarded, and start the
I/O to bring it back, going via the swap extents.
There is simply no code in the fault-in path that cares much about the
partitions.
Also, as the workload is uniformly random, locality on disk should be
irrelevant, since the accesses to the four partitions are mapped to just
the same disk.
Still, the number of partitions on the same physical resource changes the
throughput I can achieve on memory.
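To make the comparison concrete, here is a minimal, hypothetical sketch of the
two configurations using the swapon(2) syscall directly (the usual route would
of course be mkswap plus swapon(8)). The device names and the equal-priority
choice are my assumptions for illustration, not details of the setup above.
---cut here---
/*
 * Hypothetical sketch only: device names and the equal-priority choice are
 * assumptions, not taken from the setup described above.
 */
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
	/* Case A: the whole disk as a single swap target. */
	const char *single[] = { "/dev/sda1" };
	/* Case B: the same disk split into four equal partitions. */
	const char *split[]  = { "/dev/sda1", "/dev/sda2", "/dev/sda3", "/dev/sda4" };

	const char **targets = split;	/* switch to 'single' for case A */
	int count = 4;
	int prio = 10;			/* same priority for every target */
	int flags = SWAP_FLAG_PREFER |
		    ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

	(void)single;
	for (int i = 0; i < count; i++)
		if (swapon(targets[i], flags))	/* needs root and mkswap'd devices */
			perror(targets[i]);
	return 0;
}
---cut here---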
Details - Setup:
My main system is a System z (zEnterprise zEC12, s390) machine with 10 GB of memory.
I have 2 CPUs (FYI, the issue appears no matter how many CPUs - tested 1-64).
The working set of the workload is 12.5 GB, so the overcommit ratio is a
light 1.25:1 (also tested from 1.02:1 up to 3:1 - it was visible in each
case, but 1.25:1 was the most stable).
As swap device I use one FCP-attached disk served by an IBM DS8870,
attached via 8x8Gb FCP adapters on both the server and the storage server.
The disk holds 256 GB, which keeps my case far away from 50% swap utilization.
Initially I used multiple disks, but the problem is more puzzling (as it
leaves less room for speculation) when just changing the number of
partitions on the same physical resource.
I verified it on an IBM X5 (Xeon X7560), and while the (local RAID 5)
disk devices there are much slower, they still show the same issue when
comparing 1 disk with 1 partition vs the same 1 disk with 4 partitions.
Remaining Leads:
Using iostat to compare swap disk activity with what my testcase can
achieve in memory shows that the "bad case" is less efficient.
That means it doesn't have less or slower disk I/O; in fact it usually has
slightly more disk I/O at about the same performance characteristics
as the "good case".
That implies the "efficiency" in the good case is better, meaning it is
more likely to have the "correct next page" at hand and in the swap cache.
That is confirmed by the fact that setting page_cluster to 0 eliminates
the difference between one and many partitions.
Unfortunately they then meet at the lower throughput level.
Also, I don't see what the mm/swap code could do right or wrong for a
workload accessing 4k pages in a randomized way.
There should be no statistically relevant locality in the workload that
could be exploited.
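For reference, since page_cluster keeps coming up: the tunable is the
logarithm of the swap readahead window, so 0 means a single page per swap-in
and the default of 3 means up to 8 pages. A minimal sketch that just reads the
sysctl from userspace (not kernel code):
---cut here---
/* Print the current swap readahead window implied by vm.page-cluster. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/page-cluster", "r");
	int pc = 3;	/* kernel default if the file cannot be read */

	if (f) {
		if (fscanf(f, "%d", &pc) != 1)
			pc = 3;
		fclose(f);
	}
	printf("page-cluster=%d -> up to %d pages (%d KB with 4k pages) per swap-in\n",
	       pc, 1 << pc, (1 << pc) * 4);
	return 0;
}
---cut here---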
Rejected theories:
I tested a lot of things already and some made it into tunings (I/O
scheduler, page_cluster, ...), but none of them fixed the "more swap
targets -> slower" issue.
- locking: lockstat showed nothing changing much between 1 and 4
partitions. In fact the 5 busiest locks were related to huge pages, and
disabling those got rid of the locks in lockstat but didn't affect the
throughput at all.
- scsi/blkdev: as complex multipath setups can often be a source of
issues, I used a special s390-only memory device called xpram. It is
essentially a block device that fulfils I/O requests at make_request
level at memory speed. That sped up my test a lot, but taking the same
xpram memory once as one chunk and once broken into 4 pieces, it still
was worse with the four pieces.
- already fixed: there was an upstream patch, commit ec8acf20 "swap: add
per-partition lock for swapfile" from Shaohua Li <shli@kernel.org>,
that pretty much sounds like the same issue. But it was already applied.
- kernel versions: while the majority of my tests were on 3.10.7, I
tested up to 3.12.2 and still saw the same issue.
- scaling in general: when I go from 1 to 4 partitions on a single disk
I see the mentioned ~20% drop in throughput.
But going further, e.g. 6 disks with 4 partitions each, stays at almost
the same level.
So it gets a bit worse, but the black magic seems to happen between 1 and 4.
Details - Workload:
While my original workload can be complex, with configurable threads,
background load and all kinds of accounting, I thought it better to
simplify it for this discussion. Therefore the code is now rather simple
and even lacks most of the usual checks, e.g. for null pointers - but it
is easy to understand what it does.
Essentially I allocate a given amount of memory - 12500 MB by default.
Then I initialize that memory, followed by a warmup phase of three runs
through the full working set. Then the real workload starts, accessing 4k
chunks at random offsets.
Since the code is so small now, I think it qualifies as inline:
---cut here---
/*
 * Simplified swap benchmark: allocate and touch more memory than the
 * machine has, then measure read throughput for random 4k accesses.
 * Build e.g. with: cc -O2 -o swaptest swaptest.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>

#define MB (1024*1024)

volatile sig_atomic_t stopme = 0;
unsigned chunk_size = 4096;
int duration = 600;		/* runtime of the measured phase in seconds */
size_t mem_size = 12500;	/* working set in MB, scaled to bytes in main() */

/* Return a pointer to a random chunk-aligned offset within the buffer. */
void *uniform_random_access(char *buffer)
{
	unsigned long offset;

	offset = ((unsigned long)(drand48()*(mem_size/chunk_size)))*chunk_size;
	return (void *)(((unsigned long)buffer)+offset);
}

void alrmhandler(int sig)
{
	signal(SIGALRM, SIG_IGN);
	printf("\n\nGot alarm, set stopme\n");
	stopme = 1;
}

int main(int argc, char *argv[])
{
	unsigned long j, i;
	double rmem;
	unsigned long local_reads = 0;
	void *read_buffer;
	char *c;

	mem_size = mem_size * MB;
	signal(SIGALRM, alrmhandler);

	/* Fault in the whole working set, then warm up with three full sweeps. */
	c = malloc(mem_size);
	read_buffer = malloc(chunk_size);
	memset(read_buffer, 1, chunk_size);
	memset(c, 1, mem_size);
	for (i = 0; i < 3; i++)
		for (j = 0; j < (mem_size/chunk_size); j++)
			memcpy(read_buffer, uniform_random_access(c), chunk_size);

	/* Measured phase: random 4k reads until the alarm fires. */
	alarm(duration);
	while (1) {
		for (j = 0; j < (mem_size/chunk_size); j++) {
			memcpy(read_buffer, uniform_random_access(c), chunk_size);
			local_reads++;
			if (stopme)
				goto out;
		}
	}
out:
	/* local_reads already counts every access, complete passes included. */
	rmem = ((double)local_reads * chunk_size) / MB;
	printf("Accumulated Read Throughput (mb/s): %20.2lf\n", rmem/duration);
	printf("%% of working set covered: %20.2lf\n",
	       (rmem/(mem_size/MB))*100.0);
	free(c);
	free(read_buffer);
	exit(0);
}
---cut here---
* Re: [Resend] Puzzling behaviour with multiple swap targets
From: Shaohua Li @ 2014-01-20 1:05 UTC
To: Christian Ehrhardt
Cc: linux-mm, Christian Borntraeger, Heiko Carstens, Martin Schwidefsky, Eberhard Pasch
On Fri, Jan 17, 2014 at 01:39:43PM +0100, Christian Ehrhardt wrote:
> Long Story Short - the Issue:
> The more swap targets I use, the slower swapping becomes.
[...]
Is the swap disk an SSD? If not, there is no point in partitioning the disk. Do
you see any changes in iostat between the bad/good case, for example request
size or iodepth?
There is one patch that can avoid swap-in reading more than was swapped out
for the random case, but it is not upstream yet. You can try it here:
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/mm/swap_state.c?id=5d19b04a2dae73382fb607f16e2acfb594d1c63f
Thanks,
Shaohua
* Re: [Resend] Puzzling behaviour with multiple swap targets
From: Christian Ehrhardt @ 2014-01-20 8:54 UTC
To: Shaohua Li
Cc: linux-mm, Christian Borntraeger, Heiko Carstens, Martin Schwidefsky, Eberhard Pasch
On 20/01/14 02:05, Shaohua Li wrote:
> On Fri, Jan 17, 2014 at 01:39:43PM +0100, Christian Ehrhardt wrote:
[...]
>
> Is the swap disk an SSD? If not, there is no point in partitioning the disk. Do
> you see any changes in iostat between the bad/good case, for example request
> size or iodepth?
Hi,
I use normal disks and SSDs, or even the special s390 ramdisks - I agree
that partitioning makes no sense in a real setup, but that doesn't matter
at the moment. I just partition to better show the effect that "more swap
targets -> less throughput" - and partitioning makes it easy for me to
guarantee that the HW resources serving that I/O stay the same.
iostat and similar tools don't report very significant changes regarding
I/O depth. Sizes are more interesting, with the bad case having slightly
more (16%) read I/Os and the average request size dropping from 14.62 to
11.89. Along with that goes a 28% drop in read request merges.
But I don't see how a workload that is random in memory would create
significantly better or worse chances for request merging depending on
whether the disk is partitioned into more or fewer pieces.
On the read path swap doesn't care about iterating disks; it just goes
by the associated swap extents -> offsets on the disk.
And I thought that in a random load this should be purely random and hit
each partition, e.g. in the 4-partition case, about 25% of the time.
I checked some blktrace data I had and can confirm that, as expected,
each partition got an equal share.
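Just to illustrate that sanity check, a small user-space sketch of the
assumption above (uniformly random offsets mapped onto four equal extents,
each seeing ~25%) - it models only that assumption, not the kernel's actual
swap slot allocator:
---cut here---
/* Models the assumption above: uniform offsets over four equal extents. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const int parts = 4;
	const long samples = 10000000;
	long hits[4] = { 0, 0, 0, 0 };

	srand48(42);
	for (long i = 0; i < samples; i++)
		hits[(int)(drand48() * parts)]++;	/* uniform offset -> partition */

	for (int p = 0; p < parts; p++)
		printf("partition %d: %5.2f%%\n", p, 100.0 * hits[p] / samples);
	return 0;
}
---cut here---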
> There is one patch that can avoid swap-in reading more than was swapped out
> for the random case, but it is not upstream yet. You can try it here:
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/mm/swap_state.c?id=5d19b04a2dae73382fb607f16e2acfb594d1c63f
Great suggestion - it sounds very interesting to me. I'll give it a try
in a few days, since I'm out Tue/Wed.
> Thanks,
> Shaohua
>
* Re: [Resend] Puzzling behaviour with multiple swap targets
From: Christian Ehrhardt @ 2014-01-28 15:47 UTC
To: Shaohua Li
Cc: linux-mm, Christian Borntraeger, Heiko Carstens, Martin Schwidefsky, Eberhard Pasch
On 20/01/14 09:54, Christian Ehrhardt wrote:
> On 20/01/14 02:05, Shaohua Li wrote:
[...]
>> There is one patch that can avoid swap-in reading more than was swapped out
>> for the random case, but it is not upstream yet. You can try it here:
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/mm/swap_state.c?id=5d19b04a2dae73382fb607f16e2acfb594d1c63f
>
> Great suggestion - it sounds very interesting to me. I'll give it a try
> in a few days, since I'm out Tue/Wed.
I had already prepared and successfully tested a patch that allows
configuring the page cluster separately for reads and writes from
userspace. That worked well, but it would require an admin to configure
the "right" value for their system.
Since that approach fails with so many kernel tunables, and would also not
be adaptive if the behaviour changes over time, I very much prefer your
solution.
That is why I tried to verify your patch in my environment with at least
some of the cases I used recently for swap analysis and improvement.
The environment has 10 GB of real memory and drives a working set of 12.5 GB,
so just a slight 1.25:1 overcommit (while s390 often runs at higher
overcommit ratios, for most of the Linux swap issues so far 1.25:1 was
enough to trigger them and produced more reliable results).
For swap I use 8x16 GB xpram devices, which one can imagine as SSDs at main
memory speed (good for forecasting how SSDs might behave in a few years).
I compared a 3.10 kernel (I know, a bit old already, but I knew that my
environment works fine with it) with and without the patch for swap
readahead scaling.
All memory is initially completely faulted in (memset) and then warmed
up with two full sweeps of the entire working set following the current
workload configuration.
The unit reported is the MB/s the workload can achieve in its
(overcommitted) memory, averaged over 2 runs of 5 minutes each (plus
the init and warmup as described).
(Noise is usually ~+/-5%, maybe a bit more in non-exclusive runs like
this when other things are on the machine.)
Memory access is done via memcpy in either direction (R/W) with
alternating sizes of:
 5% 65536 bytes
 5%  8192 bytes
90%  4096 bytes
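(Purely for illustration, a hypothetical reconstruction of how such a 5/5/90
mix could be drawn - the benchmark's real selection code is not part of this
thread:)
---cut here---
/* Hypothetical sketch of a 5% / 5% / 90% chunk size mix via drand48(). */
#include <stdio.h>
#include <stdlib.h>

static size_t pick_chunk_size(void)
{
	double r = drand48();

	if (r < 0.05)
		return 65536;	/*  5% of accesses */
	if (r < 0.10)
		return 8192;	/*  5% */
	return 4096;		/* 90% */
}

int main(void)
{
	long big = 0, mid = 0, small = 0;

	for (int i = 0; i < 1000000; i++) {
		size_t s = pick_chunk_size();
		if (s == 65536) big++;
		else if (s == 8192) mid++;
		else small++;
	}
	printf("64k: %ld  8k: %ld  4k: %ld\n", big, mid, small);
	return 0;
}
---cut here---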
Further abbreviations:
PC      = the currently configured page cluster size (0, 3, 5)
M       = multi-threaded (32 threads)
S       = single-threaded
Seq/Rnd = sequential/random access pattern
              No Swap RA   With Swap RA      Diff
PC0-M-Rnd ~=    10732.97        9891.87    -7.84%
PC0-M-Seq ~=    10780.56       10587.76    -1.79%
PC0-S-Rnd ~=     2010.47        2067.51     2.84%
PC0-S-Seq ~=     1783.74        1834.28     2.83%
PC3-M-Rnd ~=    10745.19       10990.90     2.29%
PC3-M-Seq ~=    11792.67       11107.79    -5.81%
PC3-S-Rnd ~=     1301.28        2017.61    55.05%
PC3-S-Seq ~=     1664.40        1637.72    -1.60%
PC5-M-Rnd ~=     7568.56       10733.60    41.82%
PC5-M-Seq ~=         n/a       11208.40       n/a
PC5-S-Rnd ~=      608.48        2052.17   237.26%
PC5-S-Seq ~=     1604.97        1685.65     5.03%
(For the PC5-M-Seq run without swap RA I ran out of time, but the remaining
results are interesting enough already.)
I like what I see; there is nothing significantly outside the noise range
that shouldn't be.
The page cluster 0 cases didn't show an effect, as expected.
For page cluster 3 the multi-threaded cases hide the impact on throughput,
because another thread can simply continue running.
But I checked sar data and see that PC3-M-Rnd avoided about 50% of the
swap-ins while staying at equal throughput (1000k vs 500k pswpin/s).
Other than that, the random loads had the biggest improvements, matching
what I saw with splitting up the read/write page-cluster size.
Eventually, with page cluster 5, even the multi-threaded cases start to
show benefits from the readahead scaling code.
In all that time the sequential cases didn't change much.
So I think that test worked fine. I see there were some discussions on
the form of the implementation, but in terms of results I really like it,
as far as I had time to check it out.
*** Context switch ***
Now back to my original question about why swapping to multiple targets
makes things slower.
Your patch helps there a bit, as the workload with the biggest issue was
a random workload, and I knew that with page_cluster set to zero the loss
of efficiency with those multiple swap targets is stopped.
But I consider that only a fix of the symptom, and would love it if someone
came up with an idea of why things actually get worse with more swap
targets.
* Re: [Resend] Puzzling behaviour with multiple swap targets
From: Christian Ehrhardt @ 2014-02-13 16:12 UTC
To: Shaohua Li
Cc: linux-mm, Christian Borntraeger, Heiko Carstens, Martin Schwidefsky, Eberhard Pasch
Hi,
regarding another issue I was working together with Mel Gorman, and I can
now confirm that his patch https://lkml.org/lkml/2014/2/13/181
also fixes the issue discussed in this thread.
Therefore I want to encourage you to review and, if appropriate, pick up
Mel's patch, as it fixes not only the issue he described but also this one.
On 28/01/14 16:47, Christian Ehrhardt wrote:
[...]
> *** Context switch ***
>
> Now back to my original question about why swapping to multiple targets
> makes things slower.
[...]