From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
To: Shaohua Li <shli@kernel.org>
Cc: linux-mm@kvack.org,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Eberhard Pasch <epasch@de.ibm.com>
Subject: Re: [Resend] Puzzling behaviour with multiple swap targets
Date: Thu, 13 Feb 2014 17:12:40 +0100	[thread overview]
Message-ID: <52FCEEF8.6000108@linux.vnet.ibm.com> (raw)
In-Reply-To: <52E7D12E.4070703@linux.vnet.ibm.com>

Hi,
while working with Mel Gorman on another issue I can now confirm that his
patch https://lkml.org/lkml/2014/2/13/181 also fixes the issue discussed
in this thread.

I therefore want to encourage you to review and, if appropriate, pick up
Mel's patch, as it fixes not only the issue he described but also this one.

On 28/01/14 16:47, Christian Ehrhardt wrote:
> On 20/01/14 09:54, Christian Ehrhardt wrote:
>> On 20/01/14 02:05, Shaohua Li wrote:
>>> On Fri, Jan 17, 2014 at 01:39:43PM +0100, Christian Ehrhardt wrote:
> [...]
>>>
>>> Is the swap disk an SSD? If not, there is no point in partitioning the
>>> disk. Do you see any changes in iostat in the bad/good case, for
>>> example request size or iodepth?
>>
>> Hi,
>> I use normal disks, SSDs and even the special s390 ramdisks - I agree
>> that partitioning makes no sense in a real setup, but that doesn't
>> matter here. I only partition to better show the effect of "more swap
>> targets -> less throughput", and partitioning makes it easy for me to
>> guarantee that the HW resources serving that I/O stay the same.
>>
>> iostat and friends don't report very significant changes in I/O depth.
>> Request sizes are more interesting: the bad case has slightly more (16%)
>> read I/Os, and the average request size drops from 14.62 to 11.89. Along
>> with that goes a 28% drop in read request merges.
>>
>> But I don't see how a workload that is random in memory would create
>> significantly better or worse chances for request merging depending on
>> whether the disk is partitioned into more or fewer pieces.
>> On the read path swap doesn't care about iterating over disks, it just
>> goes by the associated swap extents -> offsets into the disk.
>> And I thought a random load should be purely random and hit each
>> partition, e.g. each of 4 partitions about 25% of the time.
>> I checked some blocktraces I had and can confirm that, as expected, each
>> got an equal share.
>>
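Just to illustrate that 25% expectation with a toy model (this is not the
kernel's swap allocator, only a simplified picture in which pages are
handed out round-robin across four equal targets at writeout time and then
read back in uniformly random order):

/*
 * Toy model only: pages go to 4 swap targets round-robin at "writeout"
 * time and are read back in uniformly random order; each target should
 * then serve roughly 25% of the reads.
 */
#include <stdio.h>
#include <stdlib.h>

#define NPAGES   (1L << 20)             /* 4G working set in 4K pages */
#define NTARGETS 4

int main(void)
{
        static unsigned char target_of[NPAGES];
        unsigned long reads[NTARGETS] = { 0 };
        long i;

        for (i = 0; i < NPAGES; i++)            /* round-robin "writeout" */
                target_of[i] = i % NTARGETS;

        srandom(42);
        for (i = 0; i < NPAGES; i++)            /* random "readback" */
                reads[target_of[random() % NPAGES]]++;

        for (i = 0; i < NTARGETS; i++)
                printf("target %ld: %.2f%% of reads\n",
                       i, 100.0 * reads[i] / NPAGES);
        return 0;
}
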
>>> There is one patch that can avoid swapin reading more than was swapped
>>> out in the random case, but it is not upstream yet. You can try it here:
>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/mm/swap_state.c?id=5d19b04a2dae73382fb607f16e2acfb594d1c63f
>>>
>>>
>>
>> Great suggestion - it sounds very interesting to me, I'll give it a try
>> in a few days since I'm out Tue/Wed.
>
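My rough understanding of the idea behind that patch - simplified and
paraphrased, not the actual code: keep the swapin readahead window
adaptive, bounded by 2^page-cluster, grow it while the speculatively read
pages actually get used and shrink it when they do not. Roughly:

/*
 * Sketch of an adaptive swapin readahead heuristic - my paraphrase of
 * the idea, not the actual patch: grow the window while the readahead
 * pays off, shrink it when the extra reads go unused.  Function and
 * variable names are made up for illustration.
 */
static unsigned int prev_win = 1;

unsigned int swapin_readahead_pages(unsigned int page_cluster,
                                    unsigned int recent_hits)
{
        unsigned int max = 1U << page_cluster;  /* hard upper bound */
        unsigned int win;

        if (recent_hits >= prev_win / 2)
                win = prev_win * 2;             /* readahead was used */
        else
                win = prev_win / 2;             /* mostly wasted I/O */

        if (win < 1)
                win = 1;                        /* always the faulting page */
        if (win > max)
                win = max;

        prev_win = win;
        return win;
}
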
> I already had a patch prepared and successfully tested that allows
> configuring the page cluster separately for reads and writes from
> userspace. That worked well, but it would require an admin to configure
> the "right" value for his system.
> Since that is exactly what fails with so many kernel tunables, and it
> would also not be adaptive if the behaviour changes over time, I very
> much prefer your solution.
> That is why I tried to verify your patch in my environment with at least
> some of the cases I recently used for swap analysis and improvement.
>
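For reference, the existing vm.page-cluster knob is a power of two: one
swapin readahead covers 2^page-cluster pages, so 0 means only the faulting
page is read. A small sanity-check program (assuming 4K pages):

/*
 * Print how many pages/KiB one swapin readahead covers for the current
 * vm.page-cluster setting (readahead = 2^page-cluster pages; 4K page
 * size assumed).
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/vm/page-cluster", "r");
        int pc;

        if (!f || fscanf(f, "%d", &pc) != 1) {
                perror("page-cluster");
                return 1;
        }
        fclose(f);
        printf("page-cluster=%d -> %d pages (%d KiB) per swapin readahead\n",
               pc, 1 << pc, (1 << pc) * 4);
        return 0;
}

With the default page-cluster of 3 that reports 8 pages / 32 KiB per
readahead, which is why the PC0 results below serve as the no-readahead
baseline.
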
> The environment has 10G of real memory and drives a working set of
> 12.5G, so just a slight 1.25:1 overcommit. (While s390 often runs at
> higher overcommit ratios, 1.25:1 was enough to trigger most of the Linux
> swap issues so far and produced more reliable results.)
> For swap I use 8x16G xpram devices, which one can think of as SSDs
> running at main memory speed (good for forecasting how SSDs might behave
> in a few years).
>
> I compared a 3.10 kernel (a bit old already, I know, but I knew that my
> env works fine with it) with and without the patch for swap readahead
> scaling.
>
> All memory is initially faulted in completely (memset) and then warmed
> up with two full sweeps of the entire working set using the current
> workload configuration.
> The unit reported is the MB/s the workload achieves in its
> (overcommitted) memory, averaged over 2 runs of 5 minutes each (plus the
> init and warmup as described).
> (Noise is usually ~+/-5%, maybe a bit more in non-exclusive runs like
> this, when other things are running on the machine.)
>
> Memory access is done via memcpy in either direction (R/W), with the
> copy sizes distributed as:
> 5% 65536 bytes
> 5%  8192 bytes
> 90% 4096 bytes
>
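To make that pattern concrete, a simplified sketch of the workload (not
the actual benchmark code; the iteration count and buffer handling are
made up for illustration):

/*
 * Simplified sketch of the access pattern above: memcpy to/from random
 * offsets in a 12.5G working set, with copy sizes drawn as 90% 4K,
 * 5% 8K and 5% 64K.
 */
#include <stdlib.h>
#include <string.h>

#define WSET (12UL * 1024 * 1024 * 1024 + 512UL * 1024 * 1024)  /* 12.5G */

static size_t pick_size(void)
{
        int r = rand() % 100;

        if (r < 90)
                return 4096;            /* 90% of accesses */
        if (r < 95)
                return 8192;            /*  5% */
        return 65536;                   /*  5% */
}

int main(void)
{
        static unsigned char buf[65536];
        unsigned char *wset = malloc(WSET);
        long i;

        if (!wset)
                return 1;
        memset(wset, 0x5a, WSET);                       /* fault it all in */

        for (i = 0; i < 10 * 1000 * 1000L; i++) {
                size_t len = pick_size();
                size_t off = (((size_t)rand() << 31) ^ rand()) % (WSET - len);

                if (i & 1)
                        memcpy(wset + off, buf, len);   /* write direction */
                else
                        memcpy(buf, wset + off, len);   /* read direction */
        }
        free(wset);
        return 0;
}
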
> Further abbreviations:
> PC = the currently configured page-cluster size (0, 3, 5)
> M  = multi-threaded (32 threads)
> S  = single-threaded
> Seq/Rnd = sequential/random
>
>                        No Swap RA   With Swap RA     Diff
> PC0-M-Rnd        ~=     10732.97        9891.87   -7.84%
> PC0-M-Seq        ~=     10780.56       10587.76   -1.79%
> PC0-S-Rnd        ~=      2010.47        2067.51    2.84%
> PC0-S-Seq        ~=      1783.74        1834.28    2.83%
> PC3-M-Rnd        ~=     10745.19       10990.90    2.29%
> PC3-M-Seq        ~=     11792.67       11107.79   -5.81%
> PC3-S-Rnd        ~=      1301.28        2017.61   55.05%
> PC3-S-Seq        ~=      1664.40        1637.72   -1.60%
> PC5-M-Rnd        ~=      7568.56       10733.60   41.82%
> PC5-M-Seq        ~=          n/a       11208.40      n/a
> PC5-S-Rnd        ~=       608.48        2052.17  237.26%
> PC5-S-Seq        ~=      1604.97        1685.65    5.03%
> (For PC5-M-Seq without swap RA I ran out of time, but the remaining
> results are interesting enough already.)
>
> I like what I see: nothing is significantly outside the noise range that
> shouldn't be.
> The page cluster 0 cases show no effect, as expected.
> For page cluster 3 the multi-threaded cases hide the throughput impact,
> simply because another thread can continue running in the meantime.
> But I checked the sar data and see that PC3-M-Rnd avoided about 50% of
> the swapins while staying at equal throughput (1000k vs 500k pswpin/s).
> Other than that, the random loads had the biggest improvements, matching
> what I got from splitting up the read/write page-cluster size.
> Finally, with page cluster 5 even the multi-threaded cases start to show
> the benefits of the readahead scaling code.
> Throughout all of this the sequential cases didn't change much.
>
> So I think that test worked fine. I see there was some discussion on the
> form of the implementation, but in terms of results I really like it, as
> far as I had time to check it out.
>
>
>
> *** Context switch ***
>
> Now back to my original question about why swapping to multiple targets
> makes things slower.
> Your patch helps there a bit, as the workload with the biggest issue was
> a random workload, and I knew that with page-cluster set to zero the loss
> of efficiency with those multiple swap targets is stopped.
> But I consider that only a fix of the symptom, and I would love it if
> someone came up with an idea of *why* things actually get worse with more
> swap targets.
>
