From: Jan Kara <jack@suse.cz>
To: Viktor Nagy <viktor.nagy@thx4games.com>
Cc: Jan Kara <jack@suse.cz>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
"Darrick J. Wong" <djwong@us.ibm.com>,
chris.mason@fusionio.com
Subject: Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour
Date: Wed, 10 Oct 2012 23:27:25 +0200 [thread overview]
Message-ID: <20121010212725.GA11642@quack.suse.cz> (raw)
In-Reply-To: <5075DE39.8090303@thx4games.com>
On Wed 10-10-12 22:44:41, Viktor Nagy wrote:
> On 10/10/2012 06:57 PM, Jan Kara wrote:
> > Hello,
> >
> >On Tue 09-10-12 11:41:16, Viktor Nagy wrote:
> >>Since Kernel version 3.0 pdflush blocks writes even the dirty bytes
> >>are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
> >>The kernel 2.6.39 works nice.
> >>
> >>How this hurt us in the real life: We have a very high performance
> >>game server where the MySQL have to do many writes along the reads.
> >>All writes and reads are very simple and have to be very quick. If
> >>we run the system with Linux 3.2 we get unacceptable performance.
> >>Now we are stuck with 2.6.32 kernel here because this problem.
> >>
> >>I attach the test program wrote by me which shows the problem. The
> >>program just writes blocks continously to random position to a given
> >>big file. The write rate limited to 100 MByte/s. In a well-working
> >>kernel it have to run with constant 100 MBit/s speed for indefinite
> >>long. The test have to be run on a simple HDD.
> >>
> >>Test steps:
> >>1. You have to use an XFS, EXT2 or ReiserFS partition for the test,
> >>Ext4 forces flushes periodically. I recommend to use XFS.
> >>2. create a big file on the test partiton. For 8 GByte RAM you can
> >>create a 2 GByte file. For 2 GB RAM I recommend to create 500MByte
> >>file. File creation can be done with this command: dd if=/dev/zero
> >>of=bigfile2048M.bin bs=1M count=2048
> >>3. compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
> >>4. run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin
> >>
> >>In the beginning there can be some slowness even on well-working
> >>kernels. If you create the bigfile in the same run then it runs
> >>usually smootly from the beginning.
> >>
> >>I don't know a setting of /proc/sys/vm variables which runs this
> >>test smootly on a 3.2.29 (3.0+) kernel. I think this is a kernel
> >>bug, because if I have much more "/proc/sys/vm/dirty_bytes" than the
> >>testfile size the test program should never be blocked.
> > I've run your program and I can confirm your results. As a side note,
> >your test program as a bug as it uses 'int' for offset arithmetics so when
> >the file is larger than 2 GB, you can hit some problems but for our case
> >that's not really important.
> Sorry for the bug and maybe the poor implementation. I am much
> better in Pascal than in C.
> (You can not make such mistake in Pascal (FreePascal). Is there a
> way (compiler switch) in C/C++ to get there a warning?)
Actually I somewhat doubt that even FreePascal is able to give you a
warning that arithmetic can overflow...
> >The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
> >writeback when grabbing pages to begin a write". At the first sight I was
> >somewhat surprised when I saw that code path in the traces but later when I
> >did some math it's clear. What the commit does is that when a page is just
> >being written out to disk, we don't allow it's contents to be changed and
> >wait for IO to finish before letting next write to proceed. Now if you have
> >1 GB file, that's 256000 pages. By the observation from my test machine,
> >writeback code keeps around 10000 pages in flight to disk at any moment
> >(this number fluctuates a lot but average is around that number). Your
> >program dirties about 25600 pages per second. So the probability one of
> >dirtied pages is a page under writeback is equal to 1 for all practical
> >purposes (precisely it is 1-(1-10000/256000)^25600). Actually, on average
> >you are going to hit about 1000 pages under writeback per second which
> >clearly has a noticeable impact (even single page can have). Pity I didn't
> >do the math when we were considering those patches.
> >
> >There were plans to avoid waiting if underlying storage doesn't need it but
> >I'm not sure how far that plans got (added a couple of relevant CCs).
> >Anyway you are about second or third real workload that sees regression due
> >to "stable pages" so we have to fix that sooner rather than later... Thanks
> >for your detailed report!
> >
> > Honza
> Thank you for your response!
>
> I'm very happy that I've found the right people.
>
> We develop a game server which gets very high load in some
> countries. We are trying to serve as much players as possible with
> one server.
> Currently the CPU usage is below the 50% at the peak times. And with
> the old kernel it runs smoothly. The pdflush runs non-stop on the
> database disk with ~3 MByte/s write (minimal read).
> This is at 43000 active sockets, 18000 rq/s, ~40000 packets/s.
> I think we are still below the theoratical limits of this server...
> but only if the disk writes are never done in sync.
>
> I will try the 3.2.31 kernel without the problematic commit
> (3d08bcc8 "mm: Wait for writeback when grabbing pages to begin a
> write").
> Is it a good idea? Will it be worse than 2.6.32?
Running without that commit should work just fine unless you use
something exotic like DIF/DIX or similar. Whether things will be worse than
in 2.6.32 I cannot say. For me, your test program behaves fine without that
commit but whether your real workload won't hit some other problem is
always a question. But if you hit another regression I'm interested in
hearing about it :).
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2012-10-10 21:27 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <5073F13C.5050807@thx4games.com>
2012-10-10 16:57 ` Linux 3.0+ Disk performance problem - wrong pdflush behaviour Jan Kara
2012-10-10 20:44 ` Viktor Nagy
2012-10-10 21:27 ` Jan Kara [this message]
2012-10-11 10:52 ` Viktor Nagy
2012-10-11 10:10 ` Jan Kara
2012-10-11 12:58 ` Viktor Nagy
2012-10-11 15:47 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20121010212725.GA11642@quack.suse.cz \
--to=jack@suse.cz \
--cc=chris.mason@fusionio.com \
--cc=djwong@us.ibm.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=viktor.nagy@thx4games.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).