From: Andres Freund <andres@anarazel.de>
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Triggering non-integrity writeback from userspace
Date: Thu, 22 Oct 2015 15:15:55 +0200
Message-ID: <20151022131555.GC4378@alap3.anarazel.de>
Hi,
postgres regularly has to checkpoint data to disk to be able to free
data from its journal. We currently use buffered IO and that's not
going to change short term.
In a busy database this checkpointing process can write out a lot of
data. Currently that frequently leads to massive latency spikes
(cf. 20140326191113.GF9066@alap3.anarazel.de) for other processes doing
IO. These happen either when the kernel starts writeback or when, at the
end of the checkpoint, we issue an fsync() on the datafiles.
One odd issue there is that the kernel tends to do writeback in a very
irregular manner. Even if we write data at a constant rate, writeback
very often happens in bulk - not a good idea for preserving
interactivity.
What we're preparing to do now is to regularly issue
sync_file_range(SYNC_FILE_RANGE_WRITE) on a few blocks shortly after
we've written them to the OS. That way there's not too much dirty
data in the page cache, so writeback won't cause latency spikes, and the
fsync at the end doesn't have to write much if anything.
That improves things a lot.
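
A minimal sketch of the approach described above, not the actual postgres
code. The helper name, the 8192-byte block size and the interval of 32
blocks between hints are all assumptions made for illustration. Ignoring a
failed sync_file_range() is also a choice of this sketch: the call is only
a hint here, and the fsync() at the end is what provides durability.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes sync_file_range() in <fcntl.h> */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE  8192       /* assumption: size of one data block */
#define FLUSH_AFTER 32         /* assumption: blocks written between hints */

/*
 * Hypothetical helper: write nblocks blocks from buf to fd, starting at
 * offset 0, and ask the kernel to start writeback for each batch of
 * FLUSH_AFTER blocks shortly after it was written.
 * Returns 0 on success, -1 on a failed or short write or a failed fsync().
 */
int write_with_writeback_hints(int fd, const char *buf, size_t nblocks)
{
    off_t pos = 0;             /* current write offset */
    off_t hinted = 0;          /* data before this offset was already hinted */

    for (size_t i = 0; i < nblocks; i++) {
        ssize_t n = pwrite(fd, buf + i * BLOCK_SIZE, BLOCK_SIZE, pos);
        if (n != BLOCK_SIZE)
            return -1;
        pos += n;

        if (pos - hinted >= (off_t) FLUSH_AFTER * BLOCK_SIZE) {
            /* Only start writeout of the range; the result is ignored
             * because the fsync() below is what we rely on. */
            (void) sync_file_range(fd, hinted, pos - hinted,
                                   SYNC_FILE_RANGE_WRITE);
            hinted = pos;
        }
    }

    /* The final fsync() should now find little or no dirty data left. */
    return fsync(fd);
}
```

The interval is a trade-off: a smaller value keeps less dirty data in the
page cache, a larger one gives the kernel more room to merge writes.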
But I still see latency spikes that shouldn't be there given the amount
of IO. I'm wondering if that is related to the fact that
SYNC_FILE_RANGE_WRITE ends up doing __filemap_fdatawrite_range with
WB_SYNC_ALL specified. Given the documentation for
SYNC_FILE_RANGE_WRITE I did not expect that:
* SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
* are not presently under writeout. This is an asynchronous flush-to-disk
* operation. Not suitable for data integrity operations.
If I followed the code correctly - not a sure thing at all - that means
bios are submitted with WRITE_SYNC specified. Not really what's needed
in this case.
Now I think the docs are somewhat clear that SYNC_FILE_RANGE_WRITE isn't
there for data integrity, but it might be that people rely on it
nonetheless, so I'm loath to suggest changing that. But I do wonder if
there's a way non-integrity writeback triggering could be exposed to
userspace. A new fadvise flag seems like a good way to do that -
POSIX_FADV_DONTNEED actually does non-integrity writeback, but also does
other things, so it's not suitable for us.
Greetings,
Andres Freund