dirty_expire_centisecs, msync behavior


* dirty_expire_centisecs, msync behavior
From: Howard Chu @ 2013-09-08  0:01 UTC
  To: Linux Kernel Mailing List

The documentation for dirty_expire_centisecs states: "Data which has been 
dirty in-memory for longer than this interval will be written out next time a 
flusher thread wakes up."

In practice, it appears that once the expire time has passed, all dirty pages 
get flushed, regardless of their age. This behavior makes this setting fairly 
useless. This appears to have been the behavior for most of 2.6 and 3.x. Can 
anyone explain, is the current behavior really as intended, and is the doc 
just out of date?

On a slightly related note, what was the key problem with this patch "msync: 
support syncing a small part of the file"? 
http://thread.gmane.org/gmane.linux.kernel/1313767/focus=1317498

Andrew Morton's message states that Paolo's patch would break nonlinear 
mappings, and the matter was dropped. Why wasn't it possible to write a patch 
that would also work with nonlinear mappings? I couldn't find any earlier 
context for that subject, pointers welcome.

My interest in both of these questions stems from what I've observed while 
testing the LMDB memory-mapped database. On a machine with 32GB RAM, using a 
database that occupies about 18GB of memory, doing continuous writes to the DB 
without ever calling msync, and default writeback settings, I see DB 
throughput spike downward every time the flusher wakes up. The DB is a mmap'd 
file on an XFS partition, and a DB write operation simply dirties a random set 
of pages. After the program has been running for more than 
dirty_expire_centisecs, every dirty_writeback_centisecs the DB app basically 
stops while the flusher writes out all the dirty pages.
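
To make the two intervals mentioned above concrete, here is a minimal sketch that reads them and converts centiseconds to seconds. The helper names (`read_centisecs`, `writeback_settings`, `demo`) are illustrative and not part of any kernel or library API. The demo writes sample values (3000 and 500, the commonly documented defaults) into a temporary directory so that it runs on any system; on Linux the same functions can be pointed at `/proc/sys/vm`.

```python
import os
import tempfile


def read_centisecs(name, base="/proc/sys/vm"):
    """Return one vm.*_centisecs tunable converted to seconds."""
    with open(os.path.join(base, name)) as f:
        return int(f.read().strip()) / 100.0


def writeback_settings(base="/proc/sys/vm"):
    """Collect the expire and wakeup intervals discussed in this thread."""
    return {
        "expire_s": read_centisecs("dirty_expire_centisecs", base),
        "writeback_s": read_centisecs("dirty_writeback_centisecs", base),
    }


def demo():
    # Sample values only: a temporary directory stands in for /proc/sys/vm.
    samples = {
        "dirty_expire_centisecs": "3000",
        "dirty_writeback_centisecs": "500",
    }
    with tempfile.TemporaryDirectory() as d:
        for name, value in samples.items():
            with open(os.path.join(d, name), "w") as f:
                f.write(value + "\n")
        return writeback_settings(d)
```

With the sample values, data is eligible for writeback after 30 seconds and the flusher wakes every 5 seconds, which matches the periodic stalls described in the paragraph above.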

I'm curious about a couple things - since the DB knows which pages it is 
dirtying in a given transaction, would it help overall throughput if the DB 
told the OS (via msync) exactly which ranges to flush? Obviously not, in the 
current implementation of msync, but can a patch like Paolo's make this 
better? And can the dirty_expire_centisecs behavior be fixed, so that it's 
only writing out a smaller set of pages on each wakeup? What else can we do to 
minimize the impact of the flusher? If I turn it off completely the throughput 
nearly doubles, from 5100 DB writes/sec to 9000/sec. If I turn off the timed 
flush and just use dirty_background_bytes the throughput just slows to around 
7000/sec.
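
The idea of telling the OS exactly which range to flush can be sketched from user space as follows. This is only an illustration, not LMDB code: the function names and the `txn-1` payload are hypothetical. Python's `mmap.flush(offset, size)` is implemented with `msync()` on Unix and requires a page-aligned offset. Whether the kernel actually limits writeback to that range depends on the kernel version, which is the question raised in this message.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def dirty_and_flush(path, page_index, payload):
    """Write payload into one page of a mapped file, then flush only that page."""
    assert len(payload) <= PAGE
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            start = page_index * PAGE
            m[start:start + len(payload)] = payload
            # Ranged flush: the offset must be a multiple of the page size.
            m.flush(start, PAGE)


def demo():
    fd, path = tempfile.mkstemp()
    try:
        # Create a four-page file so there is something to map.
        os.write(fd, b"\0" * (4 * PAGE))
        os.close(fd)
        dirty_and_flush(path, 2, b"txn-1")
        with open(path, "rb") as f:
            f.seek(2 * PAGE)
            return f.read(5)
    finally:
        os.remove(path)
```

A database that tracks the pages touched by each transaction could issue one such call per contiguous dirty range at commit time.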

It seems to me the main slowdown is because the OS is locking dirty pages 
indiscriminately. The DB does copy-on-write, so pages that it dirties in one 
transaction will not be written again in the next transaction. I would have 
expected read-only accesses to these pages to be able to progress without any 
delay but that doesn't seem to be the case.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/


* Re: dirty_expire_centisecs, msync behavior
From: Jan Kara @ 2013-09-10 20:45 UTC
  To: Howard Chu; +Cc: Linux Kernel Mailing List

  Hello,

On Sat 07-09-13 17:01:10, Howard Chu wrote:
> The documentation for dirty_expire_centisecs states: "Data which has
> been dirty in-memory for longer than this interval will be written
> out next time a flusher thread wakes up."
> 
> In practice, it appears that once the expire time has passed, all
> dirty pages get flushed, regardless of their age. This behavior
> makes this setting fairly useless. This appears to have been the
> behavior for most of 2.6 and 3.x. Can anyone explain, is the current
> behavior really as intended, and is the doc just out of date?
  What really happens is that all inodes which have been dirtied before
'expire time' are completely flushed.

> On a slightly related note, what was the key problem with this patch
> "msync: support syncing a small part of the file"?
> http://thread.gmane.org/gmane.linux.kernel/1313767/focus=1317498
> 
> Andrew Morton's message states that Paolo's patch would break
> nonlinear mappings, and the matter was dropped. Why wasn't it
> possible to write a patch that would also work with nonlinear
> mappings? I couldn't find any earlier context for that subject,
> pointers welcome.
  It is certainly possible. But actually I'm not 100% sure it is worth it.
Because each fsync() call has a certain overhead in the filesystem and that
is rather considerable - forcing a journal transaction to disk, flushing
disk caches, ... So splitting one large fsync() into several smaller ones
(even if they together write significantly fewer pages) is often slower.
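
The trade-off can be illustrated with a simple cost model: each sync call pays a fixed overhead (journal commit, cache flush) plus a per-page write cost. The numbers below are hypothetical, chosen only to show the shape of the argument, and are not measurements from any filesystem.

```python
def sync_cost_us(calls, pages, fixed_us, per_page_us):
    """Total cost in microseconds: fixed overhead per call plus cost per page."""
    return calls * fixed_us + pages * per_page_us


def demo():
    fixed_us = 10_000   # hypothetical cost of one journal commit + cache flush
    per_page_us = 10    # hypothetical cost of writing one page
    one_big = sync_cost_us(1, 10_000, fixed_us, per_page_us)
    many_small = sync_cost_us(50, 2_000, fixed_us, per_page_us)
    return one_big, many_small
```

Under these assumptions the 50 small syncs write five times fewer pages yet cost 520,000 us against 110,000 us for the single large sync, because the fixed overhead dominates.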
 
> My interest in both of these questions stems from what I've observed
> while testing the LMDB memory-mapped database. On a machine with
> 32GB RAM, using a database that occupies about 18GB of memory, doing
> continuous writes to the DB without ever calling msync, and default
> writeback settings, I see DB throughput spike downward every time
> the flusher wakes up. The DB is a mmap'd file on an XFS partition,
> and a DB write operation simply dirties a random set of pages. After
> the program has been running for more than dirty_expire_centisecs,
> every dirty_writeback_centisecs the DB app basically stops while the
> flusher writes out all the dirty pages.
  What kernel version are you using? What you describe sounds like the
problems that happened due to the 'stable pages under writeback' work. We
didn't allow a page to be redirtied while it was under writeback. In 3.10
we fixed that so workloads that are redirtying pages should be improved.

> I'm curious about a couple things - since the DB knows which pages
> it is dirtying in a given transaction, would it help overall
> throughput if the DB told the OS (via msync) exactly which ranges to
> flush? Obviously not, in the current implementation of msync, but
> can a patch like Paolo's make this better? And can the
> dirty_expire_centisecs behavior be fixed, so that it's only writing
> out a smaller set of pages on each wakeup? What else can we do to
> minimize the impact of the flusher? If I turn it off completely the
> throughput nearly doubles, from 5100 DB writes/sec to 9000/sec. If I
> turn off the timed flush and just use dirty_background_bytes the
> throughput just slows to around 7000/sec.
  After 3.10 a running flusher should have rather minimal impact on the
parallel mmap workload. It still locks the page when submitting it for IO
but when the underlying blocks are allocated (which is your case I believe)
the interval during which the page is locked is very short.

> It seems to me the main slowdown is because the OS is locking dirty
> pages indiscriminately. The DB does copy-on-write, so pages that it
> dirties in one transaction will not be written again in the next
> transaction. I would have expected read-only accesses to these pages
> to be able to progress without any delay but that doesn't seem to be
> the case.
  So I would be really surprised if read-only access to the pages blocked
because you shouldn't really enter the kernel at all if those pages are
already mapped and faulted in.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: dirty_expire_centisecs, msync behavior
From: Howard Chu @ 2013-09-10 21:46 UTC
  To: Jan Kara; +Cc: Linux Kernel Mailing List

Jan Kara wrote:
>    Hello,

Hi Jan, thanks for your answers.

> On Sat 07-09-13 17:01:10, Howard Chu wrote:
>> The documentation for dirty_expire_centisecs states: "Data which has
>> been dirty in-memory for longer than this interval will be written
>> out next time a flusher thread wakes up."
>>
>> In practice, it appears that once the expire time has passed, all
>> dirty pages get flushed, regardless of their age. This behavior
>> makes this setting fairly useless. This appears to have been the
>> behavior for most of 2.6 and 3.x. Can anyone explain, is the current
>> behavior really as intended, and is the doc just out of date?
>    What really happens is that all inodes which have been dirtied before
> 'expire time' are completely flushed.

Still it appears to be more than that. If I suspend the writer, I can see 
(using atop) that the flusher always keeps writing until the number of dirty 
pages is zero, and that happens in much less time than the expire time. This is 
on an Ubuntu build 3.5.0-23-generic. Perhaps this behavior has also changed in 
more recent kernels? Another person has reported the same thing using 3.0:

http://stackoverflow.com/questions/18353467/implementation-of-dirty-expire-centisecs
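
One way to reproduce this kind of observation without atop is to poll the `Dirty` and `Writeback` fields of `/proc/meminfo`. The sketch below only parses the text; the function names and the `SAMPLE` values are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def dirty_kb(text):
    """Return (Dirty, Writeback) in kB, defaulting to 0 if a field is absent."""
    info = parse_meminfo(text)
    return info.get("Dirty", 0), info.get("Writeback", 0)


# Made-up sample in the /proc/meminfo layout.
SAMPLE = (
    "MemTotal:       32899012 kB\n"
    "Dirty:              1204 kB\n"
    "Writeback:             0 kB\n"
)
```

On a Linux machine, calling `dirty_kb(open("/proc/meminfo").read())` once per second while the writer is suspended would show how quickly the dirty count drains.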

>> On a slightly related note, what was the key problem with this patch
>> "msync: support syncing a small part of the file"?
>> http://thread.gmane.org/gmane.linux.kernel/1313767/focus=1317498
>>
>> Andrew Morton's message states that Paolo's patch would break
>> nonlinear mappings, and the matter was dropped. Why wasn't it
>> possible to write a patch that would also work with nonlinear
>> mappings? I couldn't find any earlier context for that subject,
>> pointers welcome.
>    It is certainly possible. But actually I'm not 100% sure it is worth it.
> Because each fsync() call has a certain overhead in the filesystem and that
> is rather considerable - forcing a journal transaction to disk, flushing
> disk caches, ... So splitting one large fsync() into several smaller ones
> (even if they together write significantly fewer pages) is often slower.

OK... But does msync() have to do that? Is msync() closer to fsync() in 
behavior, or just fdatasync()? And also, if you're using something without 
journaling, like ext2, I would think it's a pure win.

>> My interest in both of these questions stems from what I've observed
>> while testing the LMDB memory-mapped database. On a machine with
>> 32GB RAM, using a database that occupies about 18GB of memory, doing
>> continuous writes to the DB without ever calling msync, and default
>> writeback settings, I see DB throughput spike downward every time
>> the flusher wakes up. The DB is a mmap'd file on an XFS partition,
>> and a DB write operation simply dirties a random set of pages. After
>> the program has been running for more than dirty_expire_centisecs,
>> every dirty_writeback_centisecs the DB app basically stops while the
>> flusher writes out all the dirty pages.
>    What kernel version are you using? What you describe sounds like the
> problems that happened due to 'stable pages under writeback' work. We
> didn't allow page to be redirtied while it was under writeback. In 3.10
> we fixed that so workloads that are redirtying pages should be improved.

Currently using 3.5 (as noted earlier in this reply). Out of curiosity, do you 
happen to know how long the pre-3.10 behavior has existed? Is it a 3.x change 
that wasn't present in 2.6?

>> I'm curious about a couple things - since the DB knows which pages
>> it is dirtying in a given transaction, would it help overall
>> throughput if the DB told the OS (via msync) exactly which ranges to
>> flush? Obviously not, in the current implementation of msync, but
>> can a patch like Paolo's make this better? And can the
>> dirty_expire_centisecs behavior be fixed, so that it's only writing
>> out a smaller set of pages on each wakeup? What else can we do to
>> minimize the impact of the flusher? If I turn it off completely the
>> throughput nearly doubles, from 5100 DB writes/sec to 9000/sec. If I
>> turn off the timed flush and just use dirty_background_bytes the
>> throughput just slows to around 7000/sec.
>    After 3.10 running flusher should have rather minimal impact on the
> parallel mmap workload. It still locks the page when submitting it for IO
> but when the underlying blocks are allocated (which is your case I believe)
> this interval when the page is locked is very short.

Sounds promising, will have to look into retesting with a 3.10 kernel.

>> It seems to me the main slowdown is because the OS is locking dirty
>> pages indiscriminately. The DB does copy-on-write, so pages that it
>> dirties in one transaction will not be written again in the next
>> transaction. I would have expected read-only accesses to these pages
>> to be able to progress without any delay but that doesn't seem to be
>> the case.
>    So I would be really surprised if read-only access to the pages blocked
> because you shouldn't really enter the kernel at all if those pages are
> already mapped and faulted in.

OK. Quite sure that all the pages are mapped and present. Perhaps it was all 
due to the writes. I'll know more when I've had a chance to test 3.10.

Thanks again for the info.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/


* Re: dirty_expire_centisecs, msync behavior
From: Jan Kara @ 2013-09-10 22:39 UTC
  To: Howard Chu; +Cc: Jan Kara, Linux Kernel Mailing List

On Tue 10-09-13 14:46:52, Howard Chu wrote:
> >On Sat 07-09-13 17:01:10, Howard Chu wrote:
> >>The documentation for dirty_expire_centisecs states: "Data which has
> >>been dirty in-memory for longer than this interval will be written
> >>out next time a flusher thread wakes up."
> >>
> >>In practice, it appears that once the expire time has passed, all
> >>dirty pages get flushed, regardless of their age. This behavior
> >>makes this setting fairly useless. This appears to have been the
> >>behavior for most of 2.6 and 3.x. Can anyone explain, is the current
> >>behavior really as intended, and is the doc just out of date?
> >   What really happens is that all inodes which have been dirtied before
> >'expire time' are completely flushed.
> 
> Still it appears to be more than that. If I suspend the writer, I
> can see (using atop) that the flusher always keeps writing until the
> number of dirty pages is zero, and that happens in much shorter than
> the expire time. This is on an Ubuntu build 3.5.0-23-generic.
> Perhaps this behavior has also changed in more recent kernels?
> Another person has reported the same thing using 3.0
> 
> http://stackoverflow.com/questions/18353467/implementation-of-dirty-expire-centisecs
  Well, let me explain the mechanism in more detail: When the first page is
dirtied in an inode, the current time is recorded in the inode. When this
time gets older than dirty_expire_centisecs, all dirty pages in the inode
are written. So with this mechanism in mind the behavior you describe looks
expected to me.
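
The mechanism described above can be mimicked with a toy model. This is a simplified sketch of the description in this message, not kernel code: the `Inode` class, `flusher_wakeup`, and the time units are all hypothetical. Per-page dirty times are kept only so the output can show how young some of the flushed pages are.

```python
class Inode:
    def __init__(self, name):
        self.name = name
        self.dirtied_when = None   # time the first page was dirtied
        self.dirty_pages = {}      # page index -> time it was dirtied

    def dirty(self, page, now):
        if self.dirtied_when is None:
            self.dirtied_when = now   # recorded only for the first dirty page
        self.dirty_pages[page] = now


def flusher_wakeup(inodes, now, expire):
    """Write every dirty page of each inode whose timestamp has expired."""
    written = []
    for inode in inodes:
        if inode.dirtied_when is not None and now - inode.dirtied_when > expire:
            for page, t in sorted(inode.dirty_pages.items()):
                written.append((inode.name, page, now - t))
            inode.dirty_pages.clear()
            inode.dirtied_when = None
    return written


def demo():
    db = Inode("db")
    db.dirty(0, now=0)     # first dirty page sets the inode timestamp
    db.dirty(7, now=29)    # dirtied much later, in the same inode
    log = Inode("log")
    log.dirty(3, now=20)
    return flusher_wakeup([db, log], now=31, expire=30)
```

In this model the wakeup at time 31 writes both pages of `db`, including page 7, which has been dirty for only 2 time units, while `log` is left alone. For a database kept in a single large file, that amounts to flushing everything once the oldest dirty page expires.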

> >>On a slightly related note, what was the key problem with this patch
> >>"msync: support syncing a small part of the file"?
> >>http://thread.gmane.org/gmane.linux.kernel/1313767/focus=1317498
> >>
> >>Andrew Morton's message states that Paolo's patch would break
> >>nonlinear mappings, and the matter was dropped. Why wasn't it
> >>possible to write a patch that would also work with nonlinear
> >>mappings? I couldn't find any earlier context for that subject,
> >>pointers welcome.
> >   It is certainly possible. But actually I'm not 100% sure it is worth it.
> >Because each fsync() call has a certain overhead in the filesystem and that
> >is rather considerable - forcing a journal transaction to disk, flushing
> >disk caches, ... So splitting one large fsync() into several smaller ones
> >(even if they together write significantly fewer pages) is often slower.
> 
> OK... But does msync() have to do that? Is msync() closer to fsync()
> in behavior, or just fdatasync()?
  Requirements of msync() seem equivalent to fdatasync(). But both fsync()
and fdatasync() have similar requirements wrt journalling and cache
flushes. We can save committing some transactions if the only updates to the
inode are timestamps (and we do that optimization in ext4) but still it is
pretty expensive.

> And also, if you're using something without journaling, like ext2, I
> would think it's a pure win.
  True, for some filesystems, some workloads, or some HW configurations it
will be a win. For others it will be a loss. BTW, even ext2 should flush
disk caches after fsync(2). Otherwise you can still lose the data after a
power failure.
 
> >>My interest in both of these questions stems from what I've observed
> >>while testing the LMDB memory-mapped database. On a machine with
> >>32GB RAM, using a database that occupies about 18GB of memory, doing
> >>continuous writes to the DB without ever calling msync, and default
> >>writeback settings, I see DB throughput spike downward every time
> >>the flusher wakes up. The DB is a mmap'd file on an XFS partition,
> >>and a DB write operation simply dirties a random set of pages. After
> >>the program has been running for more than dirty_expire_centisecs,
> >>every dirty_writeback_centisecs the DB app basically stops while the
> >>flusher writes out all the dirty pages.
> >   What kernel version are you using? What you describe sounds like the
> >problems that happened due to 'stable pages under writeback' work. We
> >didn't allow page to be redirtied while it was under writeback. In 3.10
> >we fixed that so workloads that are redirtying pages should be improved.
> 
> Currently using 3.5 (as noted earlier in this reply). Out of
> curiosity, do you happen to know how long the pre-3.10 behavior has
> existed? Is it a 3.x change that wasn't present in 2.6?
  3.0 was the first kernel with this problematic logic.
 
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

