linux-mm.kvack.org archive mirror
* write_cache_pages inefficiency
@ 2011-11-06 21:48 Phillip Susi
  2011-11-09 16:45 ` Jan Kara
  0 siblings, 1 reply; 2+ messages in thread
From: Phillip Susi @ 2011-11-06 21:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm

I've read over write_cache_pages() in page-writeback.c, and related
writepages() functions, and it seems to me that it suffers from a
performance problem whenever an fsync is done on a file and some of
its pages have already begun writeback.  The comment in the code says:

 * If a page is already under I/O, write_cache_pages() skips it, even
 * if it's dirty.  This is desirable behaviour for memory-cleaning writeback,
 * but it is INCORRECT for data-integrity system calls such as fsync().  fsync()
 * and msync() need to guarantee that all the data which was dirty at the time
 * the call was made get new I/O started against them.  If wbc->sync_mode is
 * WB_SYNC_ALL then we were called for data integrity and we must wait for
 * existing IO to complete.

Based on this, I would expect the function to wait for an existing
write to complete only if the page is also dirty.  Instead, it waits
for existing page writes to complete regardless of the dirty bit.
Additionally, it performs each wait serially, so if you are trying to
fsync 1000 dirty pages and the first 10 are already being written
out, the thread blocks on each of those 10 writes in turn before it
begins queuing any new writes.

Instead, shouldn't it go ahead and initiate pagewrite on all pages not
already being written, then come back and wait for those already in
flight to complete, and initiate a second write on any that are still
dirty?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


* Re: write_cache_pages inefficiency
  2011-11-06 21:48 write_cache_pages inefficiency Phillip Susi
@ 2011-11-09 16:45 ` Jan Kara
  0 siblings, 0 replies; 2+ messages in thread
From: Jan Kara @ 2011-11-09 16:45 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-kernel, linux-mm

On Sun 06-11-11 16:48:33, Phillip Susi wrote:
> 
> I've read over write_cache_pages() in page-writeback.c, and related
> writepages() functions, and it seems to me that it suffers from a
> performance problem whenever an fsync is done on a file and some of
> its pages have already begun writeback.  The comment in the code says:
> 
>  * If a page is already under I/O, write_cache_pages() skips it, even
>  * if it's dirty.  This is desirable behaviour for memory-cleaning writeback,
>  * but it is INCORRECT for data-integrity system calls such as fsync().  fsync()
>  * and msync() need to guarantee that all the data which was dirty at the time
>  * the call was made get new I/O started against them.  If wbc->sync_mode is
>  * WB_SYNC_ALL then we were called for data integrity and we must wait for
>  * existing IO to complete.
> 
> Based on this, I would expect the function to wait for an existing
> write to complete only if the page is also dirty.  Instead, it waits
> for existing page writes to complete regardless of the dirty bit.
  Are you sure? I can see in the code:
                        lock_page(page);
                        if (unlikely(page->mapping != mapping)) {
continue_unlock:
                                unlock_page(page);
                                continue;
                        }
                        if (!PageDirty(page)) {
                                /* someone wrote it for us */
                                goto continue_unlock;
                        }
                        if (PageWriteback(page)) {
                                if (wbc->sync_mode != WB_SYNC_NONE)
                                        wait_on_page_writeback(page);
                                else
                                        goto continue_unlock;
                        }
  So we skip clean pages...

> Additionally, it performs each wait serially, so if you are trying to
> fsync 1000 dirty pages and the first 10 are already being written
> out, the thread blocks on each of those 10 writes in turn before it
> begins queuing any new writes.
  Yes, this is correct.

> Instead, shouldn't it go ahead and initiate pagewrite on all pages not
> already being written, then come back and wait for those already in
> flight to complete, and initiate a second write on any that are still
> dirty?
  Well, if you can *demonstrate* with real numbers that it has a performance
benefit, we could do it.  But it's not clear there will be any benefit:
skipping pages which need writing can introduce additional seeks into the
IO stream, and that is costly - sometimes much more costly than just waiting
for the IO to complete...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


