All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Phillip Susi <psusi@cfl.rr.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: write_cache_pages inefficiency
Date: Wed, 9 Nov 2011 17:45:37 +0100	[thread overview]
Message-ID: <20111109164537.GA7495@quack.suse.cz> (raw)
In-Reply-To: <4EB700B1.3050205@cfl.rr.com>

On Sun 06-11-11 16:48:33, Phillip Susi wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I've read over write_cache_pages() in page-writeback.c, and related
> writepages() functions, and it seems to me that it suffers from a
> performance problem whenever an fsync is done on a file and some of
> its pages have already begun writeback.  The comment in the code says:
> 
>  * If a page is already under I/O, write_cache_pages() skips it, even
>  * if it's dirty.  This is desirable behaviour for memory-cleaning
> writeback,
>  * but it is INCORRECT for data-integrity system calls such as
> fsync().  fsync()
>  * and msync() need to guarantee that all the data which was dirty at
> the time
>  * the call was made get new I/O started against them.  If
> wbc->sync_mode is
>  * WB_SYNC_ALL then we were called for data integrity and we must wait for
>  * existing IO to complete.
> 
> Based on this, I would expect the function to wait for an existing
> write to complete only if the page is also dirty.  Instead, it waits
> for existing page writes to complete regardless of the dirty bit.
  Are you sure? I can see in the code:
                        lock_page(page);
                        if (unlikely(page->mapping != mapping)) {
continue_unlock:
                                unlock_page(page);
                                continue;
                        }
                        if (!PageDirty(page)) {
                                /* someone wrote it for us */
                                goto continue_unlock;
                        }
                        if (PageWriteback(page)) {
                                if (wbc->sync_mode != WB_SYNC_NONE)
                                        wait_on_page_writeback(page);
                                else
                                        goto continue_unlock;
                        }
  So we skip clean pages...

> Additionally, it does each wait serially, so if you are trying to
> fsync 1000 dirty pages, and the first 10 are already being written
> out, the thread will block on each of those 10 pages write completion
> before it begins queuing any new writes.
  Yes, this is correct.

> Instead, shouldn't it go ahead and initiate pagewrite on all pages not
> already being written, and then come back and wait on those that were
> already in flight to complete, then initiate a second write on them if
> they are dirty?
  Well, if you can *demonstrate* with real numbers it has performance benefit
we could do it. But it's not clear there will be any benefit - skipping
pages which need writing can introduce additional seeks to the IO stream
and that is costly - sometimes much more costly than just waiting for IO to
complete...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: Phillip Susi <psusi@cfl.rr.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: write_cache_pages inefficiency
Date: Wed, 9 Nov 2011 17:45:37 +0100	[thread overview]
Message-ID: <20111109164537.GA7495@quack.suse.cz> (raw)
In-Reply-To: <4EB700B1.3050205@cfl.rr.com>

On Sun 06-11-11 16:48:33, Phillip Susi wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I've read over write_cache_pages() in page-writeback.c, and related
> writepages() functions, and it seems to me that it suffers from a
> performance problem whenever an fsync is done on a file and some of
> its pages have already begun writeback.  The comment in the code says:
> 
>  * If a page is already under I/O, write_cache_pages() skips it, even
>  * if it's dirty.  This is desirable behaviour for memory-cleaning
> writeback,
>  * but it is INCORRECT for data-integrity system calls such as
> fsync().  fsync()
>  * and msync() need to guarantee that all the data which was dirty at
> the time
>  * the call was made get new I/O started against them.  If
> wbc->sync_mode is
>  * WB_SYNC_ALL then we were called for data integrity and we must wait for
>  * existing IO to complete.
> 
> Based on this, I would expect the function to wait for an existing
> write to complete only if the page is also dirty.  Instead, it waits
> for existing page writes to complete regardless of the dirty bit.
  Are you sure? I can see in the code:
                        lock_page(page);
                        if (unlikely(page->mapping != mapping)) {
continue_unlock:
                                unlock_page(page);
                                continue;
                        }
                        if (!PageDirty(page)) {
                                /* someone wrote it for us */
                                goto continue_unlock;
                        }
                        if (PageWriteback(page)) {
                                if (wbc->sync_mode != WB_SYNC_NONE)
                                        wait_on_page_writeback(page);
                                else
                                        goto continue_unlock;
                        }
  So we skip clean pages...

> Additionally, it does each wait serially, so if you are trying to
> fsync 1000 dirty pages, and the first 10 are already being written
> out, the thread will block on each of those 10 pages write completion
> before it begins queuing any new writes.
  Yes, this is correct.

> Instead, shouldn't it go ahead and initiate pagewrite on all pages not
> already being written, and then come back and wait on those that were
> already in flight to complete, then initiate a second write on them if
> they are dirty?
  Well, if you can *demonstrate* with real numbers it has performance benefit
we could do it. But it's not clear there will be any benefit - skipping
pages which need writing can introduce additional seeks to the IO stream
and that is costly - sometimes much more costly than just waiting for IO to
complete...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2011-11-09 16:45 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-11-06 21:48 write_cache_pages inefficiency Phillip Susi
2011-11-06 21:48 ` Phillip Susi
2011-11-09 16:45 ` Jan Kara [this message]
2011-11-09 16:45   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20111109164537.GA7495@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=psusi@cfl.rr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.