* [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
@ 2009-01-04 21:52 Theodore Ts'o
2009-01-04 22:23 ` Andrew Morton
0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2009-01-04 21:52 UTC (permalink / raw)
To: Andrew Morton, linux-mm, linux-ext4, Arjan van de Ven; +Cc: Jens Axboe
If wbc.sync_mode is WB_SYNC_ALL, then in the page writeback paths we
will be waiting for the write to complete. So the I/O should be
submitted via submit_bh() with WRITE_SYNC, so that the block layer can
properly prioritize the I/O.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-mm@vger.kernel.org
---
Following up on an e-mail thread started by Arjan two months ago
(subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority), I have
a patch, just sent to linux-ext4@vger.kernel.org, which fixes the jbd2
layer to submit journal writes via submit_bh() with WRITE_SYNC.
Hopefully this might be enough of a priority boost so we don't have to
force a higher I/O priority level via a buffer_head flag. However,
while looking through the code paths, in ordered data mode, we end up
flushing data pages via the page writeback paths on a per-inode basis,
and I noticed that even though we are passing in
wbc.sync_mode=WB_SYNC_ALL, __block_write_full_page() is using
submit_bh(WRITE, bh) instead of submit_bh(WRITE_SYNC, bh).
I'm not completely confident in my understanding of the page writeback
code paths --- does this change make sense?
- Ted
fs/buffer.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 10179cf..392b1b3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1741,7 +1741,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
- submit_bh(WRITE, bh);
+ submit_bh((wbc->sync_mode == WB_SYNC_ALL) ?
+ WRITE_SYNC : WRITE, bh);
nr_underway++;
}
bh = next;
@@ -1795,7 +1796,8 @@ recover:
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
- submit_bh(WRITE, bh);
+ submit_bh((wbc->sync_mode == WB_SYNC_ALL) ?
+ WRITE_SYNC : WRITE, bh);
nr_underway++;
}
bh = next;
--
1.6.0.4.8.g36f27.dirty
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-04 21:52 [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL Theodore Ts'o
@ 2009-01-04 22:23 ` Andrew Morton
2009-01-04 22:43 ` Theodore Tso
0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2009-01-04 22:23 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-mm, linux-ext4, Arjan van de Ven, Jens Axboe
On Sun, 04 Jan 2009 16:52:46 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote:
> If wbc.sync_mode is WBC_SYNC_ALL, then in the page writeback paths we
> will be waiting for the write to complete. So the I/O should be
> submitted via submit_bh() with WRITE_SYNC so the block layer should
> properly prioritize the I/O.
>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: linux-mm@vger.kernel.org
> ---
>
> Following up with an e-mail thread started by Arjan two months ago,
> (subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority), I have
> a patch, just sent to linux-ext4@vger.kernel.org, which fixes the jbd2
> layer to submit journal writes via submit_bh() with WRITE_SYNC.
> Hopefully this might be enough of a priority boost so we don't have to
> force a higher I/O priority level via a buffer_head flag. However,
> while looking through the code paths, in ordered data mode, we end up
> flushing data pages via the page writeback paths on a per-inode basis,
> and I noticed that even though we are passing in
> wbc.sync_mode=WBC_SYNC_ALL, __block_write_full_page() is using
> submit_bh(WRITE, bh) instead of submit_bh(WRITE_SYNC).
But this is all the wrong way to fix the problem, isn't it?
The problem is that at one particular point, the current transaction
blocks callers behind the committing transaction's IO completion.
Did anyone look at fixing that? ISTR concluding that a data copy and
shadow-bh arrangement might be needed.
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-04 22:23 ` Andrew Morton
@ 2009-01-04 22:43 ` Theodore Tso
2009-01-04 23:19 ` Andrew Morton
0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2009-01-04 22:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, linux-ext4, Arjan van de Ven, Jens Axboe
On Sun, Jan 04, 2009 at 02:23:03PM -0800, Andrew Morton wrote:
> > Following up with an e-mail thread started by Arjan two months ago,
> > (subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority), I have
> > a patch, just sent to linux-ext4@vger.kernel.org, which fixes the jbd2
> > layer to submit journal writes via submit_bh() with WRITE_SYNC.
> > Hopefully this might be enough of a priority boost so we don't have to
> > force a higher I/O priority level via a buffer_head flag. However,
> > while looking through the code paths, in ordered data mode, we end up
> > flushing data pages via the page writeback paths on a per-inode basis,
> > and I noticed that even though we are passing in
> > wbc.sync_mode=WBC_SYNC_ALL, __block_write_full_page() is using
> > submit_bh(WRITE, bh) instead of submit_bh(WRITE_SYNC).
>
> But this is all the wrong way to fix the problem, isn't it?
>
> The problem is that at one particular point, the current transaction
> blocks callers behind the committing transaction's IO completion.
>
> Did anyone look at fixing that? ISTR concluding that a data copy and
> shadow-bh arrangement might be needed.
I haven't had time to really drill down into the jbd code yet, and
yes, eventually we probably want to do this. Still, if we are
submitting I/O which we are going to end up waiting on, we really
should submit it with WRITE_SYNC, and this patch should optimize
writes in other situations; for example, if we fsync() a file, we will
also end up calling block_write_full_page(), and so supplying the
WRITE_SYNC hint to the block layer would be a Good Thing.
- Ted
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-04 22:43 ` Theodore Tso
@ 2009-01-04 23:19 ` Andrew Morton
2009-01-05 0:21 ` Theodore Tso
2009-01-05 8:02 ` Jens Axboe
0 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2009-01-04 23:19 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-mm, linux-ext4, Arjan van de Ven, Jens Axboe
On Sun, 4 Jan 2009 17:43:51 -0500 Theodore Tso <tytso@mit.edu> wrote:
> On Sun, Jan 04, 2009 at 02:23:03PM -0800, Andrew Morton wrote:
> > > Following up with an e-mail thread started by Arjan two months ago,
> > > (subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority), I have
> > > a patch, just sent to linux-ext4@vger.kernel.org, which fixes the jbd2
> > > layer to submit journal writes via submit_bh() with WRITE_SYNC.
> > > Hopefully this might be enough of a priority boost so we don't have to
> > > force a higher I/O priority level via a buffer_head flag. However,
> > > while looking through the code paths, in ordered data mode, we end up
> > > flushing data pages via the page writeback paths on a per-inode basis,
> > > and I noticed that even though we are passing in
> > > wbc.sync_mode=WBC_SYNC_ALL, __block_write_full_page() is using
> > > submit_bh(WRITE, bh) instead of submit_bh(WRITE_SYNC).
> >
> > But this is all the wrong way to fix the problem, isn't it?
> >
> > The problem is that at one particular point, the current transaction
> > blocks callers behind the committing transaction's IO completion.
> >
> > Did anyone look at fixing that? ISTR concluding that a data copy and
> > shadow-bh arrangement might be needed.
>
> I haven't had time to really drill down into the jbd code yet, and
> yes, eventually we probably want to do this.
We do.
> Still, if we are
> submitting I/O which we are going to end up waiting on, we really
> should submit it with WRITE_SYNC, and this patch should optimize
> writes in other situations; for example, if we fsync() a file, we will
> also end up calling block_write_full_page(), and so supplying the
> WRITE_SYNC hint to the block layer would be a Good Thing.
Is it? WRITE_SYNC means "unplug the queue after this bh/BIO". By setting
it against every bh, don't we risk the generation of more BIOs and
the loss of merging opportunities?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-04 23:19 ` Andrew Morton
@ 2009-01-05 0:21 ` Theodore Tso
2009-01-05 8:02 ` Jens Axboe
1 sibling, 0 replies; 13+ messages in thread
From: Theodore Tso @ 2009-01-05 0:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, linux-ext4, Arjan van de Ven, Jens Axboe
On Sun, Jan 04, 2009 at 03:19:27PM -0800, Andrew Morton wrote:
> > Still, if we are
> > submitting I/O which we are going to end up waiting on, we really
> > should submit it with WRITE_SYNC, and this patch should optimize
> > writes in other situations; for example, if we fsync() a file, we will
> > also end up calling block_write_full_page(), and so supplying the
> > WRITE_SYNC hint to the block layer would be a Good Thing.
>
> Is it? WRITE_SYNC means "unplug the queue after this bh/BIO". By setting
> it against every bh, don't we risk the generation of more BIOs and
> the loss of merging opportunities?
Good point, yeah, that's a problem. Some of the I/O schedulers also use
REQ_RW_SYNC to prioritize those I/O's above non-sync I/O's. That's an
orthogonal issue to unplugging the queue; it would be useful to be able
to mark an I/O as "this bio is one that we will eventually end up
waiting on to complete", separately from "please unplug the queue
after this bio is submitted".
BTW, I notice that the CFQ I/O scheduler prioritizes REQ_RW_META bio's
behind REQ_RW_SYNC bio's, but ahead of normal bio requests. But as
far as I can tell nothing is actually marking requests REQ_RW_META.
What is the intended use for this, and are there plans to make other
I/O schedulers honor REQ_RW_META?
- Ted
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-04 23:19 ` Andrew Morton
2009-01-05 0:21 ` Theodore Tso
@ 2009-01-05 8:02 ` Jens Axboe
2009-01-05 14:47 ` Theodore Tso
1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2009-01-05 8:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Theodore Tso, linux-mm, linux-ext4, Arjan van de Ven
On Sun, Jan 04 2009, Andrew Morton wrote:
> On Sun, 4 Jan 2009 17:43:51 -0500 Theodore Tso <tytso@mit.edu> wrote:
>
> > On Sun, Jan 04, 2009 at 02:23:03PM -0800, Andrew Morton wrote:
> > > > Following up with an e-mail thread started by Arjan two months ago,
> > > > (subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority), I have
> > > > a patch, just sent to linux-ext4@vger.kernel.org, which fixes the jbd2
> > > > layer to submit journal writes via submit_bh() with WRITE_SYNC.
> > > > Hopefully this might be enough of a priority boost so we don't have to
> > > > force a higher I/O priority level via a buffer_head flag. However,
> > > > while looking through the code paths, in ordered data mode, we end up
> > > > flushing data pages via the page writeback paths on a per-inode basis,
> > > > and I noticed that even though we are passing in
> > > > wbc.sync_mode=WBC_SYNC_ALL, __block_write_full_page() is using
> > > > submit_bh(WRITE, bh) instead of submit_bh(WRITE_SYNC).
> > >
> > > But this is all the wrong way to fix the problem, isn't it?
> > >
> > > The problem is that at one particular point, the current transaction
> > > blocks callers behind the committing transaction's IO completion.
> > >
> > > Did anyone look at fixing that? ISTR concluding that a data copy and
> > > shadow-bh arrangement might be needed.
> >
> > I haven't had time to really drill down into the jbd code yet, and
> > yes, eventually we probably want to do this.
>
> We do.
>
> > Still, if we are
> > submitting I/O which we are going to end up waiting on, we really
> > should submit it with WRITE_SYNC, and this patch should optimize
> > writes in other situations; for example, if we fsync() a file, we will
> > also end up calling block_write_full_page(), and so supplying the
> > WRITE_SYNC hint to the block layer would be a Good Thing.
>
> Is it? WRITE_SYNC means "unplug the queue after this bh/BIO". By setting
> it against every bh, don't we risk the generation of more BIOs and
> the loss of merging opportunities?
But it also implies that the io scheduler will treat the IO as sync even
if it is a write, which seems to be the very effect that Ted is looking
for as well.
Perhaps we should separate it into two behavioural flags instead and
make the unplugging explicit.
--
Jens Axboe
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-05 8:02 ` Jens Axboe
@ 2009-01-05 14:47 ` Theodore Tso
2009-01-05 15:58 ` Jens Axboe
2009-01-05 18:47 ` Andrew Morton
0 siblings, 2 replies; 13+ messages in thread
From: Theodore Tso @ 2009-01-05 14:47 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, linux-ext4, Arjan van de Ven
[ I've removed linux-mm from the cc list since we seem to have
wandered away from any details of page writeback. -- Ted]
On Mon, Jan 05, 2009 at 09:02:43AM +0100, Jens Axboe wrote:
> > Is it? WRITE_SYNC means "unplug the queue after this bh/BIO". By setting
> > it against every bh, don't we risk the generation of more BIOs and
> > the loss of merging opportunities?
>
> But it also implies that the io scheduler will treat the IO as sync even
> if it is a write, which seems to be the very effect that Ted is looking
> for as well.
Yeah, but I suspect the downsides caused by the lack of merging will
far outweigh the wins from giving hints to the I/O scheduler.
Separating things into two behavioural flags sounds like the right
thing to me. Until that happens, I've dropped my proposed patch and
substituted a patch which allows the user to diddle the I/O priorities
of kjournald2 via a mount option so we can experiment a bit.
I agree that Andrew's long-term solution is probably the right one,
but there will be times (e.g., when we are doing a checkpoint as
opposed to a commit, or in an fsync-heavy workload) when we will end
up getting blocked behind kjournald, so upping the I/O priority really
does make sense. So a tunable mount option seems to make sense even
in the long run, since for some workloads we'll want to adjust
kjournald's I/O priority even after we stop normal (non-fsync) I/O
from blocking against the commit operation done by kjournald.
Jens, one question.... I came across the following in blkdev.h:
__REQ_RW_SYNC, /* request is sync (O_DIRECT) */
Is the comment about O_DIRECT still relevant? I'm not sure it is.
Also, there's another confusing comment in bio.h:
#define BIO_RW 0 /* Must match RW in req flags (blkdev.h) */
#define BIO_RW_AHEAD 1 /* Must match FAILFAST in req flags */
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
#define BIO_RW_FAILFAST_DEV 6
#define BIO_RW_FAILFAST_TRANSPORT 7
#define BIO_RW_FAILFAST_DRIVER 8
In fact, submit_bh() depends on BIO_RW_AHEAD matching with the
definition of READA in fs.h. I'm a bit confused about the fact that
we have both BIO_RW_AHEAD and BIO_RW_FAILFAST_DEV, and then in the req
flags in blkdev.h:
/*
* request type modified bits. first two bits match BIO_RW* bits, important
*/
enum rq_flag_bits {
__REQ_RW, /* not set, read. set, write */
__REQ_FAILFAST_DEV, /* no driver retries of device errors */
__REQ_FAILFAST_TRANSPORT, /* no driver retries of transport errors */
__REQ_FAILFAST_DRIVER, /* no driver retries of driver errors */
I assume that when doing readahead, we don't want to retry in the face
of device errors, which is why it's desirable for __REQ_FAILFAST_DEV
to overlay BIO_RW_AHEAD. But if that's the case, why are
BIO_RW_READA and BIO_RW_FAILFAST_DEV not mapped to the same BIO_RW_*
flag?
Am I missing something here? As far as I can tell nothing seems to be
using BIO_RW_FAILFAST_DEV. Is there a reason why it's not just
#define'd to be the same value as BIO_RW_READA?
Thanks, regards,
- Ted
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-05 14:47 ` Theodore Tso
@ 2009-01-05 15:58 ` Jens Axboe
2009-01-05 18:47 ` Andrew Morton
1 sibling, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2009-01-05 15:58 UTC (permalink / raw)
To: Theodore Tso; +Cc: Andrew Morton, linux-ext4, Arjan van de Ven
On Mon, Jan 05 2009, Theodore Tso wrote:
> [ I've removed linux-mm from the cc list since we seem to have
> wandered away from any details of page writeback. -- Ted]
>
> On Mon, Jan 05, 2009 at 09:02:43AM +0100, Jens Axboe wrote:
> > > Is it? WRITE_SYNC means "unplug the queue after this bh/BIO". By setting
> > > it against every bh, don't we risk the generation of more BIOs and
> > > the loss of merging opportunities?
> >
> > But it also implies that the io scheduler will treat the IO as sync even
> > if it is a write, which seems to be the very effect that Ted is looking
> > for as well.
>
> Yeah, but I suspect the downsides caused by the lack of merging will
> far outweigh wins from giving the hints to the I/O scheduler.
> Separating things into two behavioural flags sounds like the right
> thing to me. Until that happens, I've dropped my proposed patch and
> substituted a patch which allows the user to diddle the I/O priorities
> of kjournald2 via a mount option so we can experiment a bit.
Not necessarily. It's really a latency vs throughput situation. If you
have async IO going on at the same time, writes marked sync will have a
MUCH more deterministic completion time.
> I agree that Andrew's long-term solution is probably the right one,
> but there will be times (i.e., when we are doing a checkpoint as
> opposed to a commit, or in a fsync-heavy workload), where we will end
> up getting blocked behind kjournald, so upping the I/O priority really
> does make sense. So a tunable mount option seems to make sense even
> in the long run, since for some workloads we'll want to adjust
> kjournald's I/O priority even after we stop normal (non-fsync) I/O
> from blocking against the commit operation done by kjournald.
I think you'll find that sync writes are still preferable to higher
priority async ones. You may still want to bump the priority, but I'd
make sure to get them properly marked as sync as well.
> Jens, one question.... I came across the folllowing in blkdev.h:
>
> __REQ_RW_SYNC, /* request is sync (O_DIRECT) */
>
> Is the comment about O_DIRECT still relevant? I'm not sure it is.
It is, and an O_DIRECT write is just a normal write with the sync bit
set as well. So if you use WRITE_SYNC, it's going to be exactly the
same as an O_DIRECT write from the block layer point of view.
> Also, there's another confusing comment in bio.h:
>
> #define BIO_RW 0 /* Must match RW in req flags (blkdev.h) */
> #define BIO_RW_AHEAD 1 /* Must match FAILFAST in req flags */
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ...
> #define BIO_RW_FAILFAST_DEV 6
> #define BIO_RW_FAILFAST_TRANSPORT 7
> #define BIO_RW_FAILFAST_DRIVER 8
>
>
> In fact, submit_bh() depends on BIO_RW_AHEAD matching with the
> definition of READA in fs.h. I'm a bit confused about the fact that
> we have both BIO_RW_AHEAD and BIO_RW_FAILFAST_DEV, and then in the req
> flags in blkdev.h:
>
> /*
> * request type modified bits. first two bits match BIO_RW* bits, important
> */
> enum rq_flag_bits {
> __REQ_RW, /* not set, read. set, write */
> __REQ_FAILFAST_DEV, /* no driver retries of device errors */
> __REQ_FAILFAST_TRANSPORT, /* no driver retries of transport errors */
> __REQ_FAILFAST_DRIVER, /* no driver retries of driver errors */
>
> I assume that when doing readahead, we don't want to retry in the face
> of device errors, which is why it's desirable for __REQ_FAILFAST_DEV
> to overlay BIO_RW_AHEAD. But if that's the case, why are
> BIO_RW_READA and BIO_RW_FAILFAST_DEV not mapped to the same BIO_RW_*
> flag?
>
> Am I missing something here? As far as I can tell nothing seems to be
> using BIO_RW_FAILFAST_DEV. Is there a reason why it's not just
> #define'd to be the same value as BIO_RW_READA?
Stale comment. They used to match; even though they had different
meanings, the behaviour would be the same. Now we have a more
fine-grained setup wrt retries and failing due to multipathing.
--
Jens Axboe
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-05 14:47 ` Theodore Tso
2009-01-05 15:58 ` Jens Axboe
@ 2009-01-05 18:47 ` Andrew Morton
2009-01-05 19:35 ` Theodore Tso
1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2009-01-05 18:47 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jens Axboe, linux-ext4, Arjan van de Ven
On Mon, 5 Jan 2009 09:47:40 -0500 Theodore Tso <tytso@mit.edu> wrote:
> there will be times (i.e., when we are doing a checkpoint as
> opposed to a commit, or in a fsync-heavy workload), where we will end
> up getting blocked behind kjournald, so upping the I/O priority really
> does make sense.
Not if it will cause kjournald writes to be prioritised ahead of any
reads, I suspect. Writes are rarely synchronous, but with reads,
there's almost always someone waiting.
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-05 18:47 ` Andrew Morton
@ 2009-01-05 19:35 ` Theodore Tso
2009-01-05 19:38 ` Jens Axboe
0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2009-01-05 19:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jens Axboe, linux-ext4, Arjan van de Ven
On Mon, Jan 05, 2009 at 10:47:13AM -0800, Andrew Morton wrote:
> On Mon, 5 Jan 2009 09:47:40 -0500 Theodore Tso <tytso@mit.edu> wrote:
>
> > there will be times (i.e., when we are doing a checkpoint as
> > opposed to a commit, or in a fsync-heavy workload), where we will end
> > up getting blocked behind kjournald, so upping the I/O priority really
> > does make sense.
>
> Not if it will cause kjournald writes to be prioritised ahead of any
> reads, I suspect. Writes are rarely synchronous, but with reads,
> there's almost always someone waiting.
If there is an fsync() in progress, someone's waiting; ditto in a
checkpoint situation, although arguably those are much rarer.
However, I can easily see situations such as a mail server where
having a slightly increased priority is a good thing.
So long-term, I suspect the heuristic which makes sense is that in the
case where there is an fsync() in progress, any writes which take
place as a result of that fsync (which includes the journal records as
well as ordered writes that are being forced out as a result of
data=ordered and which block the fsync from returning), should get a
hint which propagates down to the block layer that these writes *are*
synchronous in that someone is waiting for them to complete. They
shouldn't necessarily be prioritized ahead of other reads (unless they
are readahead operations that couldn't be combined with reads that
*are* synchronous that someone is waiting for completion), but they
should be prioritized ahead of asynchronous writes.
- Ted
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-05 19:35 ` Theodore Tso
@ 2009-01-05 19:38 ` Jens Axboe
2009-01-05 21:16 ` Theodore Tso
0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2009-01-05 19:38 UTC (permalink / raw)
To: Theodore Tso; +Cc: Andrew Morton, linux-ext4, Arjan van de Ven
On Mon, Jan 05 2009, Theodore Tso wrote:
> So long-term, I suspect the heuristic which makes sense is that in the
> case where there is an fsync() in progress, any writes which take
> place as a result of that fsync (which includes the journal records as
> well as ordered writes that are being forced out as a result of
> data=ordered and which block the fsync from returning), should get a
> hint which propagates down to the block layer that these writes *are*
> synchronous in that someone is waiting for them to complete. They
If someone is waiting for them, they are by definition sync!
> shouldn't necessarily be prioritized ahead of other reads (unless they
> are readahead operations that couldn't be combined with reads that
> *are* synchronous that someone is waiting for completion), but they
> should be prioritized ahead of asynchronous writes.
And that is *exactly* what flagging the write as sync will do...
--
Jens Axboe
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-05 19:38 ` Jens Axboe
@ 2009-01-05 21:16 ` Theodore Tso
2009-01-06 7:34 ` Jens Axboe
0 siblings, 1 reply; 13+ messages in thread
From: Theodore Tso @ 2009-01-05 21:16 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andrew Morton, linux-ext4, Arjan van de Ven
On Mon, Jan 05, 2009 at 08:38:20PM +0100, Jens Axboe wrote:
> On Mon, Jan 05 2009, Theodore Tso wrote:
> > So long-term, I suspect the heuristic which makes sense is that in the
> > case where there is an fsync() in progress, any writes which take
> > place as a result of that fsync (which includes the journal records as
> > well as ordered writes that are being forced out as a result of
> > data=ordered and which block the fsync from returning), should get a
> > hint which propagates down to the block layer that these writes *are*
> > synchronous in that someone is waiting for them to complete. They
>
> If someone is waiting for them, they are by definition sync!
Surely. :-)
Andrew's argument is that someone *shouldn't* be waiting for them ---
and he's right, although in the case of fsync() in particular, there's
nothing we can do; there will be a userspace application waiting by
definition.
The bigger problem right now is that, until we split up the meaning of
"unplug the I/O queue" from "mark the I/O as synchronous", the way
data=ordered mode works is that all of the data blocks get pushed
out in 4k chunks. So in the worst case, if the user has just written
some 200 megabytes of vmlinuz and kernel modules, and then calls
fsync(), the block I/O layer might get flooded with some 50,000+ 4k
writes, and if they are all BIO_RW_SYNC, they might not get coalesced
properly, and the result would be badness. One could argue that the
journal layer should do a better job of coalescing the write
requests, but historically the block layer has done this for us, so
why add duplicate functionality at the journalling layer?
In any case, that's why I'm really not convinced we can afford to use
BIO_RW_SYNC until we separate out the queue unplug functionality.
Maybe what makes sense is to have two flags, BIO_RW_UNPLUG and
BIO_RW_SYNCIO, and then make BIO_RW_SYNC be defined to be
(BIO_RW_UNPLUG|BIO_RW_SYNCIO)?
> > shouldn't necessarily be prioritized ahead of other reads (unless they
> > are readahead operations that couldn't be combined with reads that
> > *are* synchronous that someone is waiting for completion), but they
> > should be prioritized ahead of asynchronous writes.
>
> And that is *exactly* what flagging the write as sync will do...
Great, so once we separate out the queue unplug request, I think this
should be exactly what we need.
- Ted
* Re: [PATCH, RFC] Use WRITE_SYNC in __block_write_full_page() if WBC_SYNC_ALL
2009-01-05 21:16 ` Theodore Tso
@ 2009-01-06 7:34 ` Jens Axboe
0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2009-01-06 7:34 UTC (permalink / raw)
To: Theodore Tso; +Cc: Andrew Morton, linux-ext4, Arjan van de Ven
On Mon, Jan 05 2009, Theodore Tso wrote:
> On Mon, Jan 05, 2009 at 08:38:20PM +0100, Jens Axboe wrote:
> > On Mon, Jan 05 2009, Theodore Tso wrote:
> > > So long-term, I suspect the heuristic which makes sense is that in the
> > > case where there is an fsync() in progress, any writes which take
> > > place as a result of that fsync (which includes the journal records as
> > > well as ordered writes that are being forced out as a result of
> > > data=ordered and which block the fsync from returning), should get a
> > > hint which propagates down to the block layer that these writes *are*
> > > synchronous in that someone is waiting for them to complete. They
> >
> > If someone is waiting for them, they are by definition sync!
>
> Surely. :-)
>
> Andrew's argument is that someone *shouldn't* be waiting for them ---
> and he's right, although in the case of fsync() in particular, there's
> nothing we can do; there will be a userspace application waiting by
> definition.
>
> The bigger problem right now is until we split up the meaning of
> "unplug the I/O queue" with "mark the I/O as synchronous", right now
> the way data ordered mode works is all of the data blocks get pushed
> out in 4k chunks. So in the worst case, if the user has just written
> some 200 megabytes of vmlinuz and kernel modules, and then calls
> fsync(), the block I/O layer might get flooded with some 50,000+ 4k
> writes, and if they are all BIO_RW_SYNC, they might not get coalesced
> properly, and the result would be badness. One could argue that
The flag doesn't mean "don't merge"; it merely starts device queuing
right away instead of waiting for the unplug. So there will be merging
going on, especially for cases like the above where you push lots and
lots of IO. For that particular case, there should be essentially zero
difference in performance, just the few initial ios may be smaller. So
usually it's not going to be a big difference in IO behaviour.
So I'd be more worried about the case of smallish files and lots of
fsync(), since that'll be a lot more randomized writes.
> the journal layer should do a better job of coalescing the write
> requests, but historically the block layer has done this for us, so
> why add duplicate functionality at the journalling layer?
I agree, we've always done this coalescing in the block layer, so no
point in changing that now.
> In any case, that's why I'm really not convinced we can afford to use
> BIO_RW_SYNC until we separate out the queue unplug functionality.
> Maybe what makes sense is to have two flags, BIO_RW_UNPLUG and
> BIO_RW_SYNCIO, and then make BIO_RW_SYNC be defined to be
> (BIO_RW_UNPLUG|BIO_RW_SYNCIO)?
Yep, that's exactly what I'll do!
> > > shouldn't necessarily be prioritized ahead of other reads (unless they
> > > are readahead operations that couldn't be combined with reads that
> > > *are* synchronous that someone is waiting for completion), but they
> > > should be prioritized ahead of asynchronous writes.
> >
> > And that is *exactly* what flagging the write as sync will do...
>
> Great, so once we separate out the queue unplug request, I think this
> should be exactly what we need.
Should fit nicely then.
--
Jens Axboe