From: Dave Chinner <david@fromorbit.com>
To: Zhi Yong Wu <zwu.kernel@gmail.com>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>,
linux-kernel mlist <linux-kernel@vger.kernel.org>,
xfstests <xfs@oss.sgi.com>
Subject: Re: [PATCH] xfs: introduce object readahead to log recovery
Date: Fri, 26 Jul 2013 21:35:22 +1000
Message-ID: <20130726113521.GM13468@dastard>
In-Reply-To: <CAEH94Lh-UCCEs7hQi_t5v+X+ER1DH9dCtjr6e9GVNX5KJ-f1hQ@mail.gmail.com>
On Fri, Jul 26, 2013 at 02:36:15PM +0800, Zhi Yong Wu wrote:
> Dave,
>
> All comments are good to me, and will be applied to next version, thanks a lot.
>
> On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.kernel@gmail.com wrote:
> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> >>
> >> It can take a long time to run log recovery operation because it is
> >> single threaded and is bound by read latency. We can find that it took
> >> most of the time to wait for the read IO to occur, so if one object
> >> readahead is introduced to log recovery, it will obviously reduce the
> >> log recovery time.
> >>
> >> In dirty log case as below:
> >> data device: 0xfd10
> >> log device: 0xfd10 daddr: 20480032 length: 20480
> >>
> >> log tail: 7941 head: 11077 state: <DIRTY>
> >
> > That's only a small log (10MB). As I've said on irc, readahead won't
> Yeah, it is one 10MB log, but how do you calculate it based on the above info?
length = 20480 blocks. 20480 * 512 = 10MB....
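
The same arithmetic as a small helper, assuming (as in the calculation above) that the reported log length is counted in 512-byte blocks. The function and macro names are made up for illustration and are not taken from the XFS code.

```c
#include <assert.h>
#include <stdint.h>

/* Block size in bytes used by the calculation above (assumed unit of "length"). */
#define LOG_BLOCK_SIZE 512ULL

/* Convert a log length given in 512-byte blocks to bytes. */
static uint64_t log_length_to_bytes(uint64_t length_blocks)
{
	/* Refuse inputs that would overflow a 64-bit byte count. */
	assert(length_blocks <= UINT64_MAX / LOG_BLOCK_SIZE);
	return length_blocks * LOG_BLOCK_SIZE;
}

/* Convert a log length in 512-byte blocks to whole MiB, rounded down. */
static uint64_t log_length_to_mib(uint64_t length_blocks)
{
	return log_length_to_bytes(length_blocks) / (1024ULL * 1024ULL);
}
```

With length: 20480 this gives 10 MiB, and with length: 4173824 it gives 2038 MiB, which matches the "almost 2GB" annotation on the larger log.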
> > And the recovery time from this is between 15-17s:
> >
> > ....
> > log device: 0xfd20 daddr: 107374182032 length: 4173824
> > ^^^^^^^ almost 2GB
> > log tail: 19288 head: 264809 state: <DIRTY>
> > ....
> > real 0m17.913s
> > user 0m0.000s
> > sys 0m2.381s
> >
> > And runs at 3-4000 read IOPs for most of that time. It's largely IO
> > bound, even on SSDs.
> >
> > With your patch:
> >
> > log tail: 35871 head: 308393 state: <DIRTY>
> > real 0m12.715s
> > user 0m0.000s
> > sys 0m2.247s
> >
> > And it peaked at ~5000 read IOPS.
> How do you know its READ IOPS is ~5000?
Other monitoring. iostat can tell you this, though I use PCP...
> > Ok, so you've based the readahead on the transaction item list
> > having a next pointer. What I think you should do is turn this into
> > a readahead queue by moving objects to a new list. i.e.
> >
> > 	list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) {
> >
> > 		case XLOG_RECOVER_PASS2:
> > 			if (ra_qdepth++ >= MAX_QDEPTH) {
> > 				recover_items(log, trans, &buffer_list, &ra_item_list);
> > 				ra_qdepth = 0;
> > 			} else {
> > 				xlog_recover_item_readahead(log, item);
> > 				list_move_tail(&item->ri_list, &ra_item_list);
> > 			}
> > 			break;
> > 		...
> > 		}
> > 	}
> > 	if (!list_empty(&ra_item_list))
> > 		recover_items(log, trans, &buffer_list, &ra_item_list);
> >
> > I'd suggest that a queue depth somewhere between 10 and 100 will
> > be necessary to keep enough IO in flight to keep the pipeline full
> > and prevent recovery from having to wait on IO...
> Good suggestion, will apply it to next version, thanks.
FWIW, I hacked a quick test of this into your patch here and a depth
of 100 brought the recovery time down to under 8s. For other
workloads which have nothing but dirty inodes (like fsmark) a depth
of 100 drops the recovery time from ~100s to ~25s, and the iop rate
is peaking at well over 15,000 IOPS. So we definitely want to queue
up more than a single readahead...
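
For illustration only, here is a minimal user-space sketch of the readahead queue idea. It is not the XFS implementation: the item structure, the queue helpers, and the readahead/recover stand-ins are all hypothetical, and MAX_QDEPTH is set to 4 only to keep the example small (the suggestion above is somewhere between 10 and 100). One deliberate difference from the quoted snippet: each item has readahead issued and is queued before the depth check, because in the snippet as written the item that triggers the flush appears to be neither read ahead nor moved to the queue.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QDEPTH 4	/* illustrative value only */

/* Hypothetical stand-in for a recovery item. */
struct item {
	struct item *next;
	int ra_issued;	/* readahead was issued for this item */
	int recovered;	/* item has been recovered */
	int seq;	/* position in the overall recovery order */
};

/* Simple FIFO of items with readahead in flight. */
struct queue {
	struct item *head;
	struct item **tail;
	int depth;
};

static void queue_init(struct queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
	q->depth = 0;
}

static void queue_add_tail(struct queue *q, struct item *it)
{
	it->next = NULL;
	*q->tail = it;
	q->tail = &it->next;
	q->depth++;
}

/* Stand-in for issuing asynchronous readahead on the item's object. */
static void item_readahead(struct item *it)
{
	it->ra_issued = 1;
}

/* Stand-in for recover_items(): recover queued items in FIFO order. */
static int recover_queued(struct queue *q, int next_seq)
{
	struct item *it = q->head;

	while (it) {
		struct item *n = it->next;

		assert(it->ra_issued);	/* readahead always precedes recovery */
		it->recovered = 1;
		it->seq = next_seq++;
		it = n;
	}
	queue_init(q);
	return next_seq;
}

/* Walk all items, keeping up to MAX_QDEPTH readaheads in flight. */
static int recover_all(struct item *items, size_t nr)
{
	struct queue ra_q;
	int seq = 0;
	size_t i;

	queue_init(&ra_q);
	for (i = 0; i < nr; i++) {
		item_readahead(&items[i]);
		queue_add_tail(&ra_q, &items[i]);
		if (ra_q.depth >= MAX_QDEPTH)
			seq = recover_queued(&ra_q, seq);
	}
	/* Recover whatever is left in a partially filled queue. */
	if (ra_q.head)
		seq = recover_queued(&ra_q, seq);
	return seq;	/* number of items recovered */
}
```

Calling recover_all() on an array of zero-initialised items recovers them in their original order, in batches of at most MAX_QDEPTH, with readahead issued for a whole batch before any item in it is recovered.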
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com