From: NeilBrown <neilb@suse.de>
To: Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org
Subject: Re: [patch 1/2]MD: raid5 trim support
Date: Tue, 25 Sep 2012 17:00:28 +1000 [thread overview]
Message-ID: <20120925170028.0748a304@notabene.brown> (raw)
In-Reply-To: <20120920103138.GA25389@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 7057 bytes --]
On Thu, 20 Sep 2012 18:31:38 +0800 Shaohua Li <shli@kernel.org> wrote:
> On Thu, Sep 20, 2012 at 12:25:41PM +0800, Shaohua Li wrote:
> > On Thu, Sep 20, 2012 at 01:59:14PM +1000, NeilBrown wrote:
> > > On Thu, 20 Sep 2012 10:27:17 +0800 Shaohua Li <shli@kernel.org> wrote:
> > >
> > > in which wrong sectors were trimmed....
> > > >
> > > > Ok, just confirmed, delete raid5_compute_sector is ok if I adjust
> > > > logical_sector calculation. Here is the new patch.
> > > >
> > >
> > > Thanks. That looks better. I've applied it with some minor formatting
> > > changes.
> > >
> > > I then went to look at the follow-up page and .....
> > > I count 11 separate places where you test the new flag and possibly memset a
> > > page to zero. This doesn't seem like an improvement to me.
> >
> > We do the zero page just before the stripe is hit in cache, which is rare case.
> >
> > > Why don't we just mark the page as not up-to-date when we discard it? That
> > > would avoid storing inconsistent data, and would avoid needing to zero pages.
> >
> > We need re-read the strip if it's hit in cache, but it's rare case, we don't
> > care. So when we clear the up-to-date flag? I saw a lot of places checking
> > up-to-date flag in the write path. Need close look to check if there is race.
>
> Alright, appears ok to use the uptodate flag. Here is the new patch.
>
>
> Subject: MD: raid5 avoid unnecessary zero page for trim
>
> We want to avoid zero discarded dev page, because it's useless for discard.
> But if we don't zero it, another read/write hit such page in the cache and will
> get inconsistent data.
>
> To avoid zero the page, we don't set R5_UPTODATE flag after construction is
> done. In this way, discard write request is still issued and finished, but read
> will not hit the page. If the stripe gets accessed soon, we need reread the
> stripe, but since the chance is low, the reread isn't a big deal.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
Thanks. It didn't turn out quite as clean as I hoped, but I suspect it is
the best we will get.
applied, thanks.
NeilBrown
> ---
> drivers/md/raid5.c | 35 +++++++++++++++++------------------
> 1 file changed, 17 insertions(+), 18 deletions(-)
>
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c 2012-09-20 18:10:38.546836309 +0800
> +++ linux/drivers/md/raid5.c 2012-09-20 18:17:50.849399945 +0800
> @@ -547,7 +547,7 @@ static void ops_run_io(struct stripe_hea
> rw = WRITE_FUA;
> else
> rw = WRITE;
> - if (test_and_clear_bit(R5_Discard, &sh->dev[i].flags))
> + if (test_bit(R5_Discard, &sh->dev[i].flags))
> rw |= REQ_DISCARD;
> } else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
> rw = READ;
> @@ -1172,11 +1172,9 @@ ops_run_biodrain(struct stripe_head *sh,
> set_bit(R5_WantFUA, &dev->flags);
> if (wbi->bi_rw & REQ_SYNC)
> set_bit(R5_SyncIO, &dev->flags);
> - if (wbi->bi_rw & REQ_DISCARD) {
> - memset(page_address(dev->page), 0,
> - STRIPE_SECTORS << 9);
> + if (wbi->bi_rw & REQ_DISCARD)
> set_bit(R5_Discard, &dev->flags);
> - } else
> + else
> tx = async_copy_data(1, wbi, dev->page,
> dev->sector, tx);
> wbi = r5_next_bio(wbi, dev->sector);
> @@ -1194,7 +1192,7 @@ static void ops_complete_reconstruct(voi
> int pd_idx = sh->pd_idx;
> int qd_idx = sh->qd_idx;
> int i;
> - bool fua = false, sync = false;
> + bool fua = false, sync = false, discard = false;
>
> pr_debug("%s: stripe %llu\n", __func__,
> (unsigned long long)sh->sector);
> @@ -1202,13 +1200,15 @@ static void ops_complete_reconstruct(voi
> for (i = disks; i--; ) {
> fua |= test_bit(R5_WantFUA, &sh->dev[i].flags);
> sync |= test_bit(R5_SyncIO, &sh->dev[i].flags);
> + discard |= test_bit(R5_Discard, &sh->dev[i].flags);
> }
>
> for (i = disks; i--; ) {
> struct r5dev *dev = &sh->dev[i];
>
> if (dev->written || i == pd_idx || i == qd_idx) {
> - set_bit(R5_UPTODATE, &dev->flags);
> + if (!discard)
> + set_bit(R5_UPTODATE, &dev->flags);
> if (fua)
> set_bit(R5_WantFUA, &dev->flags);
> if (sync)
> @@ -1252,8 +1252,6 @@ ops_run_reconstruct5(struct stripe_head
> }
> if (i >= sh->disks) {
> atomic_inc(&sh->count);
> - memset(page_address(sh->dev[pd_idx].page), 0,
> - STRIPE_SECTORS << 9);
> set_bit(R5_Discard, &sh->dev[pd_idx].flags);
> ops_complete_reconstruct(sh);
> return;
> @@ -1314,10 +1312,6 @@ ops_run_reconstruct6(struct stripe_head
> }
> if (i >= sh->disks) {
> atomic_inc(&sh->count);
> - memset(page_address(sh->dev[sh->pd_idx].page), 0,
> - STRIPE_SECTORS << 9);
> - memset(page_address(sh->dev[sh->qd_idx].page), 0,
> - STRIPE_SECTORS << 9);
> set_bit(R5_Discard, &sh->dev[sh->pd_idx].flags);
> set_bit(R5_Discard, &sh->dev[sh->qd_idx].flags);
> ops_complete_reconstruct(sh);
> @@ -2775,7 +2769,8 @@ static void handle_stripe_clean_event(st
> if (sh->dev[i].written) {
> dev = &sh->dev[i];
> if (!test_bit(R5_LOCKED, &dev->flags) &&
> - test_bit(R5_UPTODATE, &dev->flags)) {
> + (test_bit(R5_UPTODATE, &dev->flags) ||
> + test_and_clear_bit(R5_Discard, &dev->flags))) {
> /* We can return any write requests */
> struct bio *wbi, *wbi2;
> pr_debug("Return write for disc %d\n", i);
> @@ -3493,10 +3488,12 @@ static void handle_stripe(struct stripe_
> if (s.written &&
> (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
> && !test_bit(R5_LOCKED, &pdev->flags)
> - && test_bit(R5_UPTODATE, &pdev->flags)))) &&
> + && (test_bit(R5_UPTODATE, &pdev->flags) ||
> + test_bit(R5_Discard, &pdev->flags))))) &&
> (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> && !test_bit(R5_LOCKED, &qdev->flags)
> - && test_bit(R5_UPTODATE, &qdev->flags)))))
> + && (test_bit(R5_UPTODATE, &qdev->flags) ||
> + test_bit(R5_Discard, &qdev->flags))))))
> handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>
> /* Now we might consider reading some blocks, either to check/generate
> @@ -3523,9 +3520,11 @@ static void handle_stripe(struct stripe_
> /* All the 'written' buffers and the parity block are ready to
> * be written back to disk
> */
> - BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
> + BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags) &&
> + !test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags));
> BUG_ON(sh->qd_idx >= 0 &&
> - !test_bit(R5_UPTODATE, &sh->dev[sh->qd_idx].flags));
> + !test_bit(R5_UPTODATE, &sh->dev[sh->qd_idx].flags) &&
> + !test_bit(R5_Discard, &sh->dev[sh->qd_idx].flags));
> for (i = disks; i--; ) {
> struct r5dev *dev = &sh->dev[i];
> if (test_bit(R5_LOCKED, &dev->flags) &&
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
prev parent reply other threads:[~2012-09-25 7:00 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-09-18 8:25 [patch 1/2]MD: raid5 trim support Shaohua Li
2012-09-20 1:15 ` NeilBrown
2012-09-20 1:36 ` Shaohua Li
2012-09-20 1:47 ` NeilBrown
2012-09-20 2:06 ` Shaohua Li
2012-09-20 2:27 ` Shaohua Li
2012-09-20 3:59 ` NeilBrown
2012-09-20 4:25 ` Shaohua Li
2012-09-20 10:31 ` Shaohua Li
2012-09-25 7:00 ` NeilBrown [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120925170028.0748a304@notabene.brown \
--to=neilb@suse.de \
--cc=linux-raid@vger.kernel.org \
--cc=shli@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).