linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: NeilBrown <neilb@suse.de>
To: Shaohua Li <shli@kernel.org>
Cc: fengguang.wu@intel.com, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [raid5] kernel BUG at drivers/md/raid5.c:701!
Date: Wed, 13 Nov 2013 13:45:32 +1100	[thread overview]
Message-ID: <20131113134532.107de176@notabene.brown> (raw)
In-Reply-To: <20131113002851.GA8051@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 7091 bytes --]

On Wed, 13 Nov 2013 08:28:51 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Tue, Nov 12, 2013 at 11:55:56AM +1100, NeilBrown wrote:
> > On Mon, 11 Nov 2013 22:47:57 +0800 fengguang.wu@intel.com wrote:
> > 
> > > 28cc2127527dcba2a0817afa8fd5a33c9e023090 is the first bad commit
> > > commit 28cc2127527dcba2a0817afa8fd5a33c9e023090
> > > Author: Shaohua Li <shli@kernel.org>
> > > Date:   Tue Sep 10 15:37:56 2013 +0800
> > > 
> > >     raid5: relieve lock contention in get_active_stripe()
> > >     
> > >     get_active_stripe() is the last place we have lock contention. It has two
> > >     paths. One is stripe isn't found and new stripe is allocated, the other is
> > >     stripe is found.
> > >     
> > >     The first path basically calls __find_stripe and init_stripe. It accesses
> > >     conf->generation, conf->previous_raid_disks, conf->raid_disks,
> > >     conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
> > >     conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
> > >     stripe_hashtbl and inactive_list, other fields are changed very rarely.
> > >     
> > >     With this patch, we split inactive_list and add new hash locks. Each free
> > >     stripe belongs to a specific inactive list. Which inactive list is determined
> > >     by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
> > >     lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
> > >     is determined by it's lock_hash too. The lock_hash is derivied from current
> > >     stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
> > >     to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
> > >     list too. The goal of the new hash locks introduced is we can only use the new
> > >     locks in the first path of get_active_stripe(). Since we have several hash
> > >     locks, lock contention is relieved significantly.
> > >     
> > >     The first path of get_active_stripe() accesses other fields, since they are
> > >     changed rarely, changing them now need take conf->device_lock and all hash
> > >     locks. For a slow path, this isn't a problem.
> > >     
> > >     If we need lock device_lock and hash lock, we always lock hash lock first. The
> > >     tricky part is release_stripe and friends. We need take device_lock first.
> > >     Neil's suggestion is we put inactive stripes to a temporary list and readd it
> > >     to inactive_list after device_lock is released. In this way, we add stripes to
> > >     temporary list with device_lock hold and remove stripes from the list with hash
> > >     lock hold. So we don't allow concurrent access to the temporary list, which
> > >     means we need allocate temporary list for all participants of release_stripe.
> > >     
> > >     One downside is free stripes are maintained in their inactive list, they can't
> > >     across between the lists. By default, we have total 256 stripes and 8 lists, so
> > >     each list will have 32 stripes. It's possible one list has free stripe but
> > >     other list hasn't. The chance should be rare because stripes allocation are
> > >     even distributed. And we can always allocate more stripes for cache, several
> > >     mega bytes memory isn't a big deal.
> > >     
> > >     This completely removes the lock contention of the first path of
> > >     get_active_stripe(). It slows down the second code path a little bit though
> > >     because we now need takes two locks, but since the hash lock isn't contended,
> > >     the overhead should be quite small (several atomic instructions). The second
> > >     path of get_active_stripe() (basically sequential write or big request size
> > >     randwrite) still has lock contentions.
> > >     
> > >     Signed-off-by: Shaohua Li <shli@fusionio.com>
> > >     Signed-off-by: NeilBrown <neilb@suse.de>
> > > 
> > > :040000 040000 84ab47136c389751c7c08ded47b1761b1bee7184 351047cfe3ac66013fc5c77f23d9bb04f869081d M	drivers
> > > bisect run success
> > > 
> > > # bad: [86737931c2be292ec985df48f2e7fbafb4467f0e] Merge 'md/master' into devel-hourly-2013111107
> > > # good: [5e01dc7b26d9f24f39abace5da98ccbd6a5ceb52] Linux 3.12
> > > git bisect start '86737931c2be292ec985df48f2e7fbafb4467f0e' '5e01dc7b26d9f24f39abace5da98ccbd6a5ceb52' '--'
> > > # good: [21136946c495b0e1e0f7e25a8de6f170efbdeadf] drm/vmwgfx: fix warning if config intel iommu is off.
> > > git bisect good 21136946c495b0e1e0f7e25a8de6f170efbdeadf
> > > # good: [ee360d688c8e37f81c92039f76bebaddbe36befe] Merge branch 'acpi-assorted'
> > > git bisect good ee360d688c8e37f81c92039f76bebaddbe36befe
> > > # good: [cf0613d242805797f252535fcf7bb019512beb46] Merge branch 'gma500-next' of git://github.com/patjak/drm-gma500 into drm-next
> > > git bisect good cf0613d242805797f252535fcf7bb019512beb46
> > > # good: [feba070dbac6f7b477570e590a7dc960b7b0f784] Merge branch 'pm-sleep'
> > > git bisect good feba070dbac6f7b477570e590a7dc960b7b0f784
> > > # good: [6f092343855a71e03b8d209815d8c45bf3a27fcd] net: flow_dissector: fail on evil iph->ihl
> > > git bisect good 6f092343855a71e03b8d209815d8c45bf3a27fcd
> > > # good: [7d963128c95a790b40ae5bac6af23646ceffcb54] Merge 'drm/drm-next' into devel-hourly-2013111107
> > > git bisect good 7d963128c95a790b40ae5bac6af23646ceffcb54
> > > # bad: [917bb50339fbbea9c5f47d257ea42f9652129c3f] raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
> > > git bisect bad 917bb50339fbbea9c5f47d257ea42f9652129c3f
> > > # bad: [28cc2127527dcba2a0817afa8fd5a33c9e023090] raid5: relieve lock contention in get_active_stripe()
> > > git bisect bad 28cc2127527dcba2a0817afa8fd5a33c9e023090
> > > # good: [09b06d70464536cb3cce7cf1c52f23a321604f62] md: fix calculation of stacking limits on level change.
> > > git bisect good 09b06d70464536cb3cce7cf1c52f23a321604f62
> > > # good: [c0f773604d33616b07d61841463a056a780f5ae7] wait: add wait_event_cmd()
> > > git bisect good c0f773604d33616b07d61841463a056a780f5ae7
> > > # first bad commit: [28cc2127527dcba2a0817afa8fd5a33c9e023090] raid5: relieve lock contention in get_active_stripe()
> > > 
> > 
> > I think I've fixed this by merging in the follow.
> > 
> > Shaohua: could you please review and confirm if you  agree?
> > 
> > Thanks,
> > NeilBrown
> > 
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index c37ffca1b13c..93090b2afab4 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -697,6 +697,7 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
> >  				if (!test_bit(STRIPE_HANDLE, &sh->state))
> >  					atomic_inc(&conf->active_stripes);
> >  				if (list_empty(&sh->lru) &&
> > +				    !test_bit(STRIPE_ON_RELEASE_LIST, &sh->state) &&
> >  				    !test_bit(STRIPE_EXPANDING, &sh->state))
> >  					BUG();
> >  				list_del_init(&sh->lru);
> 
> Yes, makes a lot of sense. Thanks!

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

      reply	other threads:[~2013-11-13  2:45 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-11 14:47 [raid5] kernel BUG at drivers/md/raid5.c:701! fengguang.wu
2013-11-12  0:55 ` NeilBrown
2013-11-13  0:28   ` Shaohua Li
2013-11-13  2:45     ` NeilBrown [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131113134532.107de176@notabene.brown \
    --to=neilb@suse.de \
    --cc=fengguang.wu@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=shli@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).