From: Benny Halevy <bhalevy@panasas.com>
To: Boaz Harrosh <bharrosh@panasas.com>, Andy Adamson <andros@netapp.com>
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH] SQUASHME: pnfs: move synchronize_rcu out side of spin_lock
Date: Thu, 18 Nov 2010 19:18:54 +0200 [thread overview]
Message-ID: <4CE55FFE.5070607@panasas.com> (raw)
In-Reply-To: <4CE51FBA.6060806@panasas.com>
On 2010-11-18 14:44, Boaz Harrosh wrote:
> On 11/17/2010 07:41 PM, Benny Halevy wrote:
>> squash into 4a28356 pnfs: CB_NOTIFY_DEVICEID
>>
>> We currently call pnfs_unhash_deviceid under spin_lock c->dc_lock
>> But pnfs_unhash_deviceid calls synchronize_rcu which may sched.
>> This resulted in the following BUG with the cthon tests:
>>
>> Nov 17 18:54:56 tl1 kernel: BUG: spinlock wrong CPU on CPU#1, test5/2170
>> Nov 17 18:54:56 tl1 kernel: lock: ffff880070559ff0, .magic: dead4ead, .owner: test5/2170, .owner_cpu: 0
>> Nov 17 18:54:56 tl1 kernel: Pid: 2170, comm: test5 Not tainted 2.6.37-rc2-pnfs #167
>> Nov 17 18:54:56 tl1 kernel: Call Trace:
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff8122cfc5>] spin_bug+0x9c/0xa3
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff8122d042>] do_raw_spin_unlock+0x76/0x8d
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff814534ea>] _raw_spin_unlock+0x2b/0x30
>> Nov 17 18:54:56 tl1 kernel: [<ffffffffa03c62fc>] pnfs_put_deviceid+0x59/0x64 [nfs]
>> Nov 17 18:54:56 tl1 kernel: [<ffffffffa0018551>] filelayout_free_lseg+0x5a/0x6f [nfs_layout_nfsv41_files]
>> Nov 17 18:54:56 tl1 kernel: [<ffffffffa03c6ea8>] pnfs_free_lseg_list+0x4e/0x8b [nfs]
>> Nov 17 18:54:56 tl1 kernel: [<ffffffffa03c85a4>] _pnfs_return_layout+0xe3/0x213 [nfs]
>> Nov 17 18:54:56 tl1 kernel: [<ffffffffa039bcb2>] nfs4_evict_inode+0x41/0x74 [nfs]
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff8112cab6>] evict+0x27/0x97
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff8112d051>] iput+0x20c/0x243
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff8112476c>] do_unlinkat+0x11c/0x16f
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff81118750>] ? fsnotify_modify+0x66/0x6e
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff81452cbe>] ? lockdep_sys_exit_thunk+0x35/0x67
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff810a0b51>] ? audit_syscall_entry+0x11e/0x14a
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff811247d5>] sys_unlink+0x16/0x18
>> Nov 17 18:54:56 tl1 kernel: [<ffffffff8100acf2>] system_call_fastpath+0x16/0x1b
>>
>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>> ---
>> fs/nfs/pnfs.c | 5 ++---
>> 1 files changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
>> index 559fcce..39c7d9f 100644
>> --- a/fs/nfs/pnfs.c
>> +++ b/fs/nfs/pnfs.c
>> @@ -1651,7 +1651,6 @@ pnfs_unhash_deviceid(struct pnfs_deviceid_cache *c,
>> hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[h], de_node)
>> if (!memcmp(&d->de_id, id, sizeof(*id))) {
>> hlist_del_rcu(&d->de_node);
>> - synchronize_rcu();
>> return d;
>> }
>>
>> @@ -1672,7 +1671,7 @@ pnfs_put_deviceid(struct pnfs_deviceid_cache *c,
>>
>> pnfs_unhash_deviceid(c, &devid->de_id);
>> spin_unlock(&c->dc_lock);
>> -
>> + synchronize_rcu();
>> c->dc_free_callback(devid);
>> }
>> EXPORT_SYMBOL_GPL(pnfs_put_deviceid);
>> @@ -1686,7 +1685,7 @@ pnfs_delete_deviceid(struct pnfs_deviceid_cache *c,
>> spin_lock(&c->dc_lock);
>> devid = pnfs_unhash_deviceid(c, id);
>> spin_unlock(&c->dc_lock);
>> -
>> + synchronize_rcu();
>> dprintk("%s [%d]\n", __func__, atomic_read(&devid->de_ref));
>> if (atomic_dec_and_test(&devid->de_ref))
>> c->dc_free_callback(devid);
>
> OK, so I don't like this fix because before we only synchronize_rcu()
> after an actual hlist_del_rcu.
>
> If we look at the two callers
>
> ()
> if (!atomic_dec_and_lock(&devid->de_ref, &c->dc_lock))
> return;
>
> pnfs_unhash_deviceid(c, &devid->de_id);
> spin_unlock(&c->dc_lock);
>
> pnfs_delete_deviceid
> spin_lock(&c->dc_lock);
> devid = pnfs_unhash_deviceid(c, id);
> spin_unlock(&c->dc_lock);
>
>
> The pnfs_put_deviceid does an atomic_dec_and_lock so not to race with
> pnfs_find_get_deviceid. But in pnfs_find_get_deviceid we do atomic_inc_not_zero
> so we don't need that we can change pnfs_put_deviceid to just atomic_dec_and_test
> and then take the lock inside pnfs_unhash_deviceid, and remove from callers.
> Some thing like below. (Untested, only compiled)
OK, that makes sense, BUT
If pnfs_find_get_deviceid can see de_ref==0 it's racing with pnfs_put_deviceid
which is about to free the structure - i.e. pnfs_find_get_deviceid may cause
use after free bugs.
So I'm confused if the current scheme works at all.
Andy, it seems to me this optimization is a bit premature
and we'd be better off reverting to using simple spin locks rather the
rcu locks, certainly until we have a good testing infrastructure
for cb_notifydeviceid. I'm not sure how much we're saving anyhow.
Benny
>
> ---
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index 61310c7..5d4e14d 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1598,16 +1598,21 @@ pnfs_unhash_deviceid(struct pnfs_deviceid_cache *c,
> {
> struct pnfs_deviceid_node *d;
> struct hlist_node *n;
> - long h = nfs4_deviceid_hash(id);
> + long h;
> +
> + spin_lock(&c->dc_lock);
> + h = nfs4_deviceid_hash(id);
>
> dprintk("%s hash %ld\n", __func__, h);
> hlist_for_each_entry_rcu(d, n, &c->dc_deviceids[h], de_node)
> if (!memcmp(&d->de_id, id, sizeof(*id))) {
> hlist_del_rcu(&d->de_node);
> + spin_unlock(&c->dc_lock);
> synchronize_rcu();
> return d;
> }
>
> + spin_unlock(&c->dc_lock);
> return NULL;
> }
>
> @@ -1620,11 +1625,10 @@ pnfs_put_deviceid(struct pnfs_deviceid_cache *c,
> struct pnfs_deviceid_node *devid)
> {
> dprintk("%s [%d]\n", __func__, atomic_read(&devid->de_ref));
> - if (!atomic_dec_and_lock(&devid->de_ref, &c->dc_lock))
> + if (!atomic_dec_and_test(&devid->de_ref))
> return;
>
> pnfs_unhash_deviceid(c, &devid->de_id);
> - spin_unlock(&c->dc_lock);
>
> c->dc_free_callback(devid);
> }
> @@ -1636,9 +1640,7 @@ pnfs_delete_deviceid(struct pnfs_deviceid_cache *c,
> {
> struct pnfs_deviceid_node *devid;
>
> - spin_lock(&c->dc_lock);
> devid = pnfs_unhash_deviceid(c, id);
> - spin_unlock(&c->dc_lock);
>
> dprintk("%s [%d]\n", __func__, atomic_read(&devid->de_ref));
> if (atomic_dec_and_test(&devid->de_ref))
>
>
next prev parent reply other threads:[~2010-11-18 17:18 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-17 17:41 [PATCH] SQUASHME: pnfs: move synchronize_rcu out side of spin_lock Benny Halevy
2010-11-18 12:44 ` Boaz Harrosh
2010-11-18 17:18 ` Benny Halevy [this message]
2010-11-19 15:27 ` Andy Adamson
2010-11-21 9:37 ` Boaz Harrosh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4CE55FFE.5070607@panasas.com \
--to=bhalevy@panasas.com \
--cc=andros@netapp.com \
--cc=bharrosh@panasas.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.