linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 1/2] sysfs: handle 'parent deleted before child added'
@ 2012-04-06 20:41 Dan Williams
  2012-04-06 20:41 ` [PATCH v2 2/2] sysfs: provide more diagnostic info for kobject_add_internal() failures Dan Williams
  2012-04-06 20:45 ` [PATCH v2 1/2] sysfs: handle 'parent deleted before child added' Greg KH
  0 siblings, 2 replies; 7+ messages in thread
From: Dan Williams @ 2012-04-06 20:41 UTC (permalink / raw)
  To: gregkh; +Cc: James Bottomley, linux-kernel, linux-scsi

In scsi at least two cases of the parent device being deleted before the
child is added have been observed.

1/ scsi is performing async scans and the device is removed prior to the
   async can thread running (can happen with an in-opportune / unlikely
   unplug during initial scan).

2/ libsas discovery event running after the parent port has been torn
   down (this is a bug in libsas).

Result in crash signatures like:
 BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
 IP: [<ffffffff8115e100>] sysfs_create_dir+0x32/0xb6
 ...
 Process scsi_scan_8 (pid: 5417, threadinfo ffff88080bd16000, task ffff880801b8a0b0)
 Stack:
  00000000fffffffe ffff880813470628 ffff88080bd17cd0 ffff88080614b7e8
  ffff88080b45c108 00000000fffffffe ffff88080bd17d20 ffffffff8125e4a8
  ffff88080bd17cf0 ffffffff81075149 ffff88080bd17d30 ffff88080614b7e8
 Call Trace:
  [<ffffffff8125e4a8>] kobject_add_internal+0x120/0x1e3
  [<ffffffff81075149>] ? trace_hardirqs_on+0xd/0xf
  [<ffffffff8125e641>] kobject_add_varg+0x41/0x50
  [<ffffffff8125e70b>] kobject_add+0x64/0x66
  [<ffffffff8131122b>] device_add+0x12d/0x63a

In this scenario the parent is still valid (because we have a
reference), but it has been device_del()'d which means its kobj->sd
pointer is NULL'd via:

 device_del()->kobject_del()->sysfs_remove_dir()

...and then sysfs_create_dir() (without this fix) goes ahead and
de-references parent_sd via sysfs_ns_type():

 return (sd->s_flags & SYSFS_NS_TYPE_MASK) >> SYSFS_NS_TYPE_SHIFT;

This scenario is being fixed in scsi/libsas, but if other subsystems
present the same ordering the system need not immediately crash.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

Changes since v1: http://marc.info/?l=linux-scsi&m=133239707303441&w=2

1/ split out the kobject_add_internal() warning message changes into
   its own patch

2/ clarified the changelog a bit due to the discussion from v1.

I think this patch can wait until 3.5, since it is just closing a hole I
found along the way to fixing the real bug.  Now, the system won't crash
at device_add() time, but the system may still be de-stabilized if the
caller does not handle device_add() errors properly.

 fs/sysfs/dir.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 7fdf6a7..86521ee 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -714,6 +714,9 @@ int sysfs_create_dir(struct kobject * kobj)
 	else
 		parent_sd = &sysfs_root;
 
+	if (!parent_sd)
+		return -ENOENT;
+
 	if (sysfs_ns_type(parent_sd))
 		ns = kobj->ktype->namespace(kobj);
 	type = sysfs_read_ns_type(kobj);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v2 2/2] sysfs: provide more diagnostic info for kobject_add_internal() failures
  2012-04-06 20:41 [PATCH v2 1/2] sysfs: handle 'parent deleted before child added' Dan Williams
@ 2012-04-06 20:41 ` Dan Williams
  2012-04-06 20:45 ` [PATCH v2 1/2] sysfs: handle 'parent deleted before child added' Greg KH
  1 sibling, 0 replies; 7+ messages in thread
From: Dan Williams @ 2012-04-06 20:41 UTC (permalink / raw)
  To: gregkh; +Cc: James Bottomley, linux-kernel, linux-scsi

1/ convert open-coded KERN_ERR+dump_stack() to WARN(), so that automated
   tools pick up this warning.

2/ include the 'child' and 'parent' kobject names.  This information was
   useful for tracking down the case where scsi invoked device_del() on a
   parent object and subsequently invoked device_add() on a child.  Now the
   warning looks like:

     kobject_add_internal failed for target8:0:16 (error: -2 parent: end_device-8:0:24)
     Pid: 2942, comm: scsi_scan_8 Not tainted 3.3.0-rc7-isci+ #2
     Call Trace:
      [<ffffffff8125e551>] kobject_add_internal+0x1c1/0x1f3
      [<ffffffff81075149>] ? trace_hardirqs_on+0xd/0xf
      [<ffffffff8125e659>] kobject_add_varg+0x41/0x50
      [<ffffffff8125e723>] kobject_add+0x64/0x66
      [<ffffffff8131124b>] device_add+0x12d/0x63a
      [<ffffffff8125e0ef>] ? kobject_put+0x4c/0x50
      [<ffffffff8132f370>] scsi_sysfs_add_sdev+0x4e/0x28a
      [<ffffffff8132dce3>] do_scan_async+0x9c/0x145

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 lib/kobject.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/kobject.c b/lib/kobject.c
index c33d7a1..1bd0893 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -192,14 +192,14 @@ static int kobject_add_internal(struct kobject *kobj)
 
 		/* be noisy on error issues */
 		if (error == -EEXIST)
-			printk(KERN_ERR "%s failed for %s with "
-			       "-EEXIST, don't try to register things with "
-			       "the same name in the same directory.\n",
-			       __func__, kobject_name(kobj));
+			WARN(1, "%s failed for %s with "
+			     "-EEXIST, don't try to register things with "
+			     "the same name in the same directory.\n",
+			     __func__, kobject_name(kobj));
 		else
-			printk(KERN_ERR "%s failed for %s (%d)\n",
-			       __func__, kobject_name(kobj), error);
-		dump_stack();
+			WARN(1, "%s failed for %s (error: %d parent: %s)\n",
+			     __func__, kobject_name(kobj), error,
+			     parent ? kobject_name(parent) : "'none'");
 	} else
 		kobj->state_in_sysfs = 1;
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] sysfs: handle 'parent deleted before child added'
  2012-04-06 20:41 [PATCH v2 1/2] sysfs: handle 'parent deleted before child added' Dan Williams
  2012-04-06 20:41 ` [PATCH v2 2/2] sysfs: provide more diagnostic info for kobject_add_internal() failures Dan Williams
@ 2012-04-06 20:45 ` Greg KH
  2012-04-06 21:06   ` Williams, Dan J
  1 sibling, 1 reply; 7+ messages in thread
From: Greg KH @ 2012-04-06 20:45 UTC (permalink / raw)
  To: Dan Williams; +Cc: James Bottomley, linux-kernel, linux-scsi

On Fri, Apr 06, 2012 at 01:41:06PM -0700, Dan Williams wrote:
> In scsi at least two cases of the parent device being deleted before the
> child is added have been observed.
> 
> 1/ scsi is performing async scans and the device is removed prior to the
>    async can thread running (can happen with an in-opportune / unlikely
>    unplug during initial scan).

That sounds like a bug in the scsi code, doesn't it?

> 2/ libsas discovery event running after the parent port has been torn
>    down (this is a bug in libsas).

Is this fixed somewhere?

I don't want to paper over bugs like this by changing the sysfs core.
We went through this a lot years ago when scsi changed to use the driver
core, and I thought we had fixed all of these types of errors properly.

So, any chance to fix these properly as well?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] sysfs: handle 'parent deleted before child added'
  2012-04-06 20:45 ` [PATCH v2 1/2] sysfs: handle 'parent deleted before child added' Greg KH
@ 2012-04-06 21:06   ` Williams, Dan J
  2012-04-06 21:17     ` Greg KH
  0 siblings, 1 reply; 7+ messages in thread
From: Williams, Dan J @ 2012-04-06 21:06 UTC (permalink / raw)
  To: Greg KH; +Cc: James Bottomley, linux-kernel, linux-scsi

On Fri, Apr 6, 2012 at 1:45 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Fri, Apr 06, 2012 at 01:41:06PM -0700, Dan Williams wrote:
>> In scsi at least two cases of the parent device being deleted before the
>> child is added have been observed.
>>
>> 1/ scsi is performing async scans and the device is removed prior to the
>>    async can thread running (can happen with an in-opportune / unlikely
>>    unplug during initial scan).
>
> That sounds like a bug in the scsi code, doesn't it?
>
>> 2/ libsas discovery event running after the parent port has been torn
>>    down (this is a bug in libsas).
>
> Is this fixed somewhere?

Yes, these two issues have pending fixes that are posted to linux-scsi:

http://marc.info/?l=linux-scsi&m=133239707903443&w=2
http://marc.info/?l=linux-scsi&m=133239709603452&w=2

> I don't want to paper over bugs like this by changing the sysfs core.
> We went through this a lot years ago when scsi changed to use the driver
> core, and I thought we had fixed all of these types of errors properly.

Hotplug lifetime rules are still transport specific.  So in this case
scsi-core is innocent these are bugs from libsas and
scsi_transport_sas.

> So, any chance to fix these properly as well?

This patch doesn't really paper over anything.  It turns a NULL
pointer crash into an explicit warning from kobject_add_internal.  For
the libsas/scsi case this device_add() failure is still fatal.
Regardless of whether sysfs changes the above two fixes are still
required.

Since the -EEXIST case is just a KERN_ERR and not a BUG_ON I figured
it was worthwhile to post a patch to do the same for this 'parent
deleted' case.  But if crashing is the expectation then this patch can
be dropped.

--
Dan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] sysfs: handle 'parent deleted before child added'
  2012-04-06 21:06   ` Williams, Dan J
@ 2012-04-06 21:17     ` Greg KH
  2012-04-06 21:44       ` Williams, Dan J
  0 siblings, 1 reply; 7+ messages in thread
From: Greg KH @ 2012-04-06 21:17 UTC (permalink / raw)
  To: Williams, Dan J; +Cc: James Bottomley, linux-kernel, linux-scsi

On Fri, Apr 06, 2012 at 02:06:50PM -0700, Williams, Dan J wrote:
> On Fri, Apr 6, 2012 at 1:45 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> > On Fri, Apr 06, 2012 at 01:41:06PM -0700, Dan Williams wrote:
> >> In scsi at least two cases of the parent device being deleted before the
> >> child is added have been observed.
> >>
> >> 1/ scsi is performing async scans and the device is removed prior to the
> >>    async can thread running (can happen with an in-opportune / unlikely
> >>    unplug during initial scan).
> >
> > That sounds like a bug in the scsi code, doesn't it?
> >
> >> 2/ libsas discovery event running after the parent port has been torn
> >>    down (this is a bug in libsas).
> >
> > Is this fixed somewhere?
> 
> Yes, these two issues have pending fixes that are posted to linux-scsi:
> 
> http://marc.info/?l=linux-scsi&m=133239707903443&w=2
> http://marc.info/?l=linux-scsi&m=133239709603452&w=2
> 
> > I don't want to paper over bugs like this by changing the sysfs core.
> > We went through this a lot years ago when scsi changed to use the driver
> > core, and I thought we had fixed all of these types of errors properly.
> 
> Hotplug lifetime rules are still transport specific.  So in this case
> scsi-core is innocent these are bugs from libsas and
> scsi_transport_sas.

Ok, thanks for the explaination.

> > So, any chance to fix these properly as well?
> 
> This patch doesn't really paper over anything.  It turns a NULL
> pointer crash into an explicit warning from kobject_add_internal.  For
> the libsas/scsi case this device_add() failure is still fatal.
> Regardless of whether sysfs changes the above two fixes are still
> required.
> 
> Since the -EEXIST case is just a KERN_ERR and not a BUG_ON I figured
> it was worthwhile to post a patch to do the same for this 'parent
> deleted' case.  But if crashing is the expectation then this patch can
> be dropped.

No, crashing is not the expectation :)

But, without that crash, would the above fixes ever have been noticed
and fixed?  The device_add() most likely would have quietly failed and
no one would have been the wiser.

Or would something else have caused this to be an obvious problem?

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] sysfs: handle 'parent deleted before child added'
  2012-04-06 21:17     ` Greg KH
@ 2012-04-06 21:44       ` Williams, Dan J
  2012-04-06 22:11         ` Greg KH
  0 siblings, 1 reply; 7+ messages in thread
From: Williams, Dan J @ 2012-04-06 21:44 UTC (permalink / raw)
  To: Greg KH; +Cc: James Bottomley, linux-kernel, linux-scsi

On Fri, Apr 6, 2012 at 2:17 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Fri, Apr 06, 2012 at 02:06:50PM -0700, Williams, Dan J wrote:
>> On Fri, Apr 6, 2012 at 1:45 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
>> > On Fri, Apr 06, 2012 at 01:41:06PM -0700, Dan Williams wrote:
>> >> In scsi at least two cases of the parent device being deleted before the
>> >> child is added have been observed.
>> >>
>> >> 1/ scsi is performing async scans and the device is removed prior to the
>> >>    async can thread running (can happen with an in-opportune / unlikely
>> >>    unplug during initial scan).
>> >
>> > That sounds like a bug in the scsi code, doesn't it?
>> >
>> >> 2/ libsas discovery event running after the parent port has been torn
>> >>    down (this is a bug in libsas).
>> >
>> > Is this fixed somewhere?
>>
>> Yes, these two issues have pending fixes that are posted to linux-scsi:
>>
>> http://marc.info/?l=linux-scsi&m=133239707903443&w=2
>> http://marc.info/?l=linux-scsi&m=133239709603452&w=2
>>
>> > I don't want to paper over bugs like this by changing the sysfs core.
>> > We went through this a lot years ago when scsi changed to use the driver
>> > core, and I thought we had fixed all of these types of errors properly.
>>
>> Hotplug lifetime rules are still transport specific.  So in this case
>> scsi-core is innocent these are bugs from libsas and
>> scsi_transport_sas.
>
> Ok, thanks for the explaination.
>
>> > So, any chance to fix these properly as well?
>>
>> This patch doesn't really paper over anything.  It turns a NULL
>> pointer crash into an explicit warning from kobject_add_internal.  For
>> the libsas/scsi case this device_add() failure is still fatal.
>> Regardless of whether sysfs changes the above two fixes are still
>> required.
>>
>> Since the -EEXIST case is just a KERN_ERR and not a BUG_ON I figured
>> it was worthwhile to post a patch to do the same for this 'parent
>> deleted' case.  But if crashing is the expectation then this patch can
>> be dropped.
>
> No, crashing is not the expectation :)

I thought not, but sometimes the kernel likes to teach people that
bollix an api a hard lesson :).

>
> But, without that crash, would the above fixes ever have been noticed
> and fixed?  The device_add() most likely would have quietly failed and
> no one would have been the wiser.
>
> Or would something else have caused this to be an obvious problem?
>

We still have the big red flag dump_stack() in kobject_add_internal()
(which patch 2 turns into a real WARN()), and for scsi our hotplug
tests still crash later on because libsas makes assumptions about the
device path.  I understand the paranoia here,  "check for NULL" is
usually a band-aid, but in this case this is just a softer
introduction to a debug session.  No less vocal than before as far as
I can see.

--
Dan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] sysfs: handle 'parent deleted before child added'
  2012-04-06 21:44       ` Williams, Dan J
@ 2012-04-06 22:11         ` Greg KH
  0 siblings, 0 replies; 7+ messages in thread
From: Greg KH @ 2012-04-06 22:11 UTC (permalink / raw)
  To: Williams, Dan J; +Cc: James Bottomley, linux-kernel, linux-scsi

On Fri, Apr 06, 2012 at 02:44:37PM -0700, Williams, Dan J wrote:
> On Fri, Apr 6, 2012 at 2:17 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> > On Fri, Apr 06, 2012 at 02:06:50PM -0700, Williams, Dan J wrote:
> >> On Fri, Apr 6, 2012 at 1:45 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> >> > On Fri, Apr 06, 2012 at 01:41:06PM -0700, Dan Williams wrote:
> >> >> In scsi at least two cases of the parent device being deleted before the
> >> >> child is added have been observed.
> >> >>
> >> >> 1/ scsi is performing async scans and the device is removed prior to the
> >> >>    async can thread running (can happen with an in-opportune / unlikely
> >> >>    unplug during initial scan).
> >> >
> >> > That sounds like a bug in the scsi code, doesn't it?
> >> >
> >> >> 2/ libsas discovery event running after the parent port has been torn
> >> >>    down (this is a bug in libsas).
> >> >
> >> > Is this fixed somewhere?
> >>
> >> Yes, these two issues have pending fixes that are posted to linux-scsi:
> >>
> >> http://marc.info/?l=linux-scsi&m=133239707903443&w=2
> >> http://marc.info/?l=linux-scsi&m=133239709603452&w=2
> >>
> >> > I don't want to paper over bugs like this by changing the sysfs core.
> >> > We went through this a lot years ago when scsi changed to use the driver
> >> > core, and I thought we had fixed all of these types of errors properly.
> >>
> >> Hotplug lifetime rules are still transport specific.  So in this case
> >> scsi-core is innocent these are bugs from libsas and
> >> scsi_transport_sas.
> >
> > Ok, thanks for the explaination.
> >
> >> > So, any chance to fix these properly as well?
> >>
> >> This patch doesn't really paper over anything.  It turns a NULL
> >> pointer crash into an explicit warning from kobject_add_internal.  For
> >> the libsas/scsi case this device_add() failure is still fatal.
> >> Regardless of whether sysfs changes the above two fixes are still
> >> required.
> >>
> >> Since the -EEXIST case is just a KERN_ERR and not a BUG_ON I figured
> >> it was worthwhile to post a patch to do the same for this 'parent
> >> deleted' case.  But if crashing is the expectation then this patch can
> >> be dropped.
> >
> > No, crashing is not the expectation :)
> 
> I thought not, but sometimes the kernel likes to teach people that
> bollix an api a hard lesson :).
> 
> >
> > But, without that crash, would the above fixes ever have been noticed
> > and fixed?  The device_add() most likely would have quietly failed and
> > no one would have been the wiser.
> >
> > Or would something else have caused this to be an obvious problem?
> >
> 
> We still have the big red flag dump_stack() in kobject_add_internal()
> (which patch 2 turns into a real WARN()), and for scsi our hotplug
> tests still crash later on because libsas makes assumptions about the
> device path.  I understand the paranoia here,  "check for NULL" is
> usually a band-aid, but in this case this is just a softer
> introduction to a debug session.  No less vocal than before as far as
> I can see.

Ok, that's reasonable, I'll queue this up for 3.5.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-04-06 22:11 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-04-06 20:41 [PATCH v2 1/2] sysfs: handle 'parent deleted before child added' Dan Williams
2012-04-06 20:41 ` [PATCH v2 2/2] sysfs: provide more diagnostic info for kobject_add_internal() failures Dan Williams
2012-04-06 20:45 ` [PATCH v2 1/2] sysfs: handle 'parent deleted before child added' Greg KH
2012-04-06 21:06   ` Williams, Dan J
2012-04-06 21:17     ` Greg KH
2012-04-06 21:44       ` Williams, Dan J
2012-04-06 22:11         ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).