* race between nvme device creation and discovery?
@ 2024-02-02 15:16 Daniel Wagner
2024-02-05 5:02 ` Hannes Reinecke
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Wagner @ 2024-02-02 15:16 UTC (permalink / raw)
To: linux-nvme@lists.infradead.org
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg
I am trying to figure out why some of the blktests fail randomly when
running with FC as transport. The failures only appear when the
autoconnect is running in the background, a clear indication that we
still have some sort of interference with it.

nvme/030 fails a bit more often than the rest, possibly because it
issues several 'nvme discover' commands, while many other tests issue
only one.
When a test fails, it fails with
failed to lookup subsystem for controller nvme0
which is from libnvme when it iterates over sysfs to gather information:

	subsysname = nvme_ctrl_lookup_subsystem_name(r, name);
	if (!subsysname) {
		nvme_msg(r, LOG_ERR,
			 "failed to lookup subsystem for controller %s\n",
			 name);
		errno = ENXIO;
		return NULL;
	}
My current theory is that adding a new controller is not atomic from
userland's point of view, and thus libnvme can observe a state where
the controller exists but the matching subsystem is not yet visible.
So something like:
	nvme_init_ctrl
	  cdev_device_add

	// libnvme iterates over sysfs

	nvme_init_ctrl_finish
	  nvme_init_identify
	    nvme_init_subsystem
	      device_add        // nvme-subsys%d
	      sysfs_create_link // subsys->dev -> ctrl-device
Does this make any sense? And if so, what could be done? Should we add
some retry logic to libnvme?
Daniel
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: race between nvme device creation and discovery?
2024-02-02 15:16 race between nvme device creation and discovery? Daniel Wagner
@ 2024-02-05 5:02 ` Hannes Reinecke
2024-02-05 7:47 ` Daniel Wagner
0 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2024-02-05 5:02 UTC (permalink / raw)
To: Daniel Wagner, linux-nvme@lists.infradead.org
Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg
On 2/2/24 23:16, Daniel Wagner wrote:
> I am trying to figure out why some of the blktests fail randomly when
> running with FC as transport. The failures only appear when the
> autoconnect is running in the background, a clear indication that we
> still have some sort of interference with it.
>
> nvme/030 fails a bit more often than the rest, possibly because it
> issues several 'nvme discover' commands, while many other tests issue
> only one.
>
> When a test fails, it fails with
>
> failed to lookup subsystem for controller nvme0
>
> which is from libnvme when it iterates over sysfs to gather information:
>
> 	subsysname = nvme_ctrl_lookup_subsystem_name(r, name);
> 	if (!subsysname) {
> 		nvme_msg(r, LOG_ERR,
> 			 "failed to lookup subsystem for controller %s\n",
> 			 name);
> 		errno = ENXIO;
> 		return NULL;
> 	}
>
> My current theory is that adding a new controller is not atomic from
> userland's point of view, and thus libnvme can observe a state where
> the controller exists but the matching subsystem is not yet visible.
>
> So something like:
>
> 	nvme_init_ctrl
> 	  cdev_device_add
>
> 	// libnvme iterates over sysfs
>
> 	nvme_init_ctrl_finish
> 	  nvme_init_identify
> 	    nvme_init_subsystem
> 	      device_add        // nvme-subsys%d
> 	      sysfs_create_link // subsys->dev -> ctrl-device
>
> Does this make any sense? And if so, what could be done? Should we add
> some retry logic to libnvme?
>
Hehe. Good old sysfs.

This is a common issue with sysfs; we even had a retry loop in udev
back in the day to avoid these kinds of things.

The point is, the uevent is sent out by device_add(), causing udev to
run its rules and eventually call into libnvme to scan the device. But
as you rightly pointed out, the sysfs link is only created _after_ the
event has been sent, so there's a race window during which libnvme
fails to read the link, landing us in the scenario above.

While we could add retry logic to libnvme, I'm not really convinced
this is a good idea; in the end, who's to tell how long we should wait?
A second? Several seconds? A minute? Several minutes? Also note that
sysfs_create_link() has a return code, so the link might not be created
at all ...
A possibly better way would be to suppress uevents on device_add() and
only send them out once the device is fully set up, i.e. just before
the 'return 0'.
Let me see if I can whip up a patch ...
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich
* Re: Re: race between nvme device creation and discovery?
2024-02-05 5:02 ` Hannes Reinecke
@ 2024-02-05 7:47 ` Daniel Wagner
2024-02-05 7:57 ` Hannes Reinecke
2024-02-05 9:20 ` Maurizio Lombardi
0 siblings, 2 replies; 11+ messages in thread
From: Daniel Wagner @ 2024-02-05 7:47 UTC (permalink / raw)
To: Hannes Reinecke
Cc: linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig,
Sagi Grimberg
On Mon, Feb 05, 2024 at 02:02:04PM +0900, Hannes Reinecke wrote:
> Hehe. Good old sysfs.
> This is a common issue with sysfs; we even had a retry loop in udev
> back in the day to avoid these kinds of things.
>
> The point is, the uevent is sent out by device_add(), causing udev to
> run its rules and eventually call into libnvme to scan the device. But
> as you rightly pointed out, the sysfs link is only created _after_ the
> event has been sent, so there's a race window during which libnvme
> fails to read the link, landing us in the scenario above.
>
> While we could add retry logic to libnvme, I'm not really convinced
> this is a good idea; in the end, who's to tell how long we should wait?
> A second? Several seconds? A minute? Several minutes? Also note that
> sysfs_create_link() has a return code, so the link might not be created
> at all ...
Yep, retry logic always has a smell to it. What about the idea of
adding some sort of ready attribute to the controller sysfs entry?
> A possibly better way would be to suppress uevents on device_add() and
> only send them out once the device is fully set up, i.e. just before
> the 'return 0'.
I don't think this will address the problem. blktests runs completely
independently of what udev does, and it's blktests which observes the
missing link, not udev.
* Re: race between nvme device creation and discovery?
2024-02-05 7:47 ` Daniel Wagner
@ 2024-02-05 7:57 ` Hannes Reinecke
2024-02-05 8:46 ` Daniel Wagner
2024-02-05 9:20 ` Maurizio Lombardi
1 sibling, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2024-02-05 7:57 UTC (permalink / raw)
To: Daniel Wagner
Cc: linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig,
Sagi Grimberg
On 2/5/24 15:47, Daniel Wagner wrote:
> On Mon, Feb 05, 2024 at 02:02:04PM +0900, Hannes Reinecke wrote:
>> Hehe. Good old sysfs.
>> This is a common issue with sysfs; we even had a retry loop in udev
>> back in the day to avoid these kinds of things.
>>
>> The point is, the uevent is sent out by device_add(), causing udev to
>> run its rules and eventually call into libnvme to scan the device. But
>> as you rightly pointed out, the sysfs link is only created _after_ the
>> event has been sent, so there's a race window during which libnvme
>> fails to read the link, landing us in the scenario above.
>>
>> While we could add retry logic to libnvme, I'm not really convinced
>> this is a good idea; in the end, who's to tell how long we should wait?
>> A second? Several seconds? A minute? Several minutes? Also note that
>> sysfs_create_link() has a return code, so the link might not be created
>> at all ...
>
> Yep, retry logic always has a smell to it. What about the idea of
> adding some sort of ready attribute to the controller sysfs entry?
>
>> A possibly better way would be to suppress uevents on device_add() and
>> only send them out once the device is fully set up, i.e. just before
>> the 'return 0'.
>
> I don't think this will address the problem. blktests runs completely
> independently of what udev does, and it's blktests which observes the
> missing link, not udev.
I thought something along these lines:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 85ab0fcf9e88..3ed706f595fe 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2845,6 +2845,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 			goto out_put_subsystem;
 		}
 	} else {
+		dev_set_uevent_suppress(&subsys->dev, 1);
 		ret = device_add(&subsys->dev);
 		if (ret) {
 			dev_err(ctrl->device,
@@ -2869,6 +2870,10 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	ctrl->subsys = subsys;
 	list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
 	mutex_unlock(&nvme_subsystems_lock);
+	if (!found) {
+		dev_set_uevent_suppress(&subsys->dev, 0);
+		kobject_uevent(&subsys->dev.kobj, KOBJ_ADD);
+	}
 	return 0;
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich
* Re: Re: race between nvme device creation and discovery?
2024-02-05 7:57 ` Hannes Reinecke
@ 2024-02-05 8:46 ` Daniel Wagner
0 siblings, 0 replies; 11+ messages in thread
From: Daniel Wagner @ 2024-02-05 8:46 UTC (permalink / raw)
To: Hannes Reinecke
Cc: linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig,
Sagi Grimberg
On Mon, Feb 05, 2024 at 04:57:24PM +0900, Hannes Reinecke wrote:
> I thought something along these lines:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 85ab0fcf9e88..3ed706f595fe 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2845,6 +2845,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl,
> struct nvme_id_ctrl *id)
> goto out_put_subsystem;
> }
> } else {
> + dev_set_uevent_suppress(&subsys->dev, 1);
> ret = device_add(&subsys->dev);
> if (ret) {
> dev_err(ctrl->device,
> @@ -2869,6 +2870,10 @@ static int nvme_init_subsystem(struct nvme_ctrl
> *ctrl, struct nvme_id_ctrl *id)
> ctrl->subsys = subsys;
> list_add_tail(&ctrl->subsys_entry, &subsys->ctrls);
> mutex_unlock(&nvme_subsystems_lock);
> + if (!found) {
> + dev_set_uevent_suppress(&subsys->dev, 0);
> + kobject_uevent(&subsys->dev.kobj, KOBJ_ADD);
> + }
> return 0;
Unfortunately, this fixes a different race, not the one I reported here:
nvme/030 (ensure the discovery generation counter is updated appropriately) [failed]
runtime 0.602s ... 0.542s
--- tests/nvme/030.out 2023-08-30 10:39:08.428409596 +0200
+++ /home/wagi/work/blktests/results/nodev/nvme/030.out.bad 2024-02-05 09:44:28.768208105 +0100
@@ -1,2 +1,3 @@
Running nvme/030
+failed to lookup subsystem for controller nvme0
Test complete
* Re: Re: race between nvme device creation and discovery?
2024-02-05 7:47 ` Daniel Wagner
2024-02-05 7:57 ` Hannes Reinecke
@ 2024-02-05 9:20 ` Maurizio Lombardi
2024-02-05 9:20 ` Maurizio Lombardi
1 sibling, 1 reply; 11+ messages in thread
From: Maurizio Lombardi @ 2024-02-05 9:20 UTC (permalink / raw)
To: Daniel Wagner
Cc: Hannes Reinecke, linux-nvme@lists.infradead.org, Keith Busch,
Christoph Hellwig, Sagi Grimberg
po 5. 2. 2024 v 8:47 odesílatel Daniel Wagner <dwagner@suse.de> napsal:
>
> > While we could add a retry logic to libnvme, I'm not really convinced
> > this is a good idea; in the end, who's to tell how long we should wait?
> > A second? Several seconds? A minute? Several minutes?
> > Also not that sysfs_create_link() has a return code, so the link might
> > not be created at all ...
>
> > Yep, retry logic always has a smell to it. What about the idea of
> > adding some sort of ready attribute to the controller sysfs entry?
I think I had a similar problem with nvme-stas some time ago: it tried
to delete a controller right after it appeared in sysfs, causing race
conditions with the init code path.

Maybe you can take advantage of the NVME_CTRL_STARTED_ONCE flag,
exposing this information via sysfs with a "ready" or "status"
attribute.
Maurizio
* Re: Re: race between nvme device creation and discovery?
2024-02-05 9:20 ` Maurizio Lombardi
@ 2024-02-05 9:20 ` Maurizio Lombardi
2024-02-05 10:07 ` Daniel Wagner
0 siblings, 1 reply; 11+ messages in thread
From: Maurizio Lombardi @ 2024-02-05 9:20 UTC (permalink / raw)
To: Daniel Wagner
Cc: Hannes Reinecke, linux-nvme@lists.infradead.org, Keith Busch,
Christoph Hellwig, Sagi Grimberg
po 5. 2. 2024 v 10:20 odesílatel Maurizio Lombardi <mlombard@redhat.com> napsal:
>
> I think I had a similar problem with nvme-stas some time ago, it tried
> to delete a controller right after it appeared in sysfs, causing race conditions
> with the init code path.
This is the problem I was talking about
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2eb94dd56a4a4e3fe286def3e2ba207804a37345
* Re: Re: Re: race between nvme device creation and discovery?
2024-02-05 9:20 ` Maurizio Lombardi
@ 2024-02-05 10:07 ` Daniel Wagner
2024-02-05 10:17 ` Daniel Wagner
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Wagner @ 2024-02-05 10:07 UTC (permalink / raw)
To: Maurizio Lombardi
Cc: Hannes Reinecke, linux-nvme@lists.infradead.org, Keith Busch,
Christoph Hellwig, Sagi Grimberg
On Mon, Feb 05, 2024 at 10:20:54AM +0100, Maurizio Lombardi wrote:
> po 5. 2. 2024 v 10:20 odesílatel Maurizio Lombardi <mlombard@redhat.com> napsal:
> >
> > I think I had a similar problem with nvme-stas some time ago, it tried
> > to delete a controller right after it appeared in sysfs, causing race conditions
> > with the init code path.
>
> This is the problem I was talking about
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2eb94dd56a4a4e3fe286def3e2ba207804a37345
We have a 'state' attribute in sysfs already, though
NVME_CTRL_STARTED_ONCE is modelled as a flag. Not sure how these fit
together yet, so let me play with it.
* Re: Re: Re: Re: race between nvme device creation and discovery?
2024-02-05 10:07 ` Daniel Wagner
@ 2024-02-05 10:17 ` Daniel Wagner
2024-02-05 17:13 ` Daniel Wagner
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Wagner @ 2024-02-05 10:17 UTC (permalink / raw)
To: Maurizio Lombardi
Cc: Hannes Reinecke, linux-nvme@lists.infradead.org, Keith Busch,
Christoph Hellwig, Sagi Grimberg
On Mon, Feb 05, 2024 at 11:07:48AM +0100, Daniel Wagner wrote:
> On Mon, Feb 05, 2024 at 10:20:54AM +0100, Maurizio Lombardi wrote:
> > po 5. 2. 2024 v 10:20 odesílatel Maurizio Lombardi <mlombard@redhat.com> napsal:
> > >
> > > I think I had a similar problem with nvme-stas some time ago, it tried
> > > to delete a controller right after it appeared in sysfs, causing race conditions
> > > with the init code path.
> >
> > This is the problem I was talking about
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2eb94dd56a4a4e3fe286def3e2ba207804a37345
>
> We have a 'state' attribute in sysfs already, though
> NVME_CTRL_STARTED_ONCE is modelled as a flag. Not sure how these fit
> together yet, so let me play with it.
If I got this right, we could just filter out all controllers which are
in the NEW state, because all transports trigger the connect code right
after the NEW state. At this point their sysfs entries should be
completely populated.
* Re: Re: Re: Re: Re: race between nvme device creation and discovery?
2024-02-05 10:17 ` Daniel Wagner
@ 2024-02-05 17:13 ` Daniel Wagner
2024-02-06 12:45 ` Daniel Wagner
0 siblings, 1 reply; 11+ messages in thread
From: Daniel Wagner @ 2024-02-05 17:13 UTC (permalink / raw)
To: Maurizio Lombardi
Cc: Hannes Reinecke, linux-nvme@lists.infradead.org, Keith Busch,
Christoph Hellwig, Sagi Grimberg
> > We have a 'state' attribute in sysfs already, though
> > NVME_CTRL_STARTED_ONCE is modelled as a flag. Not sure how these fit
> > together yet, so let me play with it.
>
> If I got this right, we could just filter out all controllers which are
> in the NEW state, because all transports trigger the connect code right
> after the NEW state. At this point their sysfs entries should be
> completely populated.
I've patched libnvme to print the state of the controller right before
the subsystem lookup happens, and it says:
+state connecting
+failed to lookup subsystem for controller nvme1
more digging needed...
* Re: race between nvme device creation and discovery?
2024-02-05 17:13 ` Daniel Wagner
@ 2024-02-06 12:45 ` Daniel Wagner
0 siblings, 0 replies; 11+ messages in thread
From: Daniel Wagner @ 2024-02-06 12:45 UTC (permalink / raw)
To: Maurizio Lombardi
Cc: Hannes Reinecke, linux-nvme@lists.infradead.org, Keith Busch,
Christoph Hellwig, Sagi Grimberg
On Mon, Feb 05, 2024 at 06:13:13PM +0100, Daniel Wagner wrote:
> > > We have a 'state' attribute in sysfs already, though
> > > NVME_CTRL_STARTED_ONCE is modelled as a flag. Not sure how these fit
> > > together yet, so let me play with it.
> >
> > If I got this right, we could just filter out all controllers which are
> > in the NEW state, because all transports trigger the connect code right
> > after the NEW state. At this point their sysfs entries should be
> > completely populated.
>
> I've patched libnvme to print the state of the controller right before
> the subsystem lookup happens, and it says:
>
> +state connecting
> +failed to lookup subsystem for controller nvme1
>
> more digging needed...
So after a lot more debugging and testing, I think the real solution is
just not to issue the error message in libnvme; in this case we simply
don't add the device to the tree. Looking at the code, we already have
other error returns in this function which issue nothing:
nvme_scan_ctrl()
{
	[...]
	subsysnqn = nvme_get_attr(path, "subsysnqn");
	if (!subsysnqn) {
		errno = ENXIO;
		return NULL;
	}
	subsysname = nvme_ctrl_lookup_subsystem_name(r, name);
	if (!subsysname) {
		nvme_msg(r, LOG_ERR,
			 "failed to lookup subsystem for controller %s\n",
			 name);
		errno = ENXIO;
		return NULL;
	}
	s = nvme_lookup_subsystem(h, subsysname, subsysnqn);
	if (!s) {
		errno = ENOMEM;
		return NULL;
	}
	[...]
}
Changing it from LOG_ERR to LOG_DEBUG makes the tests happy. No problems
observed.
end of thread, other threads:[~2024-02-06 12:45 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-02 15:16 race between nvme device creation and discovery? Daniel Wagner
2024-02-05 5:02 ` Hannes Reinecke
2024-02-05 7:47 ` Daniel Wagner
2024-02-05 7:57 ` Hannes Reinecke
2024-02-05 8:46 ` Daniel Wagner
2024-02-05 9:20 ` Maurizio Lombardi
2024-02-05 9:20 ` Maurizio Lombardi
2024-02-05 10:07 ` Daniel Wagner
2024-02-05 10:17 ` Daniel Wagner
2024-02-05 17:13 ` Daniel Wagner
2024-02-06 12:45 ` Daniel Wagner