Linux-NVME Archive on lore.kernel.org
* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
       [not found]                 ` <ZUG0gcRhUlFm57qN@mail-itl>
@ 2023-11-01  3:24                   ` Ming Lei
  2023-11-01 10:15                     ` Hannes Reinecke
  0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2023-11-01  3:24 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm, linux-block,
	linux-nvme, ming.lei

On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
> On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
> > On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
> > <marmarek@invisiblethingslab.com> wrote:
> > >
> > > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > > Then tried:
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > > >  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > >
> > > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > > and I can't reproduce the issue there. I'm confused...
> > > >
> > > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > > causing hangs is most likely just a coincidence. Rather, something either in
> > > > the block layer or in the storage driver has problems handling bios
> > > > with sufficiently high-order pages attached. This is going to be a bit
> > > > painful to debug, I'm afraid. How long does it take for you to trigger the
> > > > hang? I'm asking to get a rough estimate of how heavy a tracing load we can
> > > > afford so that we don't overwhelm the system...
> > >
> > > Sometimes it freezes just after logging in, but in the worst case it takes
> > > me about 10 min of more or less `tar xz` + `dd`.
> > 
> > blk-mq debugfs is usually helpful for hang issues in the block layer or
> > underlying drivers:
> > 
> > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> > 
> > BTW, you can collect logs for just the exact disks, if you know which
> > ones are behind dm-crypt; that can be figured out with `lsblk`. The
> > logs have to be collected after the hang is triggered.
> 
> dm-crypt lives on the nvme disk; this is what I collected when it
> hung:
> 
...
> nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
> nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}

Two requests stay in the sw queue, but they are not related to this issue.

> nvme0n1/hctx4/type:default
> nvme0n1/hctx4/dispatch_busy:9

A non-zero dispatch_busy means that BLK_STS_RESOURCE has been returned
from nvme_queue_rq() recently and repeatedly.

> nvme0n1/hctx4/active:0
> nvme0n1/hctx4/run:20290468

...

> nvme0n1/hctx4/tags:nr_tags=1023
> nvme0n1/hctx4/tags:nr_reserved_tags=0
> nvme0n1/hctx4/tags:active_queues=0
> nvme0n1/hctx4/tags:bitmap_tags:
> nvme0n1/hctx4/tags:depth=1023
> nvme0n1/hctx4/tags:busy=3

Just three requests are in flight: two are in the sw queue, and the other is in hctx->dispatch.

...

> nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
> nvme0n1/hctx4/state:SCHED_RESTART

The request staying in hctx->dispatch can't move on, and nvme_queue_rq()
returns BLK_STS_RESOURCE constantly; you can verify this with the
following bpftrace one-liner once the hang is triggered:

	bpftrace -e 'kretfunc:nvme_queue_rq  { @[retval, kstack]=count() }'

It is very likely that a memory allocation inside nvme_queue_rq()
cannot succeed, so blk-mq has to keep retrying nvme_queue_rq() on the
above request.
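
If it is returning BLK_STS_RESOURCE, a slightly more targeted variant can
isolate just those returns (a sketch; the literal 9 assumes the value of
BLK_STS_RESOURCE in this kernel's blk_types.h, so double-check it against
your build):

	bpftrace -e 'kretfunc:nvme_queue_rq /retval == 9/ { @resource[kstack] = count() }'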


Thanks,
Ming



* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01  3:24                   ` Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5 Ming Lei
@ 2023-11-01 10:15                     ` Hannes Reinecke
  2023-11-01 10:26                       ` Jan Kara
                                         ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Hannes Reinecke @ 2023-11-01 10:15 UTC (permalink / raw)
  To: Ming Lei, Marek Marczykowski-Górecki
  Cc: Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm, linux-block,
	linux-nvme, ming.lei

On 11/1/23 04:24, Ming Lei wrote:
[...]
> The request staying in hctx->dispatch can't move on, and nvme_queue_rq()
> returns BLK_STS_RESOURCE constantly; you can verify this with the
> following bpftrace one-liner once the hang is triggered:
> 
> 	bpftrace -e 'kretfunc:nvme_queue_rq  { @[retval, kstack]=count() }'
> 
> It is very likely that a memory allocation inside nvme_queue_rq()
> cannot succeed, so blk-mq has to keep retrying nvme_queue_rq() on the
> above request.
> 
And that is something I've been wondering about (for quite some time now):
what _is_ the appropriate error handling for -ENOMEM?
At this time, we assume it to be a retryable error and re-run the queue
in the hope that things will sort themselves out.
But if they don't, we're stuck.
Can we somehow figure out whether we are making progress during submission,
and (at least) issue a warning once we detect a stall?
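
Just to sketch what I mean (purely hypothetical: the dispatch_stalls
counter and the threshold below do not exist today):

	#define BLK_MQ_STALL_WARN_LIMIT	1024

	/* Hypothetical: 'dispatch_stalls' is not an existing blk_mq_hw_ctx
	 * field. Count consecutive requeues of the dispatch list and warn
	 * once nothing has completed for a long stretch. */
	static void blk_mq_check_dispatch_stall(struct blk_mq_hw_ctx *hctx,
						bool made_progress)
	{
		if (made_progress) {
			hctx->dispatch_stalls = 0;
			return;
		}
		if (++hctx->dispatch_stalls == BLK_MQ_STALL_WARN_LIMIT)
			pr_warn("blk-mq: hctx%u stalled, %u requeues without progress\n",
				hctx->queue_num, hctx->dispatch_stalls);
	}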

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01 10:15                     ` Hannes Reinecke
@ 2023-11-01 10:26                       ` Jan Kara
  2023-11-01 11:23                       ` Ming Lei
  2023-11-01 12:16                       ` Mikulas Patocka
  2 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2023-11-01 10:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, Marek Marczykowski-Górecki, Jan Kara,
	Mikulas Patocka, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm, linux-block, linux-nvme, ming.lei

On Wed 01-11-23 11:15:02, Hannes Reinecke wrote:
[...]
> And that is something I've been wondering about (for quite some time now):
> what _is_ the appropriate error handling for -ENOMEM?
> At this time, we assume it to be a retryable error and re-run the queue
> in the hope that things will sort themselves out.
> But if they don't, we're stuck.
> Can we somehow figure out whether we are making progress during submission,
> and (at least) issue a warning once we detect a stall?

Well, but Marek has shown [1] that the machine is pretty far from being OOM
when it is stuck. So it doesn't seem like a simple OOM situation...

								Honza

[1] https://lore.kernel.org/all/ZTiJ3CO8w0jauOzW@mail-itl/

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01 10:15                     ` Hannes Reinecke
  2023-11-01 10:26                       ` Jan Kara
@ 2023-11-01 11:23                       ` Ming Lei
  2023-11-02 14:02                         ` Keith Busch
  2023-11-01 12:16                       ` Mikulas Patocka
  2 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2023-11-01 11:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, Marek Marczykowski-Górecki, Jan Kara,
	Mikulas Patocka, Vlastimil Babka, Andrew Morton, Matthew Wilcox,
	Michal Hocko, stable, regressions, Alasdair Kergon, Mike Snitzer,
	dm-devel, linux-mm, linux-block, linux-nvme

On Wed, Nov 01, 2023 at 11:15:02AM +0100, Hannes Reinecke wrote:
[...]
> And that is something I've been wondering about (for quite some time now):
> what _is_ the appropriate error handling for -ENOMEM?

That part is just my guess.

Actually it shouldn't fail, since the SGL allocation is backed by a
memory pool, but there are also the DMA pool allocations and the DMA
mapping itself.
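
The dma_pool path can be checked the same way, if you want to rule it out
(a sketch; it assumes kretfunc can attach to dma_pool_alloc on this
kernel, which should hold since the function is exported rather than
inlined):

	bpftrace -e 'kretfunc:dma_pool_alloc /retval == 0/ { @dma_fail[kstack] = count() }'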

> At this time, we assume it to be a retryable error and re-run the queue
> in the hope that things will sort themselves out.

It should not be hard to figure out why nvme_queue_rq() can't move on.

> But if they don't, we're stuck.
> Can we somehow figure out whether we are making progress during submission,
> and (at least) issue a warning once we detect a stall?

That would require counting retries per request, and people are often
reluctant to add anything to struct request or struct bio in the fast
path. Also, this kind of issue is easy to observe via blk-mq debugfs or
bpftrace.
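
For example, once the hang triggers, something like this (paths taken from
the debugfs dump above) shows whether the same request keeps sitting in the
dispatch list:

	watch -n 1 'grep -aH . /sys/kernel/debug/block/nvme0n1/hctx4/dispatch /sys/kernel/debug/block/nvme0n1/hctx4/dispatch_busy'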


Thanks, 
Ming




* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01 10:15                     ` Hannes Reinecke
  2023-11-01 10:26                       ` Jan Kara
  2023-11-01 11:23                       ` Ming Lei
@ 2023-11-01 12:16                       ` Mikulas Patocka
  2 siblings, 0 replies; 6+ messages in thread
From: Mikulas Patocka @ 2023-11-01 12:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, Marek Marczykowski-Górecki, Jan Kara,
	Vlastimil Babka, Andrew Morton, Matthew Wilcox, Michal Hocko,
	stable, regressions, Alasdair Kergon, Mike Snitzer, dm-devel,
	linux-mm, linux-block, linux-nvme, ming.lei



On Wed, 1 Nov 2023, Hannes Reinecke wrote:

> And that is something I've been wondering about (for quite some time now):
> what _is_ the appropriate error handling for -ENOMEM?
> At this time, we assume it to be a retryable error and re-run the queue
> in the hope that things will sort themselves out.
> But if they don't, we're stuck.
> Can we somehow figure out whether we are making progress during submission,
> and (at least) issue a warning once we detect a stall?

The appropriate way is to use mempools: mempool_alloc() (called with
__GFP_DIRECT_RECLAIM set) can never fail.

But some kernel code does plain GFP_NOIO allocations in the I/O path, and
its authors hope that they will get away with it.
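
Schematically (the struct and pool names here are made up for
illustration):

	#include <linux/mempool.h>

	/* Names made up for illustration only. */
	struct my_iod { int dummy; };
	static mempool_t *iod_pool;

	/* Init time: reserve a minimum number of elements up front. */
	static int iod_pool_init(void)
	{
		iod_pool = mempool_create_kmalloc_pool(4, sizeof(struct my_iod));
		return iod_pool ? 0 : -ENOMEM;
	}

	/* I/O path: GFP_NOIO includes __GFP_DIRECT_RECLAIM, so this sleeps
	 * until an element comes back to the pool instead of returning NULL. */
	static struct my_iod *iod_get(void)
	{
		return mempool_alloc(iod_pool, GFP_NOIO);
	}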

Mikulas




* Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5
  2023-11-01 11:23                       ` Ming Lei
@ 2023-11-02 14:02                         ` Keith Busch
  0 siblings, 0 replies; 6+ messages in thread
From: Keith Busch @ 2023-11-02 14:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Ming Lei, Marek Marczykowski-Górecki,
	Jan Kara, Mikulas Patocka, Vlastimil Babka, Andrew Morton,
	Matthew Wilcox, Michal Hocko, stable, regressions,
	Alasdair Kergon, Mike Snitzer, dm-devel, linux-mm, linux-block,
	linux-nvme

On Wed, Nov 01, 2023 at 07:23:05PM +0800, Ming Lei wrote:
> On Wed, Nov 01, 2023 at 11:15:02AM +0100, Hannes Reinecke wrote:
> > > nvme_queue_rq() on the above request.
> > > 
> > And that is something I've been wondering about (for quite some time now):
> > what _is_ the appropriate error handling for -ENOMEM?
> 
> That part is just my guess.
> 
> Actually it shouldn't fail, since the SGL allocation is backed by a
> memory pool, but there are also the DMA pool allocations and the DMA
> mapping itself.
> 
> > At this time, we assume it to be a retryable error and re-run the queue
> > in the hope that things will sort themselves out.
> 
> It should not be hard to figure out why nvme_queue_rq() can't move on.

There are only a few reasons why nvme_queue_rq() would return
BLK_STS_RESOURCE for a typical read/write command:

  - DMA mapping error
  - Can't allocate an SGL from the mempool
  - Can't allocate a PRP list from the dma_pool
  - Controller stuck in the resetting state

We should always be able to get at least one allocation from the memory
pools, so I think the only cases where the driver has no way to guarantee
eventual forward progress are the DMA mapping error conditions. Is there
some other limit that the driver needs to consider when configuring its
largest supported transfers?
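
One limit that comes to mind is whatever the DMA layer can map in one go;
a driver can clamp its transfer size to that up front instead of
discovering it per request. A sketch (illustrative, not the actual nvme
code):

	/* Bound the max transfer by the DMA mapping limit so that a single
	 * request can never exceed what dma_map can handle. */
	size_t dma_limit = dma_max_mapping_size(dma_dev);

	if (dma_limit < SIZE_MAX)
		max_hw_sectors = min_t(u32, max_hw_sectors, dma_limit >> SECTOR_SHIFT);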


