public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
* Crash in NVME tracing on 5.10LTS
@ 2024-01-09 18:17 John Sperbeck
  2024-01-11  9:46 ` Greg KH
  0 siblings, 1 reply; 6+ messages in thread
From: John Sperbeck @ 2024-01-09 18:17 UTC (permalink / raw)
  To: Bean Huo, Sagi Grimberg; +Cc: khazhy, Jens Axboe, stable

With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the
following tracing commands will trigger a crash due to a NULL pointer
dereference:

KDIR=/sys/kernel/debug/tracing
echo 1 > $KDIR/tracing_on
echo 1 > $KDIR/events/nvme/enable
echo "Waiting for trace events..."
cat $KDIR/trace_pipe

The backtrace looks something like this:

Call Trace:
 <IRQ>
 ? __die_body+0x6b/0xb0
 ? __die+0x9e/0xb0
 ? no_context+0x3eb/0x460
 ? ttwu_do_activate+0xf0/0x120
 ? __bad_area_nosemaphore+0x157/0x200
 ? select_idle_sibling+0x2f/0x410
 ? bad_area_nosemaphore+0x13/0x20
 ? do_user_addr_fault+0x2ab/0x360
 ? exc_page_fault+0x69/0x180
 ? asm_exc_page_fault+0x1e/0x30
 ? trace_event_raw_event_nvme_complete_rq+0xba/0x170
 ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170
 nvme_complete_rq+0x168/0x170
 nvme_pci_complete_rq+0x16c/0x1f0
 nvme_handle_cqe+0xde/0x190
 nvme_irq+0x78/0x100
 __handle_irq_event_percpu+0x77/0x1e0
 handle_irq_event+0x54/0xb0
 handle_edge_irq+0xdf/0x230
 asm_call_irq_on_stack+0xf/0x20
 </IRQ>
 common_interrupt+0x9e/0x150
 asm_common_interrupt+0x1e/0x40

It looks to me like these two upstream commits were backported to 5.10:

679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()")
e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")

But they depend on this upstream commit to initialize the 'cmd' field in
some cases:

f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")

Does it sound like I'm on the right track?  The 5.15LTS and later seems to be okay.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Crash in NVME tracing on 5.10LTS
  2024-01-09 18:17 Crash in NVME tracing on 5.10LTS John Sperbeck
@ 2024-01-11  9:46 ` Greg KH
  2024-01-11 17:00   ` John Sperbeck
  0 siblings, 1 reply; 6+ messages in thread
From: Greg KH @ 2024-01-11  9:46 UTC (permalink / raw)
  To: John Sperbeck; +Cc: Bean Huo, Sagi Grimberg, khazhy, Jens Axboe, stable

On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
> With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the
> following tracing commands will trigger a crash due to a NULL pointer
> dereference:
> 
> KDIR=/sys/kernel/debug/tracing
> echo 1 > $KDIR/tracing_on
> echo 1 > $KDIR/events/nvme/enable
> echo "Waiting for trace events..."
> cat $KDIR/trace_pipe
> 
> The backtrace looks something like this:
> 
> Call Trace:
>  <IRQ>
>  ? __die_body+0x6b/0xb0
>  ? __die+0x9e/0xb0
>  ? no_context+0x3eb/0x460
>  ? ttwu_do_activate+0xf0/0x120
>  ? __bad_area_nosemaphore+0x157/0x200
>  ? select_idle_sibling+0x2f/0x410
>  ? bad_area_nosemaphore+0x13/0x20
>  ? do_user_addr_fault+0x2ab/0x360
>  ? exc_page_fault+0x69/0x180
>  ? asm_exc_page_fault+0x1e/0x30
>  ? trace_event_raw_event_nvme_complete_rq+0xba/0x170
>  ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170
>  nvme_complete_rq+0x168/0x170
>  nvme_pci_complete_rq+0x16c/0x1f0
>  nvme_handle_cqe+0xde/0x190
>  nvme_irq+0x78/0x100
>  __handle_irq_event_percpu+0x77/0x1e0
>  handle_irq_event+0x54/0xb0
>  handle_edge_irq+0xdf/0x230
>  asm_call_irq_on_stack+0xf/0x20
>  </IRQ>
>  common_interrupt+0x9e/0x150
>  asm_common_interrupt+0x1e/0x40
> 
> It looks to me like these two upstream commits were backported to 5.10:
> 
> 679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()")
> e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
> 
> But they depend on this upstream commit to initialize the 'cmd' field in
> some cases:
> 
> f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
> 
> Does it sound like I'm on the right track?  The 5.15LTS and later seems to be okay.
> 

If you apply that commit, does it solve the issue for you?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Crash in NVME tracing on 5.10LTS
  2024-01-11  9:46 ` Greg KH
@ 2024-01-11 17:00   ` John Sperbeck
  2024-01-11 19:38     ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: John Sperbeck @ 2024-01-11 17:00 UTC (permalink / raw)
  To: Greg KH; +Cc: Bean Huo, Sagi Grimberg, khazhy, Jens Axboe, stable

On Thu, Jan 11, 2024 at 1:46 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
> > With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the
> > following tracing commands will trigger a crash due to a NULL pointer
> > dereference:
> >
> > KDIR=/sys/kernel/debug/tracing
> > echo 1 > $KDIR/tracing_on
> > echo 1 > $KDIR/events/nvme/enable
> > echo "Waiting for trace events..."
> > cat $KDIR/trace_pipe
> >
> > The backtrace looks something like this:
> >
> > Call Trace:
> >  <IRQ>
> >  ? __die_body+0x6b/0xb0
> >  ? __die+0x9e/0xb0
> >  ? no_context+0x3eb/0x460
> >  ? ttwu_do_activate+0xf0/0x120
> >  ? __bad_area_nosemaphore+0x157/0x200
> >  ? select_idle_sibling+0x2f/0x410
> >  ? bad_area_nosemaphore+0x13/0x20
> >  ? do_user_addr_fault+0x2ab/0x360
> >  ? exc_page_fault+0x69/0x180
> >  ? asm_exc_page_fault+0x1e/0x30
> >  ? trace_event_raw_event_nvme_complete_rq+0xba/0x170
> >  ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170
> >  nvme_complete_rq+0x168/0x170
> >  nvme_pci_complete_rq+0x16c/0x1f0
> >  nvme_handle_cqe+0xde/0x190
> >  nvme_irq+0x78/0x100
> >  __handle_irq_event_percpu+0x77/0x1e0
> >  handle_irq_event+0x54/0xb0
> >  handle_edge_irq+0xdf/0x230
> >  asm_call_irq_on_stack+0xf/0x20
> >  </IRQ>
> >  common_interrupt+0x9e/0x150
> >  asm_common_interrupt+0x1e/0x40
> >
> > It looks to me like these two upstream commits were backported to 5.10:
> >
> > 679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()")
> > e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
> >
> > But they depend on this upstream commit to initialize the 'cmd' field in
> > some cases:
> >
> > f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
> >
> > Does it sound like I'm on the right track?  The 5.15LTS and later seems to be okay.
> >
>
> If you apply that commit, does it solve the issue for you?
>
> thanks,
>
> greg k-h

The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
upstream commit doesn't apply cleanly to 5.10LTS.  If I adjust it to
fit, then the crash no longer occurs for me.

A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in
trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.

My leaning would be for a revert from 5.10LTS, but I think the
maintainers would have better insight then me.  It's also possible
that this isn't serious enough to worry about in general.  I don't
really know.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Crash in NVME tracing on 5.10LTS
  2024-01-11 17:00   ` John Sperbeck
@ 2024-01-11 19:38     ` Jens Axboe
  2024-01-13  9:34       ` Greg KH
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2024-01-11 19:38 UTC (permalink / raw)
  To: John Sperbeck, Greg KH; +Cc: Bean Huo, Sagi Grimberg, khazhy, stable

On 1/11/24 10:00 AM, John Sperbeck wrote:
> On Thu, Jan 11, 2024 at 1:46?AM Greg KH <gregkh@linuxfoundation.org> wrote:
>>
>> On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
>>> With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the
>>> following tracing commands will trigger a crash due to a NULL pointer
>>> dereference:
>>>
>>> KDIR=/sys/kernel/debug/tracing
>>> echo 1 > $KDIR/tracing_on
>>> echo 1 > $KDIR/events/nvme/enable
>>> echo "Waiting for trace events..."
>>> cat $KDIR/trace_pipe
>>>
>>> The backtrace looks something like this:
>>>
>>> Call Trace:
>>>  <IRQ>
>>>  ? __die_body+0x6b/0xb0
>>>  ? __die+0x9e/0xb0
>>>  ? no_context+0x3eb/0x460
>>>  ? ttwu_do_activate+0xf0/0x120
>>>  ? __bad_area_nosemaphore+0x157/0x200
>>>  ? select_idle_sibling+0x2f/0x410
>>>  ? bad_area_nosemaphore+0x13/0x20
>>>  ? do_user_addr_fault+0x2ab/0x360
>>>  ? exc_page_fault+0x69/0x180
>>>  ? asm_exc_page_fault+0x1e/0x30
>>>  ? trace_event_raw_event_nvme_complete_rq+0xba/0x170
>>>  ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170
>>>  nvme_complete_rq+0x168/0x170
>>>  nvme_pci_complete_rq+0x16c/0x1f0
>>>  nvme_handle_cqe+0xde/0x190
>>>  nvme_irq+0x78/0x100
>>>  __handle_irq_event_percpu+0x77/0x1e0
>>>  handle_irq_event+0x54/0xb0
>>>  handle_edge_irq+0xdf/0x230
>>>  asm_call_irq_on_stack+0xf/0x20
>>>  </IRQ>
>>>  common_interrupt+0x9e/0x150
>>>  asm_common_interrupt+0x1e/0x40
>>>
>>> It looks to me like these two upstream commits were backported to 5.10:
>>>
>>> 679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()")
>>> e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
>>>
>>> But they depend on this upstream commit to initialize the 'cmd' field in
>>> some cases:
>>>
>>> f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
>>>
>>> Does it sound like I'm on the right track?  The 5.15LTS and later seems to be okay.
>>>
>>
>> If you apply that commit, does it solve the issue for you?
>>
>> thanks,
>>
>> greg k-h
> 
> The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
> upstream commit doesn't apply cleanly to 5.10LTS.  If I adjust it to
> fit, then the crash no longer occurs for me.
> 
> A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in
> trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.
> 
> My leaning would be for a revert from 5.10LTS, but I think the
> maintainers would have better insight then me.  It's also possible
> that this isn't serious enough to worry about in general.  I don't
> really know.

Either solution is fine with me, doesn't really matter. I was wondering
how this ended up in stable, and it looks like it was one of those
auto-selections... Those seem particularly dangerous the further back
you go.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Crash in NVME tracing on 5.10LTS
  2024-01-11 19:38     ` Jens Axboe
@ 2024-01-13  9:34       ` Greg KH
  2024-01-13 16:11         ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Greg KH @ 2024-01-13  9:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: John Sperbeck, Bean Huo, Sagi Grimberg, khazhy, stable

On Thu, Jan 11, 2024 at 12:38:03PM -0700, Jens Axboe wrote:
> On 1/11/24 10:00 AM, John Sperbeck wrote:
> > On Thu, Jan 11, 2024 at 1:46?AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >>
> >> On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
> >>> With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the
> >>> following tracing commands will trigger a crash due to a NULL pointer
> >>> dereference:
> >>>
> >>> KDIR=/sys/kernel/debug/tracing
> >>> echo 1 > $KDIR/tracing_on
> >>> echo 1 > $KDIR/events/nvme/enable
> >>> echo "Waiting for trace events..."
> >>> cat $KDIR/trace_pipe
> >>>
> >>> The backtrace looks something like this:
> >>>
> >>> Call Trace:
> >>>  <IRQ>
> >>>  ? __die_body+0x6b/0xb0
> >>>  ? __die+0x9e/0xb0
> >>>  ? no_context+0x3eb/0x460
> >>>  ? ttwu_do_activate+0xf0/0x120
> >>>  ? __bad_area_nosemaphore+0x157/0x200
> >>>  ? select_idle_sibling+0x2f/0x410
> >>>  ? bad_area_nosemaphore+0x13/0x20
> >>>  ? do_user_addr_fault+0x2ab/0x360
> >>>  ? exc_page_fault+0x69/0x180
> >>>  ? asm_exc_page_fault+0x1e/0x30
> >>>  ? trace_event_raw_event_nvme_complete_rq+0xba/0x170
> >>>  ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170
> >>>  nvme_complete_rq+0x168/0x170
> >>>  nvme_pci_complete_rq+0x16c/0x1f0
> >>>  nvme_handle_cqe+0xde/0x190
> >>>  nvme_irq+0x78/0x100
> >>>  __handle_irq_event_percpu+0x77/0x1e0
> >>>  handle_irq_event+0x54/0xb0
> >>>  handle_edge_irq+0xdf/0x230
> >>>  asm_call_irq_on_stack+0xf/0x20
> >>>  </IRQ>
> >>>  common_interrupt+0x9e/0x150
> >>>  asm_common_interrupt+0x1e/0x40
> >>>
> >>> It looks to me like these two upstream commits were backported to 5.10:
> >>>
> >>> 679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()")
> >>> e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
> >>>
> >>> But they depend on this upstream commit to initialize the 'cmd' field in
> >>> some cases:
> >>>
> >>> f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
> >>>
> >>> Does it sound like I'm on the right track?  The 5.15LTS and later seems to be okay.
> >>>
> >>
> >> If you apply that commit, does it solve the issue for you?
> >>
> >> thanks,
> >>
> >> greg k-h
> > 
> > The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
> > upstream commit doesn't apply cleanly to 5.10LTS.  If I adjust it to
> > fit, then the crash no longer occurs for me.
> > 
> > A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in
> > trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.
> > 
> > My leaning would be for a revert from 5.10LTS, but I think the
> > maintainers would have better insight then me.  It's also possible
> > that this isn't serious enough to worry about in general.  I don't
> > really know.
> 
> Either solution is fine with me, doesn't really matter. I was wondering
> how this ended up in stable, and it looks like it was one of those
> auto-selections... Those seem particularly dangerous the further back
> you go.

Now reverted, thanks.  But note, that commit does say it fixes an issue
this far back, which is why it was applied.

greg k-h

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Crash in NVME tracing on 5.10LTS
  2024-01-13  9:34       ` Greg KH
@ 2024-01-13 16:11         ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2024-01-13 16:11 UTC (permalink / raw)
  To: Greg KH; +Cc: John Sperbeck, Bean Huo, Sagi Grimberg, khazhy, stable

On 1/13/24 2:34 AM, Greg KH wrote:
> On Thu, Jan 11, 2024 at 12:38:03PM -0700, Jens Axboe wrote:
>> On 1/11/24 10:00 AM, John Sperbeck wrote:
>>> On Thu, Jan 11, 2024 at 1:46?AM Greg KH <gregkh@linuxfoundation.org> wrote:
>>>>
>>>> On Tue, Jan 09, 2024 at 10:17:22AM -0800, John Sperbeck wrote:
>>>>> With 5.10LTS (e.g., 5.10.206), on a machine using an NVME device, the
>>>>> following tracing commands will trigger a crash due to a NULL pointer
>>>>> dereference:
>>>>>
>>>>> KDIR=/sys/kernel/debug/tracing
>>>>> echo 1 > $KDIR/tracing_on
>>>>> echo 1 > $KDIR/events/nvme/enable
>>>>> echo "Waiting for trace events..."
>>>>> cat $KDIR/trace_pipe
>>>>>
>>>>> The backtrace looks something like this:
>>>>>
>>>>> Call Trace:
>>>>>  <IRQ>
>>>>>  ? __die_body+0x6b/0xb0
>>>>>  ? __die+0x9e/0xb0
>>>>>  ? no_context+0x3eb/0x460
>>>>>  ? ttwu_do_activate+0xf0/0x120
>>>>>  ? __bad_area_nosemaphore+0x157/0x200
>>>>>  ? select_idle_sibling+0x2f/0x410
>>>>>  ? bad_area_nosemaphore+0x13/0x20
>>>>>  ? do_user_addr_fault+0x2ab/0x360
>>>>>  ? exc_page_fault+0x69/0x180
>>>>>  ? asm_exc_page_fault+0x1e/0x30
>>>>>  ? trace_event_raw_event_nvme_complete_rq+0xba/0x170
>>>>>  ? trace_event_raw_event_nvme_complete_rq+0xa3/0x170
>>>>>  nvme_complete_rq+0x168/0x170
>>>>>  nvme_pci_complete_rq+0x16c/0x1f0
>>>>>  nvme_handle_cqe+0xde/0x190
>>>>>  nvme_irq+0x78/0x100
>>>>>  __handle_irq_event_percpu+0x77/0x1e0
>>>>>  handle_irq_event+0x54/0xb0
>>>>>  handle_edge_irq+0xdf/0x230
>>>>>  asm_call_irq_on_stack+0xf/0x20
>>>>>  </IRQ>
>>>>>  common_interrupt+0x9e/0x150
>>>>>  asm_common_interrupt+0x1e/0x40
>>>>>
>>>>> It looks to me like these two upstream commits were backported to 5.10:
>>>>>
>>>>> 679c54f2de67 ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()")
>>>>> e7006de6c238 ("nvme: code command_id with a genctr for use-after-free validation")
>>>>>
>>>>> But they depend on this upstream commit to initialize the 'cmd' field in
>>>>> some cases:
>>>>>
>>>>> f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
>>>>>
>>>>> Does it sound like I'm on the right track?  The 5.15LTS and later seems to be okay.
>>>>>
>>>>
>>>> If you apply that commit, does it solve the issue for you?
>>>>
>>>> thanks,
>>>>
>>>> greg k-h
>>>
>>> The f4b9e6c90c57 ("nvme: use driver pdu command for passthrough")
>>> upstream commit doesn't apply cleanly to 5.10LTS.  If I adjust it to
>>> fit, then the crash no longer occurs for me.
>>>
>>> A revert of 706960d328f5 ("nvme: use command_id instead of req->tag in
>>> trace_nvme_complete_rq()") from 5.10LTS also prevents the crash.
>>>
>>> My leaning would be for a revert from 5.10LTS, but I think the
>>> maintainers would have better insight then me.  It's also possible
>>> that this isn't serious enough to worry about in general.  I don't
>>> really know.
>>
>> Either solution is fine with me, doesn't really matter. I was wondering
>> how this ended up in stable, and it looks like it was one of those
>> auto-selections... Those seem particularly dangerous the further back
>> you go.
> 
> Now reverted, thanks.  But note, that commit does say it fixes an issue
> this far back, which is why it was applied.

Yeah, looking into it, I can see how this was applied. And hard to know
not to, it's one of those boundary conditions that may happen for stable
backports and that only the ones intimate with the nvme code will spot.
Just one of those things, I think.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-01-13 16:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-09 18:17 Crash in NVME tracing on 5.10LTS John Sperbeck
2024-01-11  9:46 ` Greg KH
2024-01-11 17:00   ` John Sperbeck
2024-01-11 19:38     ` Jens Axboe
2024-01-13  9:34       ` Greg KH
2024-01-13 16:11         ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox