* [PATCH] nvmet-tcp: check INIT_FAILED before nvmet_req_uninit in digest error path
  [not found] ` <2026031827-carving-overdrawn-5b1f@gregkh>
@ 2026-03-18 22:56 ` Shivam Kumar
  2026-03-20  7:46   ` Christoph Hellwig
  [not found]        ` <CA+ysrSLcSJSgEC7i0GYXPiRY9sbbyEGT60Nh1SJ6PqXWENgSOw@mail.gmail.com>
  1 sibling, 1 reply; 6+ messages in thread
From: Shivam Kumar @ 2026-03-18 22:56 UTC (permalink / raw)
  To: gregkh; +Cc: security, hch, sagi, kch, linux-nvme, kumar.shivam43666

In nvmet_tcp_try_recv_ddgst(), when a data digest mismatch is detected,
nvmet_req_uninit() is called unconditionally. However, if the command
arrived via the nvmet_tcp_handle_req_failure() path, nvmet_req_init()
had returned false and percpu_ref_tryget_live() was never executed. The
unconditional percpu_ref_put() inside nvmet_req_uninit() then causes a
refcount underflow, leading to a WARNING in
percpu_ref_switch_to_atomic_rcu, a use-after-free diagnostic, and
eventually a permanent workqueue deadlock.

Check cmd->flags & NVMET_TCP_F_INIT_FAILED before calling
nvmet_req_uninit(), matching the existing pattern in
nvmet_tcp_execute_request().

Signed-off-by: Shivam Kumar <kumar.shivam43666@gmail.com>
---
 drivers/nvme/target/tcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index acc71a26733f..a456dd2fd8bd 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1310,7 +1310,8 @@ static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
 			queue->idx, cmd->req.cmd->common.command_id,
 			queue->pdu.cmd.hdr.type, le32_to_cpu(cmd->recv_ddgst),
 			le32_to_cpu(cmd->exp_ddgst));
-	nvmet_req_uninit(&cmd->req);
+	if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED))
+		nvmet_req_uninit(&cmd->req);
 	nvmet_tcp_free_cmd_buffers(cmd);
 	nvmet_tcp_fatal_error(queue);
 	ret = -EPROTO;
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 6+ messages in thread
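[Editorial note: a minimal single-threaded userspace model of the unbalanced put described in the patch above. This is NOT kernel code; the struct, helper names, and the plain `int` counter are invented for the sketch. "ref" stands in for the sq percpu_ref and CMD_INIT_FAILED for NVMET_TCP_F_INIT_FAILED.]

```c
/* Userspace model of the unbalanced put -- NOT kernel code.
 * All names here are invented stand-ins for the nvmet-tcp logic. */
#include <assert.h>
#include <stdbool.h>

enum { CMD_INIT_FAILED = 1 << 0 };

struct cmd {
	int flags;
	int ref;	/* models the percpu_ref count held for this request */
};

/* models nvmet_req_init(): on failure, no reference is ever taken */
static bool req_init(struct cmd *c, bool fail)
{
	if (fail) {
		c->flags |= CMD_INIT_FAILED;	/* the handle_req_failure path */
		return false;
	}
	c->ref++;				/* models percpu_ref_tryget_live() */
	return true;
}

/* models nvmet_req_uninit(): unconditionally drops a reference */
static void req_uninit(struct cmd *c)
{
	c->ref--;				/* models percpu_ref_put() */
}

/* digest-error path with the fix applied: skip uninit when init failed */
static int ddgst_error_path(struct cmd *c)
{
	if (!(c->flags & CMD_INIT_FAILED))
		req_uninit(c);
	return c->ref;				/* a negative value would be the underflow */
}
```

Without the flag check, the first case below would drive the count to -1, which is the userspace analogue of the percpu_ref underflow in the report.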
* Re: [PATCH] nvmet-tcp: check INIT_FAILED before nvmet_req_uninit in digest error path
  2026-03-18 22:56 ` [PATCH] nvmet-tcp: check INIT_FAILED before nvmet_req_uninit in digest error path Shivam Kumar
@ 2026-03-20  7:46 ` Christoph Hellwig
  2026-04-05 19:54   ` Shivam Kumar
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:46 UTC (permalink / raw)
  To: Shivam Kumar; +Cc: gregkh, security, sagi, kch, linux-nvme

Looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Although I'd prefer if someone more familiar with the TCP code could
also look over it.

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] nvmet-tcp: check INIT_FAILED before nvmet_req_uninit in digest error path
  2026-03-20  7:46 ` Christoph Hellwig
@ 2026-04-05 19:54 ` Shivam Kumar
  0 siblings, 0 replies; 6+ messages in thread
From: Shivam Kumar @ 2026-04-05 19:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: gregkh, security, sagi, kch, linux-nvme

Hi all,

Just following up on this patch. Please let me know if there are any
concerns or if any changes are needed.

Thanks,
Shivam Kumar

On Fri, Mar 20, 2026 at 3:46 AM Christoph Hellwig <hch@lst.de> wrote:
>
> Looks good to me:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> Although I'd prefer if someone more familiar with the TCP code could
> also look over it.
>

^ permalink raw reply	[flat|nested] 6+ messages in thread
[parent not found: <CA+ysrSLcSJSgEC7i0GYXPiRY9sbbyEGT60Nh1SJ6PqXWENgSOw@mail.gmail.com>]
[parent not found: <CA+ysrSL+3YVGvTKhJ=jL92XZPwjoDpt5D2BK-HeCU975vO-fCQ@mail.gmail.com>]
[parent not found: <CA+ysrSJQ2tDiB=zL+GwF=ckhmO62v6GC2bWVUVVDVn7d8Km1mw@mail.gmail.com>]
* Re: [Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock
  [not found] ` <CA+ysrSJQ2tDiB=zL+GwF=ckhmO62v6GC2bWVUVVDVn7d8Km1mw@mail.gmail.com>
@ 2026-03-25  4:29 ` Chaitanya Kulkarni
  2026-04-05 19:55   ` Shivam Kumar
  0 siblings, 1 reply; 6+ messages in thread
From: Chaitanya Kulkarni @ 2026-03-25  4:29 UTC (permalink / raw)
  To: Shivam Kumar, Greg KH
  Cc: Christoph Hellwig, sagi@grimberg.me, Chaitanya Kulkarni,
	security@kernel.org, linux-nvme@lists.infradead.org

+linux-nvme list as well.

On 3/24/26 21:22, Shivam Kumar wrote:
> Hi Greg, Christoph, Sagi, Chaitanya,
>
> Following up on my previous report (percpu_ref underflow in
> nvmet_tcp_try_recv_ddgst). During testing with the percpu_ref patch
> applied, I identified a separate vulnerability triggered by the same
> PoC. This one affects the queue kref (not the SQ percpu_ref) and has a
> different root cause.
>
> # Bug Overview
>
> nvmet_tcp_handle_icreq() sets queue->state = NVMET_TCP_Q_LIVE without
> holding state_lock and without checking whether teardown has already
> started. When a client closes the connection immediately after sending
> an ICReq and a malformed CapsuleCmd, the FIN can arrive and be
> processed before io_work reads the ICReq from the socket buffer. The
> FIN handler sets queue->state = NVMET_TCP_Q_DISCONNECTING and calls
> kref_put (1->0). Then io_work processes the ICReq and handle_icreq
> blindly overwrites the state back to NVMET_TCP_Q_LIVE. This defeats
> the guard in nvmet_tcp_schedule_release_queue(), which relies on state
> == NVMET_TCP_Q_DISCONNECTING to prevent a second kref_put.
>
> When the subsequent digest error triggers nvmet_tcp_fatal_error() ->
> kernel_sock_shutdown(), the socket teardown synchronously processes
> any pending TCP packets in the socket backlog via release_sock() ->
> tcp_done() -> nvmet_tcp_state_change() ->
> nvmet_tcp_schedule_release_queue(). Because the state was overwritten
> to LIVE, the guard passes and kref_put is called on an already-zero
> kref, triggering the underflow.
>
> # Root Cause
>
> The vulnerable code in nvmet_tcp_handle_icreq() (tcp.c):
>
>     queue->state = NVMET_TCP_Q_LIVE;
>     nvmet_prepare_receive_pdu(queue);
>     return 0;
>
> This state transition is not protected by state_lock and does not
> check the current state. All other state transitions to
> NVMET_TCP_Q_DISCONNECTING in nvmet_tcp_schedule_release_queue() are
> protected by the spinlock, but handle_icreq bypasses this
> synchronization entirely.
>
> # Race Sequence
>
> The PoC sends ICReq + CapsuleCmd + close() in rapid succession. On the wire:
>
>     [TCP handshake] -> [ICReq] -> [CapsuleCmd] -> [FIN]
>
> All data arrives in order in the socket buffer, but the FIN triggers
> a TCP state change callback (softirq context) that races with io_work
> (process context) reading the application data.
>
> # ftrace evidence (queue 29)
>
> Step 1: FIN arrives, teardown begins (softirq context):
>
> kworker/-74 1b..2. 125826488us : nvmet_tcp_state_change: DEBUG queue
> 29: In schedule_release before checking state variable state is 0,
> kref=1, caller=tcp_fin+0x278/0x4f0
> kworker/-74 1b..2. 125826492us : nvmet_tcp_state_change: DEBUG queue
> 29: kref_put in schedule_release state changed to 3, kref=1,
> caller=tcp_fin+0x278/0x4f0
>
> State transitions from CONNECTING(0) to DISCONNECTING(3), kref decremented 1->0.
>
> Step 2: io_work processes ICReq, overwrites state (process context):
>
> kworker/-74 1..... 125826529us : nvmet_tcp_try_recv_pdu: DEBUG queue
> 29: Inside nvmet_tcp_handle_icreq_pdu before changing state is 3,
> kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
> kworker/-74 1..... 125826532us : nvmet_tcp_try_recv_pdu: DEBUG queue
> 29: Inside nvmet_tcp_handle_icreq_pdu after changing state is 2,
> kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
>
> handle_icreq sees state=DISCONNECTING(3) but overwrites it to LIVE(2)
> without holding state_lock. kref is already 0.
>
> Step 3: Digest error triggers fatal_error -> kernel_sock_shutdown
> (process context):
>
> kworker/-74 1..... 125831291us : nvmet_tcp_fatal_error: DEBUG queue
> 29: fatal_error, state=2, ctrl=0000000000000000,
> caller=nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
>
> Step 4: kernel_sock_shutdown -> release_sock processes backlog -> tcp_done
> re-enters state_change, guard bypassed because state is LIVE(2):
>
> kworker/-74 1b..2. 125831626us : nvmet_tcp_state_change: DEBUG queue
> 29: In schedule_release before checking state variable state is 2,
> kref=0, caller=tcp_done+0x1d7/0x350
> kworker/-74 1b..2. 125831629us : nvmet_tcp_state_change: DEBUG queue
> 29: kref_put in schedule_release state changed to 3, kref=0,
> caller=tcp_done+0x1d7/0x350
>
> schedule_release_queue sees state=LIVE(2), the guard passes, and
> kref_put is called on kref=0 -> underflow -> CRASH.
>
> The critical evidence is Step 2: handle_icreq reads state=3(DISCONNECTING)
> and writes state=2(LIVE), proving the unsynchronized overwrite.
>
> # Additional Observation: Double fatal_error Call
>
> The ftrace also shows that fatal_error is called twice per error: once
> from nvmet_tcp_try_recv_ddgst (digest mismatch) and again from
> nvmet_tcp_socket_error in nvmet_tcp_io_work (propagated error return).
> Both calls invoke kernel_sock_shutdown, doubling the window for the
> reentrant state_change.
>
> # Crash Trace (with percpu_ref patch applied)
>
> refcount_t: underflow; use-after-free.
> WARNING: lib/refcount.c:28 at refcount_warn_saturate+0xc5/0x110
> Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
> Call Trace:
>  <TASK>
>  nvmet_tcp_state_change+0x3e6/0x470
>  tcp_done+0x1d7/0x350
>  tcp_rcv_state_process+0x392c/0x6d80
>  tcp_v4_do_rcv+0x1b1/0xa80
>  __release_sock+0x2ab/0x340
>  release_sock+0x58/0x1c0
>  inet_shutdown+0x1f6/0x3c0
>  nvmet_tcp_fatal_error+0x13c/0x1c0
>  nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
>  nvmet_tcp_io_work+0xd43/0x1fc0
>  process_one_work+0x8cf/0x1a40
>
> # Affected Code
>
> - Subsystem: drivers/nvme/target/tcp.c
> - Function: nvmet_tcp_handle_icreq()
> - Tested on: Linux 7.0-rc3
>
> # Suggested Fix Direction
>
> The state transition in nvmet_tcp_handle_icreq() needs to be
> synchronized under state_lock with a check for
> NVMET_TCP_Q_DISCONNECTING. If the queue is already tearing down,
> handle_icreq should not set the state to LIVE and should stop
> processing. I'll defer to you on the exact error handling semantics
> (return value and cleanup path).
>
> This bug is independent of the percpu_ref underflow in the previous
> report. The percpu_ref patch fixes the SQ refcount underflow; this bug
> affects the queue kref through an unsynchronized state transition.
>
> I am available for further discussion and testing.
>
> Thanks,
> Shivam Kumar
> PhD Student, Computer Science
> SYNE Lab, Syracuse University
> skumar47@syr.edu

I'll try to have a look at this over the weekend.

-ck

^ permalink raw reply	[flat|nested] 6+ messages in thread
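[Editorial note: the race and guard described in the report above can be modeled in a short single-threaded userspace sketch. This is NOT kernel code; all helper names are invented, the Q_* values mirror the numeric states in the ftrace output, and state_lock is elided because the sketch is single-threaded (in the kernel both helpers would run under spin_lock_bh(&queue->state_lock)).]

```c
/* Single-threaded userspace model of the race and the suggested guard --
 * NOT kernel code; all names are invented stand-ins. */
#include <assert.h>

enum q_state { Q_CONNECTING = 0, Q_LIVE = 2, Q_DISCONNECTING = 3 };

struct queue {
	enum q_state state;
	int kref;	/* models the queue kref */
};

/* models nvmet_tcp_schedule_release_queue(): drop the reference only
 * on the first transition into DISCONNECTING */
static void schedule_release(struct queue *q)
{
	if (q->state != Q_DISCONNECTING) {
		q->state = Q_DISCONNECTING;
		q->kref--;		/* models kref_put() */
	}
}

/* models the suggested fix direction for nvmet_tcp_handle_icreq():
 * refuse to go LIVE once teardown has started; -1 stands in for an
 * error return that stops further processing */
static int icreq_go_live(struct queue *q)
{
	if (q->state == Q_DISCONNECTING)
		return -1;		/* teardown won the race; stop */
	q->state = Q_LIVE;
	return 0;
}
```

In the buggy ordering from the report, icreq_go_live() would overwrite DISCONNECTING with LIVE, and a second schedule_release() would then decrement the kref past zero; with the check in place the second call is a no-op.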
* Re: [Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock
  2026-03-25  4:29 ` [Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock Chaitanya Kulkarni
@ 2026-04-05 19:55 ` Shivam Kumar
  2026-04-06  4:23   ` Chaitanya Kulkarni
  0 siblings, 1 reply; 6+ messages in thread
From: Shivam Kumar @ 2026-04-05 19:55 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Greg KH, Christoph Hellwig, sagi@grimberg.me, security@kernel.org,
	linux-nvme@lists.infradead.org

Hi all,

Just following up on this report. Please let me know if there are any
questions about the analysis or if any additional information would be
helpful.

Thanks,
Shivam Kumar

On Wed, Mar 25, 2026 at 12:29 AM Chaitanya Kulkarni
<chaitanyak@nvidia.com> wrote:
>
> +linux-nvme list as well.
>
> On 3/24/26 21:22, Shivam Kumar wrote:
> > [...]

> I'll try to have a look at this over the weekend.
>
> -ck

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock
  2026-04-05 19:55 ` Shivam Kumar
@ 2026-04-06  4:23 ` Chaitanya Kulkarni
  0 siblings, 0 replies; 6+ messages in thread
From: Chaitanya Kulkarni @ 2026-04-06  4:23 UTC (permalink / raw)
  To: Shivam Kumar
  Cc: Greg KH, Christoph Hellwig, sagi@grimberg.me, security@kernel.org,
	linux-nvme@lists.infradead.org

Shivam,

On 4/5/26 12:55, Shivam Kumar wrote:
> Hi all,
>
> Just following up on this report. Please let me know if there are any
> questions about the analysis or if any additional information would be
> helpful.
>
> Thanks,
> Shivam Kumar
>
> [...]

Let's avoid top posting. Sorry for the delay.

Can the following patch fix the refcount underflow issue?
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 4b8b02341ddc..69e971b179ae 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1310,7 +1310,8 @@ static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
 			queue->idx, cmd->req.cmd->common.command_id,
 			queue->pdu.cmd.hdr.type, le32_to_cpu(cmd->recv_ddgst),
 			le32_to_cpu(cmd->exp_ddgst));
-	nvmet_req_uninit(&cmd->req);
+	if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED))
+		nvmet_req_uninit(&cmd->req);
 	nvmet_tcp_free_cmd_buffers(cmd);
 	nvmet_tcp_fatal_error(queue);
 	ret = -EPROTO;
-- 
2.39.5

and can something like the following fix the race between ICReq handling
and queue teardown?

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 69e971b179ae..5c03a6505319 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -408,6 +408,8 @@ static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
 static void nvmet_tcp_socket_error(struct nvmet_tcp_queue *queue, int status)
 {
 	queue->rcv_state = NVMET_TCP_RECV_ERR;
+	if (status == -ESHUTDOWN)
+		return;
 	if (status == -EPIPE || status == -ECONNRESET)
 		kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 	else
@@ -922,11 +924,21 @@ static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
 	iov.iov_len = sizeof(*icresp);
 	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
 	if (ret < 0) {
-		queue->state = NVMET_TCP_Q_FAILED;
+		spin_lock_bh(&queue->state_lock);
+		if (queue->state != NVMET_TCP_Q_DISCONNECTING)
+			queue->state = NVMET_TCP_Q_FAILED;
+		spin_unlock_bh(&queue->state_lock);
 		return ret; /* queue removal will cleanup */
 	}
 
+	spin_lock_bh(&queue->state_lock);
+	if (queue->state == NVMET_TCP_Q_DISCONNECTING) {
+		spin_unlock_bh(&queue->state_lock);
+		/* Tell nvmet_tcp_socket_error() teardown is already in progress. */
+		return -ESHUTDOWN;
+	}
 	queue->state = NVMET_TCP_Q_LIVE;
+	spin_unlock_bh(&queue->state_lock);
 	nvmet_prepare_receive_pdu(queue);
 	return 0;
 }
-- 
2.39.5

-ck

^ permalink raw reply related	[flat|nested] 6+ messages in thread
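[Editorial note: the -ESHUTDOWN convention in the second diff above can be illustrated with a tiny userspace sketch. This is NOT kernel code; the names, the counter, and the numeric error value are invented stand-ins. The point is only the control flow: when handle_icreq loses the race to teardown it returns a sentinel, and socket_error must then not shut the socket down a second time.]

```c
/* Userspace model of the -ESHUTDOWN early return -- NOT kernel code. */
#include <assert.h>

#define MY_ESHUTDOWN 108	/* stands in for the kernel's ESHUTDOWN */

static int shutdown_calls;	/* counts stand-ins for kernel_sock_shutdown() */

/* models nvmet_tcp_socket_error() with the proposed early return */
static void socket_error(int status)
{
	if (status == -MY_ESHUTDOWN)
		return;		/* teardown already in progress elsewhere */
	shutdown_calls++;	/* models kernel_sock_shutdown()/fatal_error */
}
```

This addresses the "Double fatal_error Call" observation in the report: only one of the two error paths ends up invoking the shutdown.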
end of thread, other threads:[~2026-04-06 4:24 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CA+ysrS+QrbGfOLk=e0=PZ-py-KO_xZgn827nEg7mh7hGhdUAAw@mail.gmail.com>
[not found] ` <2026031827-carving-overdrawn-5b1f@gregkh>
2026-03-18 22:56 ` [PATCH] nvmet-tcp: check INIT_FAILED before nvmet_req_uninit in digest error path Shivam Kumar
2026-03-20 7:46 ` Christoph Hellwig
2026-04-05 19:54 ` Shivam Kumar
[not found] ` <CA+ysrSLcSJSgEC7i0GYXPiRY9sbbyEGT60Nh1SJ6PqXWENgSOw@mail.gmail.com>
[not found] ` <CA+ysrSL+3YVGvTKhJ=jL92XZPwjoDpt5D2BK-HeCU975vO-fCQ@mail.gmail.com>
[not found] ` <CA+ysrSJQ2tDiB=zL+GwF=ckhmO62v6GC2bWVUVVDVn7d8Km1mw@mail.gmail.com>
2026-03-25 4:29 ` [Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock Chaitanya Kulkarni
2026-04-05 19:55 ` Shivam Kumar
2026-04-06 4:23 ` Chaitanya Kulkarni
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox