From: Dongsheng Yang <dongsheng.yang@easystack.cn>
To: Philipp Reisner <philipp.reisner@linbit.com>,
"zhengbing.huang" <zhengbing.huang@easystack.cn>
Cc: drbd-dev@lists.linbit.com
Subject: Re: [PATCH 01/11] drbd_nl: dont allow detating to be inttrupted in waiting D_DETACHING to DISKLESS
Date: Mon, 1 Jul 2024 10:02:03 +0800 [thread overview]
Message-ID: <f0faab82-9113-d34b-fbd2-9f817c4166e0@easystack.cn> (raw)
In-Reply-To: <CADGDV=Xo6Z_K2R8vB7+7_jf0U_im0Nmy-xQ36AYh59qi45EvAQ@mail.gmail.com>
在 2024/6/28 星期五 下午 5:10, Philipp Reisner 写道:
> Hello Dongsheng,
>
> First of all, thanks for contributing patches to us.
> Please find my reply on the patch below the quote:
>
> On Mon, Jun 24, 2024 at 7:52 AM zhengbing.huang
> <zhengbing.huang@easystack.cn> wrote:
>>
>> From: Dongsheng Yang <dongsheng.yang@easystack.cn>
>>
>> In our network failure and drbd down testing, we found warning in dmesg and drbd down process into D state:
>>
>> "kernel: drbd /unregistered/ramtest3/0 drbd103: ASSERTION device->disk_state[NOW] == D_FAILED || device->disk_state[NOW] == D_DETACHING FAILED in go_diskless"
>>
>> the problem is the wait_event is inttruptable, it could be intrupted by signal and call drbd_cleanup_device before go_diskless()
>>
>
> In this case, I suggest improving the expression in the assertion.
> Improving an assertion can also mean removing that assertion.
Hi Philipp,
This patchset is fixing the problems found by a network failure test
script[1].
The [1/11] is not about just a WARNING, it will result a process with D
state in wait_event(device->misc_wait, !test_bit(GOING_DISKLESS,
&device->flags)); in adm_del_minor().
let's think about this sequence:
a) drbd_adm_down -> adm_detach -> change_disk_state(device, D_DETACHING...
b) it will call put_ldev(), set GOING_DISKLESS and post a work for
GO_DISKLESS
c) adm_detach() start wait_event_interruptible(device->misc_wait,
get_disk_state(device) != D_DETACHING);
but it can be intrrupted, then call drbd_cleanup_device() to set
device->disk_state[NOW] = D_DISKLESS;
after that, it will go to adm_del_minor() and
wait_event(device->misc_wait, !test_bit(GOING_DISKLESS,
&device->flags)); which expects drbd_ldev_destroy to clear GOING_DISKLESS.
d) on the other hand, go_diskless work start and warn on the message in
commit message. it will do change_disk_state(device, D_DISKLESS,
CS_HARD, "go-diskless", NULL); But the disk_state[NOW] is already
D_DISKLESS. So it will not schedule &device->ldev_destroy_work.
As a result, the wait_event in c) will never return.
[1]:
check_drbd_process() {
ps aux | grep " D"|grep drbd
}
check_node_2_drbd_process() {
ssh node-2 'ps aux' | grep " D"|grep drbd
}
wait_for_no_drbd_d_state() {
count=0
while true; do
if check_drbd_process; then
echo "Found drbd process in D state, sleeping for ${count}
second..."
sleep 1
count=$((count + 1))
else
echo "No drbd process in D state."
break
fi
done
while true; do
if check_node_2_drbd_process; then
echo "Found drbd process in D state, sleeping for ${count}
second..."
sleep 1
count=$((count + 1))
else
echo "No drbd process in D state."
break
fi
done
}
random_sleep=$((RANDOM % 100))
ssh node-2 "ifup Bond2-roce.1469"
ifup Bond2-roce.1469
sleep 5
for i in `seq 0 9`; do
drbdadm up ramtest${i}
ssh node-2 "drbdadm up ramtest${i}"
done
sleep ${random_sleep}
ssh node-2 "ifdown Bond2-roce.1469"
random_sleep=$((RANDOM % 10))
for i in `seq 0 9`; do
drbdsetup fail-io ramtest${i} &
drbdadm down ramtest${i} &
done
sleep 10
wait_for_no_drbd_d_state
>
> The wait_event_interruptible() is there for a reason. Think of a
> backing disk that behaves like a tar pit—a backing device that no
> longer finishes IO requests. You want a way to interrupt the drbdsetup
> waiting in detach.
>
> PS: A bit more elaborative commit messages are welcome.
>
> best regards,
> Philipp
>
next prev parent reply other threads:[~2024-07-01 2:19 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-24 5:46 [PATCH 01/11] drbd_nl: dont allow detating to be inttrupted in waiting D_DETACHING to DISKLESS zhengbing.huang
2024-06-24 5:46 ` [PATCH 02/11] drbd_receiver: get_ldev before use device->ldev for drbd_reconsider_queue_parameters() zhengbing.huang
2024-06-28 9:35 ` Philipp Reisner
2024-06-24 5:46 ` [PATCH 03/11] drbd_transport_rdma: put kref for cm in dtr_path_established in error path zhengbing.huang
2024-06-28 9:40 ` Philipp Reisner
2024-07-01 2:07 ` Dongsheng Yang
2024-07-01 2:48 ` Dongsheng Yang
2024-10-16 16:44 ` Philipp Reisner
2024-10-17 6:42 ` Zhengbing
2024-06-24 5:46 ` [PATCH 04/11] drbd_transport_rdma: dont schedule retry_connect_work in active is false zhengbing.huang
2024-06-28 11:51 ` Philipp Reisner
2024-07-01 2:11 ` Dongsheng Yang
2024-06-24 5:46 ` [PATCH 05/11] drbd_transport_rdma: dont break in dtr_tx_cq_event_handler if (cm->state != DSM_CONNECTED) zhengbing.huang
2024-06-28 12:07 ` Philipp Reisner
2024-07-01 2:23 ` Dongsheng Yang
2024-06-24 5:46 ` [PATCH 06/11] drbd_transport_rdma: put kref in error path zhengbing.huang
2024-06-28 12:12 ` Philipp Reisner
2024-06-24 5:46 ` [PATCH 07/11] drbd_transport_rdma: put kref in dtr_remap_tx_desc error zhengbing.huang
2024-06-28 12:19 ` Philipp Reisner
2024-07-01 2:28 ` Dongsheng Yang
2024-06-24 5:46 ` [PATCH 08/11] drbd_transport_rdma: fix a race between dtr_connect and drbd_thread_stop zhengbing.huang
2024-06-28 12:36 ` Philipp Reisner
2024-07-01 2:30 ` Dongsheng Yang
2024-06-24 5:46 ` [PATCH 09/11] drbd_transport_rdma: introduce timeout for rdma_disocnnect zhengbing.huang
2024-06-24 5:46 ` [PATCH 10/11] drbd_transport_rdma: introduce timeout for rdma_connect zhengbing.huang
2024-06-24 5:46 ` [PATCH 11/11] drbd_transport_rdma: wake up state_wq after clear DSB_CONNECTED in dtr_tx_timeout_work_fn zhengbing.huang
2024-06-28 9:10 ` [PATCH 01/11] drbd_nl: dont allow detating to be inttrupted in waiting D_DETACHING to DISKLESS Philipp Reisner
2024-07-01 2:02 ` Dongsheng Yang [this message]
2024-07-01 10:00 ` Philipp Reisner
2024-07-02 1:45 ` Dongsheng Yang
2024-07-03 14:31 ` [PATCH] drbd: make drbd_adm_detach() interruptible Philipp Reisner
2024-07-04 2:59 ` Zhengbing
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f0faab82-9113-d34b-fbd2-9f817c4166e0@easystack.cn \
--to=dongsheng.yang@easystack.cn \
--cc=drbd-dev@lists.linbit.com \
--cc=philipp.reisner@linbit.com \
--cc=zhengbing.huang@easystack.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox