From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from out-184.mta1.migadu.com (out-184.mta1.migadu.com [95.215.58.184]) by mail19.linbit.com (LINBIT Mail Daemon) with ESMTP id 330C142010D for ; Tue, 2 Jul 2024 03:53:30 +0200 (CEST) Subject: Re: [PATCH 01/11] drbd_nl: dont allow detating to be inttrupted in waiting D_DETACHING to DISKLESS To: Philipp Reisner , Dongsheng Yang References: <20240624054619.23212-1-zhengbing.huang@easystack.cn> From: Dongsheng Yang Message-ID: Date: Tue, 2 Jul 2024 09:45:59 +0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Cc: drbd-dev@lists.linbit.com List-Id: "*Coordination* of development, patches, contributions -- *Questions* \(even to developers\) go to drbd-user, please." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 在 2024/7/1 星期一 下午 6:00, Philipp Reisner 写道: > Hi Dongsheng, > > Thanks for all this information! That is already nearly a perfect > commit message. > Still, I am looking for a better approach to solving this problem. > Instead of making the detach uninterruptible, I suggest finding a way > to still schedule the ldev_destroy_work in this corner case. Sounds good, we are willing to do review and test for your patch when it's ready. > > PS: I prefer changes with 100 lines of commit message that touches 3 > lines of code over 3 lines of commit message for 100 lines of code > changes. Totally understood, that's important for discussion and code maintaining. Thanx > > best regards, > Philipp > > On Mon, Jul 1, 2024 at 4:02 AM Dongsheng Yang > wrote: >> >> >> >> 在 2024/6/28 星期五 下午 5:10, Philipp Reisner 写道: >>> Hello Dongsheng, >>> >>> First of all, thanks for contributing patches to us. >>> Please find my reply on the patch below the quote: >>> >>> On Mon, Jun 24, 2024 at 7:52 AM zhengbing.huang >>> wrote: >>>> >>>> From: Dongsheng Yang >>>> >>>> In our network failure and drbd down testing, we found warning in dmesg and drbd down process into D state: >>>> >>>> "kernel: drbd /unregistered/ramtest3/0 drbd103: ASSERTION device->disk_state[NOW] == D_FAILED || device->disk_state[NOW] == D_DETACHING FAILED in go_diskless" >>>> >>>> the problem is the wait_event is inttruptable, it could be intrupted by signal and call drbd_cleanup_device before go_diskless() >>>> >>> >>> In this case, I suggest improving the expression in the assertion. >>> Improving an assertion can also mean removing that assertion. >> >> Hi Philipp, >> This patchset is fixing the problems found by a network failure test >> script[1]. >> The [1/11] is not about just a WARNING, it will result a process with D >> state in wait_event(device->misc_wait, !test_bit(GOING_DISKLESS, >> &device->flags)); in adm_del_minor(). >> >> let's think about this sequence: >> >> a) drbd_adm_down -> adm_detach -> change_disk_state(device, D_DETACHING... >> >> b) it will call put_ldev(), set GOING_DISKLESS and post a work for >> GO_DISKLESS >> >> c) adm_detach() start wait_event_interruptible(device->misc_wait, >> get_disk_state(device) != D_DETACHING); >> but it can be intrrupted, then call drbd_cleanup_device() to set >> device->disk_state[NOW] = D_DISKLESS; >> >> after that, it will go to adm_del_minor() and >> wait_event(device->misc_wait, !test_bit(GOING_DISKLESS, >> &device->flags)); which expects drbd_ldev_destroy to clear GOING_DISKLESS. >> >> d) on the other hand, go_diskless work start and warn on the message in >> commit message. it will do change_disk_state(device, D_DISKLESS, >> CS_HARD, "go-diskless", NULL); But the disk_state[NOW] is already >> D_DISKLESS. So it will not schedule &device->ldev_destroy_work. >> >> As a result, the wait_event in c) will never return. >> >> >> [1]: >> check_drbd_process() { >> ps aux | grep " D"|grep drbd >> } >> >> check_node_2_drbd_process() { >> ssh node-2 'ps aux' | grep " D"|grep drbd >> } >> >> wait_for_no_drbd_d_state() { >> count=0 >> while true; do >> if check_drbd_process; then >> echo "Found drbd process in D state, sleeping for ${count} >> second..." >> sleep 1 >> count=$((count + 1)) >> else >> echo "No drbd process in D state." >> break >> fi >> done >> while true; do >> if check_node_2_drbd_process; then >> echo "Found drbd process in D state, sleeping for ${count} >> second..." >> sleep 1 >> count=$((count + 1)) >> else >> echo "No drbd process in D state." >> break >> fi >> done >> } >> >> random_sleep=$((RANDOM % 100)) >> >> ssh node-2 "ifup Bond2-roce.1469" >> ifup Bond2-roce.1469 >> >> sleep 5 >> >> for i in `seq 0 9`; do >> drbdadm up ramtest${i} >> ssh node-2 "drbdadm up ramtest${i}" >> done >> >> sleep ${random_sleep} >> >> ssh node-2 "ifdown Bond2-roce.1469" >> >> random_sleep=$((RANDOM % 10)) >> >> for i in `seq 0 9`; do >> drbdsetup fail-io ramtest${i} & >> drbdadm down ramtest${i} & >> done >> >> sleep 10 >> >> wait_for_no_drbd_d_state >>> >>> The wait_event_interruptible() is there for a reason. Think of a >>> backing disk that behaves like a tar pit—a backing device that no >>> longer finishes IO requests. You want a way to interrupt the drbdsetup >>> waiting in detach. >>> >>> PS: A bit more elaborative commit messages are welcome. >>> >>> best regards, >>> Philipp >>>