Date: Wed, 14 Oct 2020 17:49:06 +0800
From: Ming Lei
To: JeffleXu
Cc: axboe@kernel.dk, linux-block@vger.kernel.org, joseph.qi@linux.alibaba.com,
 xiaoguang.wang@linux.alibaba.com, Christoph Hellwig
Subject: Re: [PATCH] block: set NOWAIT for sync polling
Message-ID: <20201014094906.GD775684@T590>
References: <20201013084051.27255-1-jefflexu@linux.alibaba.com>
 <20201013120913.GA614668@T590>
 <17357ee1-7662-f20b-0a49-2fb3fdf01ebc@linux.alibaba.com>
In-Reply-To: <17357ee1-7662-f20b-0a49-2fb3fdf01ebc@linux.alibaba.com>
List-ID: linux-block@vger.kernel.org

On Wed, Oct 14, 2020 at 04:31:49PM +0800, JeffleXu wrote:
> What about just disabling HIPRI in preadv2(2)/pwritev2(2)? Christoph Hellwig
> disabled HIPRI for libaio in
> commit 154989e45fd8de9bfb52bbd6e5ea763e437e54c5 ("aio: clear IOCB_HIPRI").

Then people will complain that poll can't be used for sync dio, and it is a
regression.
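For reference, "poll for sync dio" here is the userspace pattern below:
preadv2(2) on an O_DIRECT fd with RWF_HIPRI, which is exactly what would
regress if HIPRI were disabled for the sync path. A minimal sketch; the
device path and buffer size are arbitrary placeholders:

```
/* Minimal sketch of polled sync direct IO; needs glibc >= 2.26 for preadv2 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	/* any direct-IO capable block device works; this path is made up */
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, 4096)) { /* O_DIRECT wants aligned buffers */
		close(fd);
		return 1;
	}

	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* RWF_HIPRI asks the kernel to poll for completion: the path at issue */
	ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (ret < 0)
		perror("preadv2");

	free(buf);
	close(fd);
	return 0;
}
```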
> What do you think, @Christoph?
>
> (cc Christoph Hellwig)
>
> >  static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
> >  {
> > -	bio->bi_opf |= REQ_HIPRI;
> > -	if (!is_sync_kiocb(kiocb))
> > -		bio->bi_opf |= REQ_NOWAIT;
> > +	bio->bi_opf |= REQ_HIPRI | REQ_NOWAIT;
> >  }
>
> The original patch indeed could not fix the problem. Though it could fix
> the potential deadlock, the VFS read(2)/write(2) code is not ready to
> handle the returned -EAGAIN gracefully: currently read(2)/write(2) will
> just return -EAGAIN to user space.
>
> On 10/13/20 8:09 PM, Ming Lei wrote:
> > On Tue, Oct 13, 2020 at 04:40:51PM +0800, Jeffle Xu wrote:
> > > Sync polling also needs the REQ_NOWAIT flag. One sync read/write may
> > > be split into several bios (and thus several requests), and can
> > > sometimes use up the queue depth. Thus a following bio in the same
> > > sync read/write will wait for a usable request if REQ_NOWAIT is not
> > > set, in which case the following sync polling will cause a deadlock.
> > >
> > > One case (maybe the only case) for the above situation is
> > > preadv2/pwritev2 + direct + highpri. Two conditions need to be
> > > satisfied to trigger the deadlock.
> > >
> > > 1. HIPRI IO in the sync routine. Normal
> > > read(2)/pread(2)/readv(2)/preadv(2) and the corresponding write
> > > family syscalls don't support high-priority IO and thus won't trigger
> > > the polling routine. Only preadv2(2)/pwritev2(2) support
> > > high-priority IO, via the RWF_HIPRI flag in the @flags parameter.
> > >
> > > 2. Polling support in the sync routine. Currently both the blkdev and
> > > iomap-based fs (ext4/xfs, etc.) direct IO routines support polling.
> > > The general routine is as follows.
> > >
> > > submit_bio
> > >     waits in blk_mq_get_tag() for a request to complete, which should
> > >     be done by the following polling, thus causing a deadlock.
> >
> > Another blocking point is rq_qos_throttle(),
>
> What is the issue here in rq_qos_throttle()? More details?

Like request allocation, rq_qos_throttle() may wait until rq completion is
done; see wbt_wait() and wbt_done().

> > so I guess falling back to REQ_NOWAIT may not fix the issue completely.
> >
> > Given iopoll isn't supposed to be used in case of big IO, another
> > solution may be to disable iopoll when bio splitting is needed,
> > something like the following change:
> >
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index bcf5e4580603..8e762215660b 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -279,6 +279,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> >  		return NULL;
> >  split:
> >  	*segs = nsegs;
> > +
> > +	/*
> > +	 * bio splitting may cause more trouble for iopoll, which isn't
> > +	 * supposed to be used in case of big IO anyway
> > +	 */
> > +	bio->bi_opf &= ~REQ_HIPRI;
> >  	return bio_split(bio, sectors, GFP_NOIO, bs);
> >  }
>
> Actually splitting is not only done in blk_mq_submit_bio->__blk_queue_split.
> In __blkdev_direct_IO, one input iov_iter can be split into several bios:
>
> ```
> __blkdev_direct_IO:
>
> for (;;) {
>         ret = bio_iov_iter_get_pages(bio, iter);
>         submit_bio(bio);
> }
>
> for (;;) {
>         blk_poll();
>         ...
> }
> ```

As Jens mentioned, it is weird to poll on multiple bios, so we can simply
disable io poll in __blkdev_direct_IO(). And it is reasonable from the
user's viewpoint not to poll on a big bio, since io poll is mostly used in
latency-sensitive cases.
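A rough sketch of that idea, reusing the shape of the submission loop quoted
above (an illustration, not an actual patch: iocb, iter, bio, ret and qc are
assumed to be the surrounding function's locals, and error handling is
omitted):

```
/* Sketch only: drop the polling hint once a second bio turns out to be needed */
bool do_poll = iocb->ki_flags & IOCB_HIPRI;

for (;;) {
	ret = bio_iov_iter_get_pages(bio, iter);

	if (iov_iter_count(iter)) {
		/*
		 * More bios will follow: polling several bios from one
		 * sync call risks the tag-wait deadlock described above,
		 * so complete the whole request by interrupt instead.
		 */
		do_poll = false;
		iocb->ki_flags &= ~IOCB_HIPRI;
	}

	if (do_poll)
		bio_set_polled(bio, iocb);	/* sets REQ_HIPRI */

	qc = submit_bio(bio);

	if (!iov_iter_count(iter))
		break;
	bio = bio_alloc(GFP_KERNEL, iov_iter_npages(iter, BIO_MAX_PAGES));
}
```

With something like this, blk_poll() only ever runs for single-bio requests,
which also matches the latency argument: big, split IO falls back to
interrupt-driven completion.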
> Since one single bio can contain at most BIO_MAX_PAGES, i.e. 256 bio_vecs
> in @bio->bi_io_vec, if the @iovcnt parameter of preadv2(2)/pwritev2(2) is
> larger than 256, then one call of preadv2(2)/pwritev2(2) can be split into
> several bios. These bios are submitted at once, and then sync polling
> starts in the process context.
>
> If the number of bios split from one preadv2(2)/pwritev2(2) call is larger
> than the queue depth, the bios from this single call can exhaust the queue
> depth and thus cause a deadlock.
>
> Fortunately the maximum of the @iovcnt parameter of preadv2(2)/pwritev2(2)
> is UIO_MAXIOV, i.e. 1024, and the minimum queue depth is BLKDEV_MIN_RQ,
> i.e. 4. That means one preadv2(2)/pwritev2(2) call can submit at most
> 1024 / 256 = 4 bios, which fills up the queue depth exactly, so there is
> no deadlock in this case. I'm not sure whether the values of
> UIO_MAXIOV/BIO_MAX_PAGES/BLKDEV_MIN_RQ are coincidental or deliberately
> tuned. At least it will not cause a deadlock currently, though the
> constraint may be a little fragile.
>
> By the way, this patch could fix the potential hang I mentioned in
> https://patchwork.kernel.org/project/linux-block/patch/20200911032958.125068-1-jefflexu@linux.alibaba.com/

Right, I remembered that race.

Thanks,
Ming