From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id ED58CC433E2 for ; Fri, 28 Aug 2020 07:48:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BB19020DD4 for ; Fri, 28 Aug 2020 07:48:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="KRqMDleA" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726825AbgH1HsT (ORCPT ); Fri, 28 Aug 2020 03:48:19 -0400 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]:43351 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728094AbgH1HsR (ORCPT ); Fri, 28 Aug 2020 03:48:17 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1598600895; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=R1PMnHtImmvXAZiOtq3DLz2D9wh4Y/dbknKxUEMgm8s=; b=KRqMDleAfm5yhA87b9IfP8wCfkCDcekSw1guj4stIeQjiQGN1N+OoOmddYMFfhvldOrAQ4 WhM9QBsV0LSMmETGLSyO74SyecO4h9H1JkZrLlS5X1O4gJLxuMzNDlEsSYpuHF4guqoX7c MNM5rtTma0Plx73s31gZi0sH55xlVpQ= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-209-P4RIeJ64O_22SghSmhqcEw-1; Fri, 28 Aug 2020 03:48:11 -0400 X-MC-Unique: P4RIeJ64O_22SghSmhqcEw-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id A080B1074653; Fri, 28 Aug 2020 07:48:10 +0000 (UTC) Received: from T590 (ovpn-13-119.pek2.redhat.com [10.72.13.119]) by smtp.corp.redhat.com (Postfix) with ESMTPS id E1090747BD; Fri, 28 Aug 2020 07:48:05 +0000 (UTC) Date: Fri, 28 Aug 2020 15:48:01 +0800 From: Ming Lei To: Jens Axboe Cc: "linux-block@vger.kernel.org" Subject: Re: [PATCH RFC] block: defer task/vm accounting until successful Message-ID: <20200828074801.GA224560@T590> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On Thu, Aug 27, 2020 at 08:41:30PM -0600, Jens Axboe wrote: > We currently increment the task/vm counts when we first attempt to queue a > bio. But this isn't necessarily correct - if the request allocation fails > with -EAGAIN, for example, and the caller retries, then we'll over-account > by as many retries as are done. > > This can happen for polled IO, where we cannot wait for requests. Hence > retries can get aggressive, if we're running out of requests. If this > happens, then watching the IO rates in vmstat are incorrect as they count > every issue attempt as successful and hence the stats are inflated by > quite a lot potentially. > > Abstract out the accounting function, and move the blk-mq accounting into > blk_mq_make_request() when we know we got a request. For the non-mq path, > just do it when the bio is queued. > > Signed-off-by: Jens Axboe > > --- > > diff --git a/block/blk-core.c b/block/blk-core.c > index d9d632639bd1..aabd016faf79 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -28,7 +28,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -1113,6 +1112,8 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio) > struct bio_list bio_list_on_stack[2]; > blk_qc_t ret = BLK_QC_T_NONE; > > + blk_account_bio(bio); > + > BUG_ON(bio->bi_next); > > bio_list_init(&bio_list_on_stack[0]); > @@ -1232,35 +1233,6 @@ blk_qc_t submit_bio(struct bio *bio) > if (blkcg_punt_bio_submit(bio)) > return BLK_QC_T_NONE; > > - /* > - * If it's a regular read/write or a barrier with data attached, > - * go through the normal accounting stuff before submission. > - */ > - if (bio_has_data(bio)) { > - unsigned int count; > - > - if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME)) > - count = queue_logical_block_size(bio->bi_disk->queue) >> 9; > - else > - count = bio_sectors(bio); > - > - if (op_is_write(bio_op(bio))) { > - count_vm_events(PGPGOUT, count); > - } else { > - task_io_account_read(bio->bi_iter.bi_size); > - count_vm_events(PGPGIN, count); > - } > - > - if (unlikely(block_dump)) { > - char b[BDEVNAME_SIZE]; > - printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n", > - current->comm, task_pid_nr(current), > - op_is_write(bio_op(bio)) ? "WRITE" : "READ", > - (unsigned long long)bio->bi_iter.bi_sector, > - bio_devname(bio, b), count); > - } > - } > - > /* > * If we're reading data that is part of the userspace workingset, count > * submission time as memory stall. When the device is congested, or > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 0015a1892153..b77c66dfc19c 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > > #include > > @@ -2111,6 +2112,39 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq) > } > } > > +void blk_account_bio(struct bio *bio) > +{ > + unsigned int count; > + > + /* > + * If it's a regular read/write or a barrier with data attached, > + * go through the normal accounting stuff before submission. > + */ > + if (unlikely(!bio_has_data(bio))) > + return; > + > + if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME)) > + count = queue_logical_block_size(bio->bi_disk->queue) >> 9; > + else > + count = bio_sectors(bio); > + > + if (op_is_write(bio_op(bio))) { > + count_vm_events(PGPGOUT, count); > + } else { > + task_io_account_read(bio->bi_iter.bi_size); > + count_vm_events(PGPGIN, count); > + } > + > + if (unlikely(block_dump)) { > + char b[BDEVNAME_SIZE]; > + printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n", > + current->comm, task_pid_nr(current), > + op_is_write(bio_op(bio)) ? "WRITE" : "READ", > + (unsigned long long)bio->bi_iter.bi_sector, > + bio_devname(bio, b), count); > + } > +} > + > /** > * blk_mq_submit_bio - Create and send a request to block device. > * @bio: Bio pointer. > @@ -2165,6 +2199,7 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio) > goto queue_exit; > } > > + blk_account_bio(bio); bio merged to plug list or sched will not be accounted in this way. Thanks, Ming