From: Ming Lei
To: Will Deacon
Cc: Peter Zijlstra, Christoph Hellwig, Jens Axboe, linux-block@vger.kernel.org,
	John Garry, Bart Van Assche, Hannes Reinecke, Thomas Gleixner,
	paulmck@kernel.org
Subject: Re: [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
Date: Thu, 30 Apr 2020 22:02:54 +0800
Message-ID: <20200430140254.GA996887@T590>
In-Reply-To: <20200430110429.GI19932@willie-the-truck>
References: <20200425154832.GA16004@lst.de>
 <20200428155837.GA16910@hirez.programming.kicks-ass.net>
 <20200429021612.GD671522@T590>
 <20200429080728.GB29143@willie-the-truck>
 <20200429094616.GB700644@T590>
 <20200429122757.GA30247@willie-the-truck>
 <20200429134327.GC700644@T590>
 <20200429173400.GC30247@willie-the-truck>
 <20200430003945.GA719313@T590>
 <20200430110429.GI19932@willie-the-truck>

On Thu, Apr 30, 2020 at 12:04:29PM +0100, Will Deacon wrote:
> On Thu, Apr 30, 2020 at 08:39:45AM +0800, Ming Lei wrote:
> > On Wed, Apr 29, 2020 at 06:34:01PM +0100, Will Deacon wrote:
> > > On Wed, Apr 29, 2020 at 09:43:27PM +0800, Ming Lei wrote:
> > > > Please see the following two code paths:
> > > >
> > > > [1] code path1:
> > > >     blk_mq_hctx_notify_offline():
> > > >         set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> > > >
> > > >         smp_mb() or smp_mb_after_atomic()
> > > >
> > > >         blk_mq_hctx_drain_inflight_rqs():
> > > >             blk_mq_tags_inflight_rqs()
> > > >                 rq = hctx->tags->rqs[index]
> > > >                 and
> > > >                 READ rq->tag
> > > >
> > > > [2] code path2:
> > > >     blk_mq_get_driver_tag():
> > > >
> > > >         process might be migrated to other CPU here and chance is small,
> > > >         then the follow code will be run on CPU different with code path1
> > > >
> > > >         rq->tag = rq->internal_tag;
> > > >         hctx->tags->rqs[rq->tag] = rq;
> > >
> > > I /think/ this can be distilled to the SB litmus test:
> > >
> > >     // blk_mq_hctx_notify_offline()    blk_mq_get_driver_tag();
> > >     Wstate = INACTIVE                  Wtag
> > >     smp_mb()                           smp_mb()
> > >     Rtag                               Rstate
> > >
> > > and you want to make sure that either blk_mq_get_driver_tag() sees the
> > > state as INACTIVE and does the cleanup, or it doesn't and
> > > blk_mq_hctx_notify_offline() sees the newly written tag and waits for the
> > > request to complete (I don't get how that happens, but hey).
> > >
> > > Is that right?
> >
> > Yeah, exactly.
> >
> > >
> > > >     barrier() in case that code path2 is run on same CPU with code path1
> > > >     OR
> > > >     smp_mb() in case that code path2 is run on different CPU with code path1 because
> > > >     of process migration
> > > >
> > > >     test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state)
> > >
> > > Couldn't you just check this at the start of blk_mq_get_driver_tag() as
> > > well, and then make the smp_mb() unconditional?
> >
> > As I mentioned, the chance for the current process(calling
> > blk_mq_get_driver_tag()) migration is very small, we do want to
> > avoid the extra smp_mb() in the fast path.
>
> Hmm, but your suggestion of checking 'rq->mq_ctx->cpu' only works if that
> is the same CPU on which blk_mq_hctx_notify_offline() executes. What
> provides that guarantee?

BLK_MQ_S_INACTIVE is only set when the last CPU of this hctx is becoming
offline, and blk_mq_hctx_notify_offline() is called from the CPU hotplug
handler. So if any request of this hctx is submitted from somewhere, it
has to be submitted from that last CPU; that is guaranteed by blk-mq's
queue mapping.

In the case of direct issue, blk_mq_get_driver_tag() basically runs right
after the request is allocated, which is why I mentioned that the chance
of migration is very small.

>
> If there's any chance of this thing being concurrent, then you need the

The only chance is that the process running blk_mq_get_driver_tag() is
migrated to another CPU in the case of direct issue, and we do add
smp_mb() for that case.

> barrier there just in case. So I'd say you either need to prevent the race,
> or live with the barrier. Do you have numbers to show how expensive it is?

Not yet, but we can easily avoid it in the very fast path, so why not do
so? Especially since most of the time preemption won't happen at all.

Also, this patch itself is correct, and disabling preemption via get_cpu()
as suggested by Christoph isn't needed either, because migration implies
smp_mb(). I will document this point in the next version.

Thanks,
Ming
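
P.S. Spelling out the SB pattern Will distilled above as a herd7 litmus test
may help; the following is only a sketch in the style of the SB tests under
tools/memory-model/litmus-tests/, where "state" and "tag" are illustrative
stand-ins for the INACTIVE bit and rq->tag, not the real structures. With
smp_mb() on both sides, the "exists" outcome is forbidden, which is exactly
the guarantee relied on here:

C blk-mq-hctx-offline-sb

(*
 * P0 stands in for blk_mq_hctx_notify_offline(): set the INACTIVE state,
 * full barrier, then read the tag.  P1 stands in for
 * blk_mq_get_driver_tag(): publish the tag, full barrier, then test the
 * INACTIVE state.  The exists clause asks whether both sides can miss the
 * other's store; herd7 should report "Never" for this outcome.
 *)

{}

P0(int *state, int *tag)
{
        int r0;

        WRITE_ONCE(*state, 1);
        smp_mb();
        r0 = READ_ONCE(*tag);
}

P1(int *state, int *tag)
{
        int r1;

        WRITE_ONCE(*tag, 1);
        smp_mb();
        r1 = READ_ONCE(*state);
}

exists (0:r0=0 /\ 1:r1=0)

It can be checked with "herd7 -conf linux-kernel.cfg" from the kernel's
tools/memory-model directory.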