Subject: Re: [RFC PATCH] bcache: enable zoned device support
To: Damien Le Moal, Hannes Reinecke, Eric Wheeler
Cc: "linux-bcache@vger.kernel.org", "linux-block@vger.kernel.org", Hannes Reinecke
References: <20191205152543.73885-1-colyli@suse.de> <66345af3-fad6-3079-1604-3b0e9d2779ce@suse.de>
From: Coly Li
Organization: SUSE Labs
Message-ID: <9bcbcb7b-4620-359a-b7b0-63366688a523@suse.de>
Date: Fri, 6 Dec 2019 16:29:49 +0800
X-Mailing-List: linux-block@vger.kernel.org

On 2019/12/6 3:42
PM, Damien Le Moal wrote:
> On 2019/12/06 16:09, Hannes Reinecke wrote:
>> On 12/6/19 5:37 AM, Coly Li wrote:
>>> On 2019/12/6 8:30 AM, Damien Le Moal wrote:
>>>> On 2019/12/06 9:22, Eric Wheeler wrote:
>>>>> On Thu, 5 Dec 2019, Coly Li wrote:
>>>>>> This is very basic zoned device support. With this patch, a bcache
>>>>>> device is able to,
>>>>>> - Export zoned device attributes via sysfs
>>>>>> - Respond to report zones requests, e.g. by command 'blkzone report'
>>>>>> But the bcache device is still NOT able to,
>>>>>> - Respond to any zoned device management request or IOCTL command
>>>>>>
>>>>>> Here are the tests I have done,
>>>>>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>>>>>> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>>>>>>   including all zone types.
>>>>>> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>>>>>>   in sectors.
>>>>>> - run 'blkzone report /dev/bcache0', all zones information displayed.
>>>>>> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>>>>>>   information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>>>>>>   Operation not supported"
>>>>>> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>>>>>>   values updated.
>>>>>>
>>>>>> All of these are very basic tests, if you have better testing
>>>>>> tools or cases, please give me a hint.
>>>>>
>>>>> Interesting.
>>>>>
>>>>> 1. should_writeback() could benefit by hinting true when an IO would fall
>>>>>    in a zoned region.
>>>>>
>>>>> 2. The writeback thread could writeback such that they prefer
>>>>>    fully(mostly)-populated zones when choosing what to write out.
>>>>
>>>> That definitely would be a good idea since that would certainly benefit
>>>> backend-GC (that will be needed).
>>>>
>>>> However, I do not see the point in exposing the /dev/bcacheX block
>>>> device itself as a zoned disk.
>>>> In fact, I think we want exactly the
>>>> opposite: expose it as a regular disk so that any FS or application can
>>>> run. If the bcache backend disk is zoned, then the writeback handles
>>>> sequential writes. This would be in the end a solution similar to
>>>> dm-zoned, that is, a zoned disk becomes usable as a regular block
>>>> device (random writes anywhere are possible), but likely far more
>>>> efficient and faster. That may result in imposing some limitations on
>>>> bcache operations though, e.g. it can only be set up with writeback, no
>>>> writethrough allowed (not sure though...).
>>>> Thoughts ?
>>>>
>>>
>>> I have come to realize this is really the opposite idea. Let me try to
>>> explain what I understand, please correct me if I am wrong. The idea you
>>> proposed is to make bcache act as something like an FTL for the
>>> backend zoned SMR drive, that is, for all random writes, bcache may
>>> convert them into sequential writes onto the backend zoned SMR drive. In
>>> the meantime, if there is hot data, bcache continues to act as a
>>> caching device to accelerate read requests.
>>>
>>> Yes, if I understand your proposal correctly, writeback mode might be
>>> mandatory and backend-GC will be needed. The idea is interesting, it
>>> looks like adding a log-structured storage layer between the current bcache
>>> B+tree indexing and the zoned SMR hard drive.
>>>
>> Well, not sure if that's required.
>>
>> Or, to be correct, we actually have _two_ use-cases:
>> 1) Have a SMR drive as a backing device. This was my primary goal for
>> handling these devices, as SMR devices are typically not _that_ fast.
>> (Damien once proudly reported getting the incredible speed of 1 IOPS :-)
>
> Yes, it can get to that with dm-zoned if one goes crazy with sustained
> random writes :) The physical drive itself does a lot more than 1 iops
> in that case though and is as fast as any other HDD.
> But from the DM
> logical drive side, the user can sometimes fall into the 1 iops
> territory for really nasty workloads. Tests for well behaved users like
> f2fs show that SMR and regular HDDs are on par for performance.
>
>> So having bcache running on top of those will be a clear win.
>> But in this scenario the cache device will be a normal device (typically
>> an SSD), and we shouldn't need much modification here.
>
> I agree. That should work mostly as is since the user will be zone aware
> and already be issuing sequential writes. bcache write-through only
> needs to follow the same pattern, not reordering any write, and
> write-back only has to replay the same.
>
>> In fact, a good testcase would be the btrfs patches which got posted
>> earlier this week. With them you should be able to create a btrfs
>> filesystem on the SMR drive, and use an SSD as a cache device.
>> Getting this scenario to run would indeed be my primary goal, and I
>> guess your patches should be more or less sufficient for that.
>
> + Will need the zone revalidation and zone type & write lock bitmaps to
> prevent reordering from the block IO stack, unless bcache is a BIO
> driver ? My knowledge of bcache is limited. Would need to look into the
> details a little more to be able to comment.

Hi Damien,

Bcache is a bio-based driver: it splits and clones bios, and submits
them via generic_make_request() to the underlying block layer code.

So zone revalidation and zone type & write lock bitmaps are unnecessary
for bcache ?

>
>> 2) Using a SMR drive as a _cache_ device. This seems to be contrary to
>> the above statement of SMR drive not being fast, but then the NVMe WG is
>> working on a similar mechanism for flash devices called 'ZNS' (zoned
>> namespaces). And for those it really would make sense to have bcache
>> being able to handle zoned devices as a cache device.
>> But this is to my understanding really in the early stages, with no real
>> hardware being available.
>> Damien might disagree, though :-)
>
> Yes, that would be another potential use case and ZNS indeed could fit
> this model, assuming that zone sizes align (multiples) between front and
> back devices.
>
>> And the implementation is still in the works on the linux side, so it's
>> more of a long-term goal.
>
>> But the first use-case is definitely something we should be looking at;
>> SMR drives are available _and_ with large capacity, so any speedup there
>> would be greatly appreciated.
>
> Yes. And what I was talking about in my earlier email is actually a
> third use case:
> 3) SMR drive as backend + regular SSD as frontend and the resulting
> bcache device advertising itself as a regular disk, hiding all the zone
> & sequential write constraint to the user. Since bcache already has some
> form of indirection table for cached blocks, I thought we could hijack
> this to implement a sort of FTL that would allow serializing random
> writes to the backend with the help of the frontend as a write staging
> buffer. Doing so, we get full random write capability with the benefit
> of "hot" blocks staying in the cache. But again, not knowing enough
> details about bcache, I may be talking too lightly here. Not sure if
> that is reasonably easily feasible with the current bcache code.

There are three addresses involved in the above proposal.
1) User space LBA address: the LBA of the block device formed by
   combining bcache and the SMR drive.
2) Cache device LBA address: where the randomly written cached data
   blocks are stored on the SSD.
3) SMR drive LBA address: where the sequentially written data blocks
   are stored on the zoned SMR drive.

Therefore we need at least two mapping layers to connect these 3
addresses together. The single mapping currently provided by the bcache
B+tree is not enough.

Maybe stacking the bcache backing device on top of a dm-zoned target is
a solution for proposal 3), let me try whether it works.

-- 
Coly Li
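
To illustrate the two mapping layers described above, here is a toy
sketch in Python. It is not bcache code and reflects no existing
implementation: the class TwoLayerMap, its methods and ZONE_BLOCKS are
hypothetical names, plain dicts stand in for the bcache B+tree and for
the extra SMR mapping, and cache eviction and garbage collection are
left out.

```python
# Toy model only: plain dicts stand in for the bcache B+tree and for the
# additional mapping layer. None of these names exist in bcache.

ZONE_BLOCKS = 4  # hypothetical zone size, in blocks


class TwoLayerMap:
    def __init__(self, nr_zones):
        self.cache_map = {}  # layer 1: user LBA -> SSD LBA (staged random writes)
        self.smr_map = {}    # layer 2: user LBA -> SMR LBA (written sequentially)
        self.next_ssd = 0    # next free SSD block (no eviction in this toy)
        self.wptr = 0        # one global write pointer across all zones
        self.capacity = nr_zones * ZONE_BLOCKS

    def write(self, user_lba):
        """Random write from the user: stage it on the SSD."""
        if user_lba not in self.cache_map:
            self.cache_map[user_lba] = self.next_ssd
            self.next_ssd += 1
        return self.cache_map[user_lba]

    def writeback(self):
        """Flush staged blocks to the SMR drive, only at the write pointer."""
        for user_lba in sorted(self.cache_map):
            if self.wptr >= self.capacity:
                raise RuntimeError("no free zone left; GC would be needed")
            # Any older SMR location of this block becomes garbage.
            self.smr_map[user_lba] = self.wptr
            self.wptr += 1
        self.cache_map.clear()

    def lookup(self, user_lba):
        """Resolve a user LBA: staged copy on the SSD first, then the SMR drive."""
        if user_lba in self.cache_map:
            return ("ssd", self.cache_map[user_lba])
        if user_lba in self.smr_map:
            return ("smr", self.smr_map[user_lba])
        return None
```

In this sketch a rewritten block is appended at the write pointer instead
of being overwritten in place, which is why the second mapping (and,
eventually, garbage collection of the stale locations) is needed.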