From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C9EECC6FA8E for ; Sun, 5 Mar 2023 11:22:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229580AbjCELWV (ORCPT ); Sun, 5 Mar 2023 06:22:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36968 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229555AbjCELWV (ORCPT ); Sun, 5 Mar 2023 06:22:21 -0500 Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E0F58A5F4; Sun, 5 Mar 2023 03:22:17 -0800 (PST) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 84B8D1FD7C; Sun, 5 Mar 2023 11:22:16 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1678015336; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=AT2Kt/wYAneV6Cu1tgTYC07zTeFUojM/srzHygHKBtU=; b=WRXL6GpifPqmw4W8ELX7M424mNrEcN04dA8F7Qusm6xPxGWWB1phA7oIsLgeWA69iG8G2P OgF8cbg/5i1Eo0g7Cqr9arZ+YUEmQXfSq1dF0De5dSl75BtelmXZ51n++mVidOekQb9R5j nE58/h8IYoCwlvUF8t/HW+gKq0RQg74= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1678015336; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=AT2Kt/wYAneV6Cu1tgTYC07zTeFUojM/srzHygHKBtU=; b=qvdLGlsSkSEZaROxqIHvRBGHmNxo0U7YNlfGJASUf/g/JEFqphjw9rHlEp2OM2iNkqUJjF 2/+v+CVYCJG48SAg== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 1B4091339E; Sun, 5 Mar 2023 11:22:16 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id iMgXBmh7BGTZYAAAMHmgww (envelope-from ); Sun, 05 Mar 2023 11:22:16 +0000 Message-ID: <0b70deae-9fc7-ca33-5737-85d7532b3d33@suse.de> Date: Sun, 5 Mar 2023 12:22:15 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.1 Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations Content-Language: en-US To: Matthew Wilcox Cc: Luis Chamberlain , Keith Busch , Theodore Ts'o , Pankaj Raghav , Daniel Gomez , =?UTF-8?Q?Javier_Gonz=c3=a1lez?= , lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org References: From: Hannes Reinecke In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On 3/4/23 18:54, Matthew Wilcox wrote: > On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote: >> On 3/4/23 17:47, Matthew Wilcox wrote: >>> On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote: >>>> We could implement a (virtual) zoned device, and expose each zone as a >>>> block. That gives us the required large block characteristics, and with >>>> a bit of luck we might be able to dial up to really large block sizes >>>> like the 256M sizes on current SMR drives. >>>> ublk might be a good starting point. >>> >>> Ummmm. Is supporting 256MB block sizes really a desired goal? I suggest >>> that is far past the knee of the curve; if we can only write 256MB chunks >>> as a single entity, we're looking more at a filesystem redesign than we >>> are at making filesystems and the MM support 256MB size blocks. >>> >> Naa, not really. It _would_ be cool as we could get rid of all the cludges >> which have nowadays re sequential writes. >> And, remember, 256M is just a number someone thought to be a good >> compromise. If we end up with a lower number (16M?) we might be able >> to convince the powers that be to change their zone size. >> Heck, with 16M block size there wouldn't be a _need_ for zones in >> the first place. >> >> But yeah, 256M is excessive. Initially I would shoot for something >> like 2M. > > I think we're talking about different things (probably different storage > vendors want different things, or even different people at the same > storage vendor want different things). > > Luis and I are talking about larger LBA sizes. That is, the minimum > read/write size from the block device is 16kB or 64kB or whatever. > In this scenario, the minimum amount of space occupied by a file goes > up from 512 bytes or 4kB to 64kB. That's doable, even if somewhat > suboptimal. > And so do I. One can view zones as really large LBAs. Indeed it might be suboptimal from the OS point of view. But from the device point of view it won't. And, in fact, with devices becoming faster and faster the question is whether sticking with relatively small sectors won't become a limiting factor eventually. > Your concern seems to be more around shingled devices (or their equivalent > in SSD terms) where there are large zones which are append-only, but > you can still random-read 512 byte LBAs. I think there are different > solutions to these problems, and people are working on both of these > problems. > My point being that zones are just there because the I/O stack can only deal with sectors up to 4k. If the I/O stack would be capable of dealing with larger LBAs one could identify a zone with an LBA, and the entire issue of append-only and sequential writes would be moot. Even the entire concept of zones becomes irrelevant as the OS would trivially only write entire zones. > But if storage vendors are really pushing for 256MB LBAs, then that's > going to need a third kind of solution, and I'm not aware of anyone > working on that. What I was saying is that 256M is not set in stone. It's just a compromise vendors used. Even if in the course of development we arrive at a lower number of max LBA we can handle (say, 2MB) I am pretty sure vendors will be quite interested in that. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman