Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Hannes Reinecke <hare@suse.de>
To: Luis Chamberlain <mcgrof@kernel.org>,
	Matthew Wilcox <willy@infradead.org>
Cc: "Keith Busch" <kbusch@kernel.org>,
	"Theodore Ts'o" <tytso@mit.edu>,
	"Pankaj Raghav" <p.raghav@samsung.com>,
	"Daniel Gomez" <da.gomez@samsung.com>,
	"Javier González" <javier.gonz@samsung.com>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-block@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations
Date: Sat, 4 Mar 2023 12:08:36 +0100	[thread overview]
Message-ID: <c9f6544d-1731-4a73-a926-0e85ae9da9df@suse.de> (raw)
In-Reply-To: <ZAJqjM6qLrraFrrn@bombadil.infradead.org>

On 3/3/23 22:45, Luis Chamberlain wrote:
> On Fri, Mar 03, 2023 at 03:49:29AM +0000, Matthew Wilcox wrote:
>> On Thu, Mar 02, 2023 at 06:58:58PM -0700, Keith Busch wrote:
>>> That said, I was hoping you were going to suggest supporting 16k logical block
>>> sizes. Not a problem on some arch's, but still problematic when PAGE_SIZE is
>>> 4k. :)
>>
>> I was hoping Luis was going to propose a session on LBA size > PAGE_SIZE.
>> Funnily, while the pressure is coming from the storage vendors, I don't
>> think there's any work to be done in the storage layers.  It's purely
>> a FS+MM problem.
> 
> You'd hope most of it is left to FS + MM, but I'm not yet sure that's
> quite it yet. Initial experimentation shows just enabling > PAGE_SIZE
> physical & logical block NVMe devices gets brought down to 512 bytes.
> That seems odd to say the least. Would changing this be an issue now?
> 
> I'm gathering there is generic interest in this topic though. So one
> thing we *could* do is perhaps review lay-of-the-land of interest and
> break down what we all think are things likely could be done / needed.
> At the very least we can come out together knowing the unknowns together.
> 
> I started to think about some of these things a while ago and with the
> help of Willy I tried to break down some of the items I gathered from him
> into community OKRs (super informal itemization of goals and sub tasks which
> would complete such goals) and started trying to take a stab at them
> with our team, but obviously I think it would be great if we all just
> divide & and conquer here. So maybe reviewing these and extending them
> as a community would be good:
> 
> https://kernelnewbies.org/KernelProjects/large-block-size
> 
> I'm recently interested in tmpfs so will be taking a stab at higher
> order page size support there to see what blows up.
> 
Cool.

> The other stuff like general IOMAP conversion is pretty well known, and
> we already I think have a proposed session on that. But there is also
> even smaller fish to fry, like *just* doing a baseline with some
> filesystems with 4 KiB block size seems in order.
> 
> Hearing filesystem developer's thoughts on support for larger block
> size in light of lower order PAGE_SIZE would be good, given one of the
> odd situations some distributions / teams find themselves in is trying
> to support larger block sizes but with difficult access to higher
> PAGE_SIZE systems. Are there ways to simplify this / help us in general?
> Without it's a bit hard to muck around with some of this in terms of
> support long term. This also got me thinking about ways to try to replicate
> larger IO virtual devices a bit better too. While paying a cloud
> provider to test this is one nice option, it'd be great if I can just do
> this in house with some hacks too. For virtio-blk-pci at least, for instance,
> I wondered whether using just the host page cache suffices, or would a 4K
> page cache on the host modify say a 16 k emualated io controller results
> significantly? How do we most effectively virtualize 16k controllers
> in-house?
> 
> To help with experimenting with large io and NVMe / virtio-blk-pci I
> recented added support to intantiate tons of large IO devices to kdevops
> [0], with it it should be easy to reproduce odd issues we may come up
> with. For instnace it should be possible to subsequently extend the
> kdevops fstests or blktests automation support with just a few Kconfig files
> to use some of these largio devices to see what blows up.
> 
We could implement a (virtual) zoned device, and expose each zone as a 
block. That gives us the required large block characteristics, and with
a bit of luck we might be able to dial up to really large block sizes
like the 256M sizes on current SMR drives.
ublk might be a good starting point.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman

next prev parent reply	other threads:[~2023-03-04 11:08 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-01  3:52 [LSF/MM/BPF TOPIC] Cloud storage optimizations Theodore Ts'o
2023-03-01  4:18 ` Gao Xiang
2023-03-01  4:40   ` Matthew Wilcox
2023-03-01  4:59     ` Gao Xiang
2023-03-01  4:35 ` Matthew Wilcox
2023-03-01  4:49   ` Gao Xiang
2023-03-01  5:01     ` Matthew Wilcox
2023-03-01  5:09       ` Gao Xiang
2023-03-01  5:19         ` Gao Xiang
2023-03-01  5:42         ` Matthew Wilcox
2023-03-01  5:51           ` Gao Xiang
2023-03-01  6:00             ` Gao Xiang
2023-03-02  3:13 ` Chaitanya Kulkarni
2023-03-02  3:50 ` Darrick J. Wong
2023-03-03  3:03   ` Martin K. Petersen
2023-03-02 20:30 ` Bart Van Assche
2023-03-03  3:05   ` Martin K. Petersen
2023-03-03  1:58 ` Keith Busch
2023-03-03  3:49   ` Matthew Wilcox
2023-03-03 11:32     ` Hannes Reinecke
2023-03-03 13:11     ` James Bottomley
2023-03-04  7:34       ` Matthew Wilcox
2023-03-04 13:41         ` James Bottomley
2023-03-04 16:39           ` Matthew Wilcox
2023-03-05  4:15             ` Luis Chamberlain
2023-03-05  5:02               ` Matthew Wilcox
2023-03-08  6:11                 ` Luis Chamberlain
2023-03-08  7:59                   ` Dave Chinner
2023-03-06 12:04               ` Hannes Reinecke
2023-03-06  3:50             ` James Bottomley
2023-03-04 19:04         ` Luis Chamberlain
2023-03-03 21:45     ` Luis Chamberlain
2023-03-03 22:07       ` Keith Busch
2023-03-03 22:14         ` Luis Chamberlain
2023-03-03 22:32           ` Keith Busch
2023-03-03 23:09             ` Luis Chamberlain
2023-03-16 15:29             ` Pankaj Raghav
2023-03-16 15:41               ` Pankaj Raghav
2023-03-03 23:51       ` Bart Van Assche
2023-03-04 11:08       ` Hannes Reinecke [this message]
2023-03-04 13:24         ` Javier González
2023-03-04 16:47         ` Matthew Wilcox
2023-03-04 17:17           ` Hannes Reinecke
2023-03-04 17:54             ` Matthew Wilcox
2023-03-04 18:53               ` Luis Chamberlain
2023-03-05  3:06               ` Damien Le Moal
2023-03-05 11:22               ` Hannes Reinecke
2023-03-06  8:23                 ` Matthew Wilcox
2023-03-06 10:05                   ` Hannes Reinecke
2023-03-06 16:12                   ` Theodore Ts'o
2023-03-08 17:53                     ` Matthew Wilcox
2023-03-08 18:13                       ` James Bottomley
2023-03-09  8:04                         ` Javier González
2023-03-09 13:11                           ` James Bottomley
2023-03-09 14:05                             ` Keith Busch
2023-03-09 15:23                             ` Martin K. Petersen
2023-03-09 20:49                               ` James Bottomley
2023-03-09 21:13                                 ` Luis Chamberlain
2023-03-09 21:28                                   ` Martin K. Petersen
2023-03-10  1:16                                     ` Dan Helmick
2023-03-10  7:59                             ` Javier González
2023-03-08 19:35                 ` Luis Chamberlain
2023-03-08 19:55                 ` Bart Van Assche
2023-03-03  2:54 ` Martin K. Petersen
2023-03-03  3:29   ` Keith Busch
2023-03-03  4:20   ` Theodore Ts'o
  -- strict thread matches above, loose matches on Subject: below --
2023-07-16  4:09 BELINDA Goodpaster kelly
2025-09-22 17:49 Belinda R Goodpaster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c9f6544d-1731-4a73-a926-0e85ae9da9df@suse.de \
    --to=hare@suse.de \
    --cc=da.gomez@samsung.com \
    --cc=javier.gonz@samsung.com \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=mcgrof@kernel.org \
    --cc=p.raghav@samsung.com \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).