From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:35757
        "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1751307AbcCHV4l (ORCPT );
        Tue, 8 Mar 2016 16:56:41 -0500
Subject: Re: [PATCH 0/11] Update version of write stream ID patchset
To: "Martin K. Petersen"
References: <1457107853-8689-1-git-send-email-axboe@fb.com>
        <56D9F141.9070803@fb.com> <56D9F8E1.2080702@fb.com>
CC: Jeff Moyer
From: Jens Axboe
Message-ID: <56DF4A8F.8070103@fb.com>
Date: Tue, 8 Mar 2016 14:56:31 -0700
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 03/05/2016 01:48 PM, Martin K. Petersen wrote:
>>>>>> "Jens" == Jens Axboe writes:
>
> Jens,
>
>>> OK. I'm still of the opinion that we should try to make this
>>> transparent. I could be swayed by workload descriptions and numbers
>>> comparing approaches, though.
>
> Jens> You can't just wave that flag and not have a solution. Any
> Jens> solution in that space would imply having policy in the kernel. A
> Jens> "just use a stream per file" is never going to work.
>
> I totally understand the desire to have explicit, long-lived
> "from-file-open to file-close" streams for things like database journals
> and whatnot. That is an appealing use case.
> However, I think that you are dismissing the benefits of being able to
> group I/Os to disjoint LBA ranges within a brief period of time as
> belonging to a single file. It's something that we know works well on
> other types of storage. And it's also a much better heuristic for data
> placement on SSDs than just picking the next available bucket. It does
> require some pipelining on the drive but they will need some front end
> logic to handle the proposed stream ID separation in any case.

I'm not a huge fan of heuristics based exclusively on temporal and
spatial locality. Using that as a hint for a case where no stream ID (or
write tag) is given would be an improvement, though. And perhaps parts
of the space should be reserved for just that. But I don't think that
should exclude doing this in a much more managed fashion; personally, I
find that a lot saner than adding this sort of state tracking in the
kernel.

> Also, in our experiments we essentially got the explicit stream ID for
> free by virtue of the journal being written often enough that it was
> rarely if ever evicted as an active stream by the device. With no
> changes whatsoever to any application.

Journal would be an easy one to guess, for sure.

> My gripe with the current stuff is the same as before: The protocol is
> squarely aimed at papering over issues with current flash technology. It
> kinda-sorta works for other types of devices but it is very limiting. I
> appreciate that it is a great fit for the "handful of apps sharing a
> COTS NVMe drive on a cloud server" use case. But I think it is horrible
> for NVMe over Fabrics and pretty much everything else. That wouldn't be
> a big deal if the traditional storage models were going away. But I
> don't think they are...

I don't think erase blocks are going to go away in the near future.
We're going to have better media as well, that's a given, but cheaper
TLC flash is just going to make the current problem much worse.

The patchset is really about tagging the writes with a stream ID,
nothing else. That could potentially be any type of hinting; it's not
exclusive to being used with NVMe write directives at all.

-- 
Jens Axboe