From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:41397 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1030898AbcCRATL (ORCPT ); Thu, 17 Mar 2016 20:19:11 -0400
Subject: Re: [PATCH 0/11] Update version of write stream ID patchset
To: Dan Williams
References: <1457107853-8689-1-git-send-email-axboe@fb.com> <56D9F141.9070803@fb.com> <56D9F8E1.2080702@fb.com> <56DF4A8F.8070103@fb.com>
CC: "Martin K. Petersen", Jeff Moyer, linux-fsdevel, Christoph Hellwig, Andreas Dilger
From: Jens Axboe
Message-ID: <56EB4963.9030704@fb.com>
Date: Thu, 17 Mar 2016 17:18:43 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-fsdevel-owner@vger.kernel.org

On 03/17/2016 04:43 PM, Dan Williams wrote:
> On Tue, Mar 8, 2016 at 1:56 PM, Jens Axboe wrote:
>> On 03/05/2016 01:48 PM, Martin K. Petersen wrote:
>>>>>>>>
>>>>>>>> "Jens" == Jens Axboe writes:
>>>
>>>
>>> Jens,
>>>
>>>>> OK. I'm still of the opinion that we should try to make this
>>>>> transparent. I could be swayed by workload descriptions and numbers
>>>>> comparing approaches, though.
>>>
>>>
>>> Jens> You can't just waive that flag and not have a solution. Any
>>> Jens> solution in that space would imply having policy in the kernel. A
>>> Jens> "just use a stream per file" is never going to work.
>>>
>>> I totally understand the desire to have explicit, long-lived
>>> "from-file-open to file-close" streams for things like database journals
>>> and whatnot.
>>
>>
>> That is an appealing use case.
>>
>>> However, I think that you are dismissing the benefits of being able to
>>> group I/Os to disjoint LBA ranges within a brief period of time as
>>> belonging to a single file. It's something that we know works well on
>>> other types of storage. And it's also a much better heuristic for data
>>> placement on SSDs than just picking the next available bucket. It does
>>> require some pipelining on the drive but they will need some front end
>>> logic to handle the proposed stream ID separation in any case.
>>
>>
>> I'm not a huge fan of heuristics based exclusively around the temporal and
>> spacial locality. Using that as a hint for a case where no stream ID (or
>> write tag) is given would be an improvement, though. And perhaps parts of
>> the space should be reserved to just that.
>>
>> But I don't think that should exclude doing this in a much more managed
>> fashion, personally I find that a lot saner than adding this sort of state
>> tracking in the kernel.
>>
>>> Also, in our experiments we essentially got the explicit stream ID for
>>> free by virtue of the journal being written often enough that it was
>>> rarely if ever evicted as an active stream by the device. With no
>>> changes whatsoever to any application.
>>
>>
>> Journal would be an easy one to guess, for sure.
>>
>>> My gripe with the current stuff is the same as before: The protocol is
>>> squarely aimed at papering over issues with current flash technology. It
>>> kinda-sorta works for other types of devices but it is very limiting. I
>>> appreciate that it is a great fit for the "handful of apps sharing a
>>> COTS NVMe drive on a cloud server" use case. But I think it is horrible
>>> for NVMe over Fabrics and pretty much everything else. That wouldn't be
>>> a big deal if the traditional storage models were going away. But I
>>> don't think they are...
>>
>>
>> I don't think erase blocks are going to go away in the near future. We're
>> going to have better media as well, that's a given, but cheaper TLC flash is
>> just going to make the current problem much worse. The patchset is really
>> about tagging the writes with a stream ID, nothing else. That could
>> potentially be any type of hinting, it's not exclusive to being used with
>> NVMe write directives at all.
>>
>
> Maybe I'm misunderstanding, but why does stream-id imply anything more
> than just "opaque tag set at the top of the stack that makes it down
> to a driver". Sure NVMe can interpret these as NVMe streams, but any
> other driver can have its own transport specific translation of what
> the hint means. I think the minute the opaque number requires
> specific driver behavior we'll fall into a rat hole of how to
> translate intent across usages.
>
> In other words, I think it will always be the case that the hint has
> application + transport/driver meaning, but otherwise the kernel is
> just a conduit.

You are not missing anything, that's exactly how it is intended, and
that's how the interface is designed as well. If you want to tie extra
meaning to this for specific drivers or transports, that's fine, and
there's nothing that prevents that from happening. As it stands, and as
it is proposed, it's just a write tag/stream ID that we can set on a
file or inode, and have passed to the driver. Nothing more, nothing less.

-- 
Jens Axboe