Date: Mon, 11 Nov 2013 20:31:47 +0100
From: Jan Kara
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131111193147.GC24867@quack.suse.cz>
In-Reply-To: <20131111032211.GT6188@dastard>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
 <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca>
 <20131105041245.GY6188@dastard> <20131107134806.GB30832@quack.suse.cz>
 <20131111032211.GT6188@dastard>
To: Dave Chinner
Cc: Jan Kara, Andreas Dilger, "Artem S. Tashkinov", Wu Fengguang,
 Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel,
 Jens Axboe, linux-mm

On Mon 11-11-13 14:22:11, Dave Chinner wrote:
> On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> > On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Realistically, there is no "one right answer" for all combinations
> > > of applications, filesystems and hardware, but writeback caching is
> > > the best *general solution* we've got right now.
> > >
> > > However, IMO users should not need to care about tuning BDI dirty
> > > ratios or even have to understand what a BDI dirty ratio is to
> > > select the right caching method for their devices and/or workload.
> > > The difference between writeback and write through caching is easy
> > > to explain and AFAICT those two modes suffice to solve the problems
> > > being discussed here. Further, if two modes suffice to solve the
> > > problems, then we should be able to easily define a trigger to
> > > automatically switch modes.
> > >
> > > /me notes that if we look at random vs sequential IO and the impact
> > > that has on writeback duration, then it's very similar to suddenly
> > > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > > switch an *inode* to write through mode rather than writeback mode
> > > to solve the problem of aggregating massive amounts of random write
> > > IO in the page cache...
> >   I disagree here. Writeback caching is also useful for aggregating
> > random writes and turning them into semi-sequential writes. There are
> > quite a few applications which rely on the fact that they can write a
> > file in a rather random manner (Berkeley DB, the linker, ...) while the
> > file still gets written out in one large linear sweep. That is actually
> > the reason why SLES (and I believe RHEL as well) tune dirty_limit even
> > higher than the default value.
>
> Right - but the correct behaviour really depends on the pattern of
> randomness. The common case we get into trouble with is when no
> clustering occurs and we end up with small, random IO for gigabytes
> of cached data. That's the case where write-through caching for
> random data is better.
>
> It's also questionable whether writeback caching for aggregation is
> faster for random IO on high-IOPS devices or not. Again, I think it
> would depend very much on how random the patterns are...
  I agree that the usefulness of writeback caching for random IO very much
depends on the working set size vs the cache size, on how random the
accesses really are, and on the HW characteristics. I just wanted to point
out that there are fairly common workloads & setups where writeback caching
for semi-random IO really helps (because you seemed to suggest that random
IO implies we should disable the writeback cache).
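
  Just to be concrete about the interface being discussed: the hint such an
automatic switch would key off is the ordinary posix_fadvise() call that
applications can already issue today; the only new part would be the kernel
reacting to it on the write side by treating the inode as write-through.
A minimal userspace sketch (illustration only, nothing kernel-side implied):

#define _XOPEN_SOURCE 600       /* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int fd, err;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /*
         * Declare that the whole file will be accessed randomly.
         * Today this mainly disables readahead; honouring it on the
         * write side (treating the inode as write-through) would be
         * the new behaviour discussed above.
         */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
        if (err)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        /* ... the application's random writes would go here ... */

        close(fd);
        return 0;
}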

> > So I think it's rather the other way around: If you can detect the file
> > is being written in a streaming manner, there's not much point in
> > caching too much data for it.
>
> But we're not talking about how much data we cache here - we are
> considering how much data we allow to get dirty before writing it
> back.
  Sorry, I was imprecise here. I really meant that IMO it doesn't make sense
to allow too much dirty data for sequentially written files.

> It doesn't matter if we use writeback or write through
> caching, the page cache footprint for a given workload is likely to
> be similar, but without any data we can't draw any conclusions here.
>
> > And I agree with you that we also have to be careful not to cache too
> > little because otherwise two streaming writes would be interleaved too
> > much. Currently, we have writeback_chunk_size() which determines how
> > much we ask to write from a single inode. So streaming writers are going
> > to be interleaved at this chunk size anyway (currently that number is
> > "measured bandwidth / 2"). So it would make sense to also limit the
> > amount of dirty cache for each file with a streaming pattern to this
> > number.
>
> My experience says that for streaming IO we typically need at least
> 5s of cached *dirty* data to even out delays and latencies in the
> writeback IO pipeline. Hence limiting a file to what we can write in
> a second, given we might only write a file once a second, is likely
> going to result in pipeline stalls...
  I guess this begs for real data. We agree in principle but differ in the
constants :).

> Remember, writeback caching is about maximising throughput, not
> minimising latency. The "sync latency" problem with caching too much
> dirty data on slow block devices is really a corner case behaviour
> and should not compromise the common case for bulk writeback
> throughput.
  Agreed. As a primary goal we want to maximise throughput. But we want to
maintain sane latency as well (e.g. because we have a "promise" of
"dirty_writeback_centisecs" we have to cycle through dirty inodes reasonably
frequently).

> > Agreed. But the ability to limit the amount of dirty pages outstanding
> > against a particular BDI seems like a sane one to me. It's not as
> > flexible and automatic as the approach you suggested but it's much
> > simpler and solves most of the problems we currently have.
>
> That's true, but....
>
> > The biggest objection against the sysfs-tunable approach is that most
> > people won't have a clue, meaning that the tunable is useless for them.
>
> .... that's the big problem I see - nobody is going to know how to
> use it, when to use it, or be able to tell if it's the root cause of
> some weird performance problem they are seeing.
>
> > But I wonder if something like:
> > 1) turn on strictlimit by default
> > 2) don't allow the dirty cache of a BDI to grow over 5s of measured
> >    writeback speed
> > won't go a long way towards solving our current problems without too
> > much complication...
>
> Turning on strict limit by default is going to change behaviour
> quite markedly. Again, it's not something I'd want to see done
> without a bunch of data showing that it doesn't cause regressions
> for common workloads...
  Agreed.
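
  To make 2) a bit more concrete, what I have in mind is essentially the
following (a userspace-style sketch with made-up names, not kernel code;
the real thing would live in the writeback code and reuse the per-BDI
bandwidth estimate we already maintain):

#include <stdio.h>

/*
 * Proposed per-BDI dirty limit: N seconds worth of measured writeback
 * bandwidth, never exceeding the global dirty limit.
 */
static unsigned long long
bdi_dirty_limit_bytes(unsigned long long bw_bytes_per_sec,
                      unsigned int window_secs,
                      unsigned long long global_limit)
{
        unsigned long long limit = bw_bytes_per_sec * window_secs;

        return limit < global_limit ? limit : global_limit;
}

int main(void)
{
        unsigned long long global_limit = 4ULL << 30;   /* say, 4 GB globally */

        /* A local disk doing ~100 MB/s gets ~500 MB of dirty cache... */
        printf("disk:      %llu MB\n",
               bdi_dirty_limit_bytes(100ULL << 20, 5, global_limit) >> 20);
        /* ...while a ~5 MB/s device (e.g. a USB stick) is capped at ~25 MB. */
        printf("usb stick: %llu MB\n",
               bdi_dirty_limit_bytes(5ULL << 20, 5, global_limit) >> 20);
        return 0;
}

With a 5s window a slow device is capped at tens of megabytes of dirty data
while a fast local disk still keeps hundreds of megabytes of writeback
cache, which bounds the sync latency problem above without hurting bulk
throughput.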

								Honza
-- 
Jan Kara
SUSE Labs, CR