From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Artem S. Tashkinov"
Subject: Disabling in-memory write cache for x86-64 in Linux II
Date: Fri, 25 Oct 2013 07:25:13 +0000 (UTC)
Message-ID: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Return-path:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Hello!

On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
is that the x86-64 kernel has the following problem:

When I copy large files to any storage device, be it my HDD with ext4 partitions
or a flash drive with FAT32 partitions, the kernel first caches them in memory entirely
then flushes them some time later (quite unpredictably though) or immediately upon
invoking "sync".

How can I disable this memory cache altogether (or at least minimize caching)? When
running the i686 kernel with the same configuration I don't observe this effect - files get
written out almost immediately (for instance "sync" takes less than a second, whereas
on x86-64 it can take a dozen _minutes_ depending on the file size and storage
performance).

I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
- firstly this command is detrimental to the performance of my PC, secondly, it won't help
in this instance.

Swap is totally disabled, usually my memory is entirely free.

My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531

Please, advise.

Best regards,

Artem

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Linus Torvalds
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Fri, 25 Oct 2013 09:18:49 +0100
Message-ID:
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Cc: Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm
To: "Artem S. Tashkinov", Wu Fengguang, Andrew Morton
Return-path:
In-Reply-To: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote:
>
> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
> is that the x86-64 kernel has the following problem:
>
> When I copy large files to any storage device, be it my HDD with ext4 partitions
> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
> then flushes them some time later (quite unpredictably though) or immediately upon
> invoking "sync".

Yeah, I think we default to a 10% "dirty background memory" (and allow
up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB of
dirty memory for writeout before we even start writing, and twice that
before we start *waiting* for it.

On 32-bit x86, we only count the memory in the low 1GB (really
actually up to about 890MB), so "10% dirty" really means just about
90MB of buffering (and a "hard limit" of ~180MB of dirty).

And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
come from the old days of less memory (and perhaps servers that don't
much care), and the fact that x86-32 ends up having much lower limits
even if you end up having more memory.

You can easily tune it:

    echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes
    echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes

or similar.
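As a rough check of the arithmetic above, the default thresholds can be worked out in shell. This is only a sketch: `dirty_thresholds_mb` is a made-up helper name, and it treats the whole memory size as dirtyable, whereas the kernel applies the percentages to free plus reclaimable memory, so real limits come out somewhat lower.

```shell
#!/bin/sh
# Hypothetical helper (not a kernel interface): print the background and
# hard dirty thresholds, in MB, for a given amount of dirtyable memory,
# assuming the 10% / 20% defaults quoted in the message above.
dirty_thresholds_mb() {
    mem_mb=$1
    background_mb=$((mem_mb * 10 / 100))   # background writeback starts here
    hard_mb=$((mem_mb * 20 / 100))         # writers start waiting here
    echo "$background_mb $hard_mb"
}

dirty_thresholds_mb 16384   # 64-bit kernel, 16GB of RAM
dirty_thresholds_mb 890     # 32-bit x86, only about 890MB of lowmem counted
```

The two calls reproduce the contrast described above: gigabytes of buffering on the 64-bit kernel against well under 200MB on the 32-bit one.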
But you're right, we need to make the defaults much saner.

Wu? Andrew? Comments?

Linus

From mboxrd@z Thu Jan 1 00:00:00 1970
From: NeilBrown
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Fri, 25 Oct 2013 21:49:52 +1100
Message-ID: <20131025214952.3eb41201@notabene.brown>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/o0p+.MsNjNOi=r6AgUlpcoW"; protocol="application/pgp-signature"
Cc: linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: "Artem S. Tashkinov"
Return-path:
In-Reply-To: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

--Sig_/o0p+.MsNjNOi=r6AgUlpcoW
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov" wrote:

> Hello!
>
> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
> is that the x86-64 kernel has the following problem:
>
> When I copy large files to any storage device, be it my HDD with ext4 partitions
> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
> then flushes them some time later (quite unpredictably though) or immediately upon
> invoking "sync".
>
> How can I disable this memory cache altogether (or at least minimize caching)?
> When running the i686 kernel with the same configuration I don't observe this effect - files get
> written out almost immediately (for instance "sync" takes less than a second, whereas
> on x86-64 it can take a dozen of _minutes_ depending on a file size and storage
> performance).

What exactly is bothering you about this? The amount of memory used or the
time until data is flushed?

If the latter, then /proc/sys/vm/dirty_expire_centisecs is where you want to
look.
This defaults to 30 seconds (3000 centisecs).
You could make it smaller (providing you also shrink
dirty_writeback_centisecs in a similar ratio) and the VM will flush out data
more quickly.

NeilBrown

>
> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
> - firstly this command is detrimental to the performance of my PC, secondly, it won't help
> in this instance.
>
> Swap is totally disabled, usually my memory is entirely free.
>
> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
>
> Please, advise.
>
> Best regards,
>
> Artem
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--Sig_/o0p+.MsNjNOi=r6AgUlpcoW--

From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Lang
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Fri, 25 Oct 2013 04:26:37 -0700 (PDT)
Message-ID:
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: "Artem S.
Tashkinov", linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: NeilBrown
Return-path:
In-Reply-To: <20131025214952.3eb41201@notabene.brown>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Fri, 25 Oct 2013, NeilBrown wrote:

> On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov"
> wrote:
>
>> Hello!
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel
>> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me
>> is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4 partitions
>> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely
>> then flushes them some time later (quite unpredictably though) or immediately upon
>> invoking "sync".
>>
>> How can I disable this memory cache altogether (or at least minimize caching)? When
>> running the i686 kernel with the same configuration I don't observe this effect - files get
>> written out almost immediately (for instance "sync" takes less than a second, whereas
>> on x86-64 it can take a dozen of _minutes_ depending on a file size and storage
>> performance).
>
> What exactly is bothering you about this? The amount of memory used or the
> time until data is flushed?

actually, I think the problem is more the impact of the huge write later on.

David Lang

> If the later, then /proc/sys/vm/dirty_expire_centisecs is where you want to
> look.
> This defaults to 30 seconds (3000 centisecs).
> You could make it smaller (providing you also shrink
> dirty_writeback_centisecs in a similar ratio) and the VM will flush out data
> more quickly.
>
> NeilBrown
>
>
>>
>> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX)
>> - firstly this command is detrimental to the performance of my PC, secondly, it won't help
>> in this instance.
>>
>> Swap is totally disabled, usually my memory is entirely free.
>>
>> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531
>>
>> Please, advise.
>>
>> Best regards,
>>
>> Artem
>
>

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Artem S. Tashkinov"
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Fri, 25 Oct 2013 18:26:23 +0000 (UTC)
Message-ID: <154617470.12445.1382725583671.JavaMail.mail@webmail11>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: david@lang.hm
Return-path:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Oct 25, 2013 05:26:45 PM, david wrote:
On Fri, 25 Oct 2013, NeilBrown wrote:
>
>>
>> What exactly is bothering you about this? The amount of memory used or the
>> time until data is flushed?
>
>actually, I think the problem is more the impact of the huge write later on.

Exactly. And not being able to use applications which show you IO performance
like Midnight Commander.
You might prefer to use "cp -a" but I cannot imagine
my life without being able to see the progress of a copying operation. With the current
dirty cache there's no way to understand how your storage media actually behaves.

Hopefully this issue won't dissolve into obscurity and someone will actually make
up a plan (and a patch) how to make dirty write cache behave in a sane manner
considering the fact that there are devices with very different write speeds and
requirements. It'd be even better, if I could specify dirty cache as a mount option
(though sane defaults or semi-automatic values based on runtime estimates
won't hurt).

Per device dirty cache seems like a nice idea, I, for one, would like to disable it
altogether or make it an absolute minimum for things like USB flash drives - because
I don't care about multithreaded performance or delayed allocation on such devices -
I'm interested in my data reaching my USB stick ASAP - because it's how most people
use them.

Regards,

Artem

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Diego Calleja
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Fri, 25 Oct 2013 21:40:13 +0200
Message-ID: <1999200.Zdacx0scmY@diego-arch>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <154617470.12445.1382725583671.JavaMail.mail@webmail11>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Cc: david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: "Artem S.
Tashkinov"
Return-path:
In-Reply-To: <154617470.12445.1382725583671.JavaMail.mail@webmail11>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Friday, 25 October 2013 18:26:23 Artem S. Tashkinov wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> >actually, I think the problem is more the impact of the huge write later
> >on.
> Exactly. And not being able to use applications which show you IO
> performance like Midnight Commander. You might prefer to use "cp -a" but I
> cannot imagine my life without being able to see the progress of a copying
> operation. With the current dirty cache there's no way to understand how
> you storage media actually behaves.

This is a problem I also have been suffering for a long time. It's not so much
how much and when the system syncs dirty data, but how unresponsive the
desktop becomes when it happens (usually, with rsync + large files). Most
programs become completely unresponsive, especially if they have a large memory
consumption (i.e. the browser). I need to pause rsync and wait until the
system writes out all dirty data if I want to do simple things like scrolling
or do any action that uses I/O, otherwise I need to wait minutes.

I have 16 GB of RAM and excluding the browser (which usually uses about half
of a GB) and KDE itself, there are no memory hogs, so it seems like it's
something that shouldn't happen. I can understand that I/O operations are
laggy when there is some other intensive I/O ongoing, but right now the system
becomes completely unresponsive. If I am unlucky and Konsole also becomes
unresponsive, I need to switch to a VT (which also takes time).

I haven't reported it before in part because I didn't know how to do it, "my
browser stalls" is not a very useful description and I didn't know what kind
of data I'm supposed to report.
From mboxrd@z Thu Jan 1 00:00:00 1970
From: NeilBrown
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Sat, 26 Oct 2013 07:43:49 +1100
Message-ID: <20131026074349.0adc9646@notabene.brown>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/Ruxvo/UHlXxrQJ/hv4uQd02"; protocol="application/pgp-signature"
Cc: david@lang.hm, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: "Artem S. Tashkinov"
Return-path:
In-Reply-To: <154617470.12445.1382725583671.JavaMail.mail@webmail11>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

--Sig_/Ruxvo/UHlXxrQJ/hv4uQd02
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" wrote:

> Oct 25, 2013 05:26:45 PM, david wrote:
> On Fri, 25 Oct 2013, NeilBrown wrote:
> >
> >>
> >> What exactly is bothering you about this? The amount of memory used or the
> >> time until data is flushed?
> >
> >actually, I think the problem is more the impact of the huge write later on.
>
> Exactly. And not being able to use applications which show you IO performance
> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> my life without being able to see the progress of a copying operation. With the current
> dirty cache there's no way to understand how you storage media actually behaves.

So fix Midnight Commander.
If you want the copy to be actually finished when
it says it is finished, then it needs to call 'fsync()' at the end.

>
> Hopefully this issue won't dissolve into obscurity and someone will actually make
> up a plan (and a patch) how to make dirty write cache behave in a sane manner
> considering the fact that there are devices with very different write speeds and
> requirements. It'd be ever better, if I could specify dirty cache as a mount option
> (though sane defaults or semi-automatic values based on runtime estimates
> won't hurt).
>
> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
> altogether or make it an absolute minimum for things like USB flash drives - because
> I don't care about multithreaded performance or delayed allocation on such devices -
> I'm interested in my data reaching my USB stick ASAP - because it's how most people
> use them.
>

As has already been said, you can substantially disable the cache by tuning
down various values in /proc/sys/vm/.
Have you tried?
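For a copy that should only report completion once the data has reached the device, the fsync() suggestion above can be approximated from the shell. A minimal sketch, assuming GNU coreutils 8.24 or later, where `sync` accepts file arguments and flushes just those files; `copy_and_flush` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC to the file path DST, then ask the kernel to
# flush DST to stable storage before returning.
# Assumes GNU coreutils >= 8.24, where "sync FILE" syncs that file only.
# DST must name the destination file itself, not its directory.
copy_and_flush() {
    src=$1
    dst=$2
    cp -- "$src" "$dst" || return 1
    sync "$dst"
}

# Example (paths are placeholders):
#   copy_and_flush big.iso /mnt/usb/big.iso
```

With GNU dd, `conv=fsync` has a comparable effect for a single output file: dd flushes the output just before it exits.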
NeilBrown

--Sig_/Ruxvo/UHlXxrQJ/hv4uQd02--

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Artem S.
Tashkinov"
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Fri, 25 Oct 2013 21:03:44 +0000 (UTC)
Message-ID: <476525596.14731.1382735024280.JavaMail.mail@webmail11>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <20131026074349.0adc9646@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: david@lang.hm, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: neilb@suse.de
Return-path:
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

Oct 26, 2013 02:44:07 AM, neil wrote:
On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
>>
>> Exactly. And not being able to use applications which show you IO performance
>> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
>> my life without being able to see the progress of a copying operation. With the current
>> dirty cache there's no way to understand how you storage media actually behaves.
>
>So fix Midnight Commander. If you want the copy to be actually finished when
>it says it is finished, then it needs to call 'fsync()' at the end.

This sounds like a very bad joke. How are applications supposed to show and
calculate an _average_ write speed if there are no kernel calls/ioctls to actually
make the kernel flush dirty buffers _during_ copying? Actually it's a good way to
solve this problem in user space - alas, even if such calls are implemented, user
space will start using them only in 2018 if not further from that.
>>
>> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
>> altogether or make it an absolute minimum for things like USB flash drives - because
>> I don't care about multithreaded performance or delayed allocation on such devices -
>> I'm interested in my data reaching my USB stick ASAP - because it's how most people
>> use them.
>>
>
>As has already been said, you can substantially disable the cache by tuning
>down various values in /proc/sys/vm/.
>Have you tried?

I don't understand who you are replying to. I asked about per device settings, you are
again referring me to system wide settings - they don't look that good if we're talking
about a 3MB/sec flash drive and 500MB/sec SSD drive. Besides it makes no sense
to allocate 20% of physical RAM for things which don't belong to it in the first place.

I don't know any other OS which has a similar behaviour.

And like people (including me) have already mentioned, such a huge dirty cache can
stall their PCs/servers for a considerable amount of time.

Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also
not everyone in this world has a UPS - which means such a huge buffer can lead to
serious data loss in case of a power blackout.

Regards,

Artem
From mboxrd@z Thu Jan 1 00:00:00 1970
From: NeilBrown
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Sat, 26 Oct 2013 09:11:12 +1100
Message-ID: <20131026091112.241da260@notabene.brown>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <20131026074349.0adc9646@notabene.brown> <476525596.14731.1382735024280.JavaMail.mail@webmail11>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/rrWLFApImeDIgsMleg656mr"; protocol="application/pgp-signature"
Cc: david@lang.hm, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: "Artem S. Tashkinov"
Return-path:
In-Reply-To: <476525596.14731.1382735024280.JavaMail.mail@webmail11>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

--Sig_/rrWLFApImeDIgsMleg656mr
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: quoted-printable

On Fri, 25 Oct 2013 21:03:44 +0000 (UTC) "Artem S. Tashkinov" wrote:

> Oct 26, 2013 02:44:07 AM, neil wrote:
> On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov"
> >>
> >> Exactly. And not being able to use applications which show you IO performance
> >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine
> >> my life without being able to see the progress of a copying operation. With the current
> >> dirty cache there's no way to understand how you storage media actually behaves.
> >
> >So fix Midnight Commander. If you want the copy to be actually finished when
> >it says it is finished, then it needs to call 'fsync()' at the end.
>
> This sounds like a very bad joke.
> How applications are supposed to show and
> calculate an _average_ write speed if there are no kernel calls/ioctls to actually
> make the kernel flush dirty buffers _during_ copying? Actually it's a good way to
> solve this problem in user space - alas, even if such calls are implemented, user
> space will start using them only in 2018 if not further from that.

But there is a way to flush dirty buffers *during* copies.

  man 2 sync_file_range

If giving precise feedback is of paramount importance to you, then this would
be the interface to use.

>
> >>
> >> Per device dirty cache seems like a nice idea, I, for one, would like to disable it
> >> altogether or make it an absolute minimum for things like USB flash drives - because
> >> I don't care about multithreaded performance or delayed allocation on such devices -
> >> I'm interested in my data reaching my USB stick ASAP - because it's how most people
> >> use them.
> >>
> >
> >As has already been said, you can substantially disable the cache by tuning
> >down various values in /proc/sys/vm/.
> >Have you tried?
>
> I don't understand who you are replying to. I asked about per device settings, you are
> again referring me to system wide settings - they don't look that good if we're talking
> about a 3MB/sec flash drive and 500MB/sec SSD drive. Besides it makes no sense
> to allocate 20% of physical RAM for things which don't belong to it in the first place.

Sorry, missed the per-device bit.
You could try playing with
  /sys/class/bdi/XX:YY/max_ratio

where XX:YY is the major/minor number of the device, so 8:0 for /dev/sda.
Wind it right down for slow devices and you might get something like what you
want.

>
> I don't know any other OS which has a similar behaviour.

I don't know about the internal details of any other OS, so I cannot really
comment.
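To make the XX:YY part of the max_ratio suggestion concrete, the major:minor pair can be read from sysfs rather than looked up by hand. A minimal sketch: `bdi_max_ratio_path` is a hypothetical helper, and the optional second argument (an alternative sysfs root) exists only so the function can be tried without real hardware.

```shell
#!/bin/sh
# Hypothetical helper: print the per-device max_ratio file for a whole-disk
# kernel name such as "sda". The MAJOR:MINOR pair is read from
# /sys/block/<name>/dev, which holds it as plain text (e.g. "8:0").
bdi_max_ratio_path() {
    dev=$1
    sysfs=${2:-/sys}    # overridable root, an assumption made for testing
    devno=$(cat "$sysfs/block/$dev/dev") || return 1
    echo "$sysfs/class/bdi/$devno/max_ratio"
}

# Example, as root, capping a slow USB stick that shows up as sdb
# (the value 1 is only an illustration):
#   echo 1 > "$(bdi_max_ratio_path sdb)"
```

The setting is not persistent, so it would need to be reapplied after each plug-in or reboot.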
>
> And like people (including me) have already mentioned, such a huge dirty cache can
> stall their PCs/servers for a considerable amount of time.

Yes. But this is a different issue.
There are two very different issues that should be kept separate.

One is that when "cp" or similar complete, the data hasn't all been written out
yet. It typically takes another 30 seconds before the flush will complete.
You seemed to primarily complain about this, so that is what I originally
addressed. That is where the "dirty_*_centisecs" values apply.

The other, quite separate, issue is that Linux will cache more dirty data
than it can write out in a reasonable time. All the tuning parameters refer
to the amount of data (whether as a percentage of RAM or as a number of
bytes), but what people really care about is a number of seconds.

As you might imagine, estimating how long it will take to write out a certain
amount of data is highly non-trivial. The relationship between megabytes and
seconds can be non-linear and can change over time.

Caching nothing at all can hurt a lot of workloads. Caching too much can
obviously hurt too. Caching "5 seconds" worth of data would be ideal, but
would be incredibly difficult to implement.
It is possible that keeping a sliding estimate of device throughput for each
device would be possible, and using that to automatically adjust the
"max_ratio" value (or some related internal thing) might be a 70% solution.

Certainly it would be an interesting project for someone.

>
> Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also
> not everyone in this world has an UPS - which means such a huge buffer can lead to a
> serious data loss in case of a power blackout.

I don't have a desk (just a lap), but I use Linux on all my computers and
I've never really noticed the problem. Maybe I'm just very patient, or maybe
I don't work with large data sets and slow devices.
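The "seconds rather than bytes" point above can be put into numbers with a little shell arithmetic. This is only an illustration with made-up helper names (`dirty_budget_bytes`, `drain_seconds`); it assumes a constant device throughput, which the message above notes does not hold in practice.

```shell
#!/bin/sh
# Hypothetical helpers, assuming a fixed throughput in MB/s.

# Bytes of dirty data that correspond to SECONDS of writeback.
dirty_budget_bytes() {
    throughput_mb=$1
    seconds=$2
    echo $((throughput_mb * seconds * 1024 * 1024))
}

# Seconds needed to write out DIRTY_MB megabytes of dirty data.
drain_seconds() {
    dirty_mb=$1
    throughput_mb=$2
    echo $((dirty_mb / throughput_mb))
}

dirty_budget_bytes 3 5      # 3MB/s flash drive, 5 seconds of data
dirty_budget_bytes 500 5    # 500MB/s SSD, 5 seconds of data
drain_seconds 3276 3        # 20% of 16GB draining to the flash drive
```

Under these assumptions a five-second budget is about 15MB for the flash drive and about 2.5GB for the SSD, while a full 20% of 16GB would take roughly 18 minutes to drain at 3MB/s, which is in line with the multi-minute "sync" reported at the start of the thread.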
However I don't think data-loss is really a related issue. Any process that
cares about data safety *must* use fsync at appropriate places. This has
always been true.

NeilBrown

>
> Regards,
>
> Artem

--Sig_/rrWLFApImeDIgsMleg656mr--

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Fengguang Wu
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Sat, 26 Oct 2013 00:32:25 +0100
Message-ID: <20131025233225.GA32051@localhost>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <1999200.Zdacx0scmY@diego-arch>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: "Artem S.
Tashkinov", david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: Diego Calleja
Return-path:
Content-Disposition: inline
In-Reply-To: <1999200.Zdacx0scmY@diego-arch>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Fri, Oct 25, 2013 at 09:40:13PM +0200, Diego Calleja wrote:
> On Friday, 25 October 2013 18:26:23 Artem S. Tashkinov wrote:
> > Oct 25, 2013 05:26:45 PM, david wrote:
> > >actually, I think the problem is more the impact of the huge write later
> > >on.
> > Exactly. And not being able to use applications which show you IO
> > performance like Midnight Commander. You might prefer to use "cp -a" but I
> > cannot imagine my life without being able to see the progress of a copying
> > operation. With the current dirty cache there's no way to understand how
> > you storage media actually behaves.
>
>
> This is a problem I also have been suffering for a long time. It's not so much
> how much and when the systems syncs dirty data, but how unreponsive the
> desktop becomes when it happens (usually, with rsync + large files). Most
> programs become completely unreponsive, specially if they have a large memory
> consumption (ie. the browser). I need to pause rsync and wait until the
> systems writes out all dirty data if I want to do simple things like scrolling
> or do any action that uses I/O, otherwise I need to wait minutes.

That's a problem. And it's kind of independent of the dirty threshold
-- if you are doing large file copies in the background, it will lead
to continuous disk writes and stalls anyway -- the large dirty threshold
merely delays the write IO time.

> I have 16 GB of RAM and excluding the browser (which usually uses about half
> of a GB) and KDE itself, there are no memory hogs, so it seem like it's
> something that shouldn't happen.
> I can understand that I/O operations are laggy when there is some other intensive I/O ongoing, but right now the system becomes completely unresponsive. If I am unlucky and Konsole also becomes unresponsive, I need to switch to a VT (which also takes time).
>
> I haven't reported it before in part because I didn't know how to do it, "my browser stalls" is not a very useful description and I didn't know what kind of data I'm supposed to report.

What's the kernel you are running? And is it writing to a hard disk? The stalls are most likely caused by one of

1) write IO starves read IO
2) direct page reclaim blocked when
   - trying to writeout PG_dirty pages
   - trying to lock PG_writeback pages

Which may be confirmed by running

        ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32

or

        echo w > /proc/sysrq-trigger    # and check dmesg

during the stalls. The latter command works more reliably.

Thanks,
Fengguang

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Tue, 29 Oct 2013 21:49:37 +0100 Message-ID: <20131029204937.GG9568@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org To: "Artem S. Tashkinov" Return-path: Content-Disposition: inline In-Reply-To: <154617470.12445.1382725583671.JavaMail.mail@webmail11> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org

On Fri 25-10-13 18:26:23, Artem S.
Tashkinov wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> > On Fri, 25 Oct 2013, NeilBrown wrote:
> > >
> > > What exactly is bothering you about this? The amount of memory used or the time until data is flushed?
> >
> > actually, I think the problem is more the impact of the huge write later on.
>
> Exactly. And not being able to use applications which show you IO performance like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine my life without being able to see the progress of a copying operation. With the current dirty cache there's no way to understand how your storage media actually behaves.

Large writes shouldn't stall your desktop, that's certain and we must fix that. I don't find the problem with copy progress indicators that pressing...

> Hopefully this issue won't dissolve into obscurity and someone will actually make up a plan (and a patch) how to make dirty write cache behave in a sane manner considering the fact that there are devices with very different write speeds and requirements. It'd be even better if I could specify dirty cache as a mount option (though sane defaults or semi-automatic values based on runtime estimates won't hurt).
>
> Per device dirty cache seems like a nice idea, I, for one, would like to disable it altogether or make it an absolute minimum for things like USB flash drives - because I don't care about multithreaded performance or delayed allocation on such devices - I'm interested in my data reaching my USB stick ASAP - because it's how most people use them.

See my other emails in this thread. There are ways to tune the amount of dirty data allowed per device. Currently the result isn't very satisfactory but we should have something usable after the next merge window.

Honza
-- Jan Kara SUSE Labs, CR

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andreas Dilger Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Mon, 4 Nov 2013 17:50:13 -0700 Message-ID: <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> Mime-Version: 1.0 (Mac OS X Mail 7.0 \(1816\)) Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Cc: Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm To: "Artem S. Tashkinov" Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org

On Oct 25, 2013, at 2:18 AM, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote:
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel built for the i686 (with PAE) and x86-64 architectures. What's really troubling me is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4 partitions or flash drive with FAT32 partitions, the kernel first caches them in memory entirely then flushes them some time later (quite unpredictably though) or immediately upon invoking "sync".
>
> Yeah, I think we default to a 10% "dirty background memory" (and allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB of dirty memory for writeout before we even start writing, and twice that before we start *waiting* for it.
>
> On 32-bit x86, we only count the memory in the low 1GB (really actually up to about 890MB), so "10% dirty" really means just about 90MB of buffering (and a "hard limit" of ~180MB of dirty).
>
> And that "up to 3.2GB of dirty memory" is just crazy.
> Our defaults come from the old days of less memory (and perhaps servers that don't much care), and the fact that x86-32 ends up having much lower limits even if you end up having more memory.

I think the "delay writes for a long time" is a holdover from the days when e.g. /tmp was on a disk and compilers had lousy IO patterns, then they deleted the file. Today, /tmp is always in RAM, and IMHO the "write and delete" workload tested by dbench is not worthwhile optimizing for.

With Lustre, we've long taken the approach that if there is enough dirty data on a file to make a decent write (which is around 8MB today even for very fast storage) then there isn't much point to hold back for more data before starting the IO.

Any decent allocator will be able to grow allocated extents to handle following data, or allocate a new extent. At 4-8MB extents, even very seek-impaired media could do 400-800MB/s (likely much faster than the underlying storage anyway).

This also avoids wasting (tens of?) seconds of idle disk bandwidth. If the disk is already busy, then the IO will be delayed anyway. If it is not busy, then why aggregate GB of dirty data in memory before flushing it?

Something simple like "start writing at 16MB dirty on a single file" would probably avoid a lot of complexity at little real-world cost. That shouldn't throttle dirtying memory above 16MB, but just start writeout much earlier than it does today.

Cheers, Andreas

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Figo.zhang" Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Tue, 5 Nov 2013 09:40:55 +0800 Message-ID: References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <20131026074349.0adc9646@notabene.brown> <476525596.14731.1382735024280.JavaMail.mail@webmail11> <20131026091112.241da260@notabene.brown> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=047d7b5d98e9a21ba204ea641f13 Cc: "Artem S. Tashkinov" , david@lang.hm, lkml , Linus Torvalds , linux-fsdevel@vger.kernel.org, axboe@kernel.dk, Linux-MM To: NeilBrown Return-path: In-Reply-To: <20131026091112.241da260@notabene.brown> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org --047d7b5d98e9a21ba204ea641f13 Content-Type: text/plain; charset=ISO-8859-1

> > Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also not everyone in this world has a UPS - which means such a huge buffer can lead to a serious data loss in case of a power blackout.
>
> I don't have a desk (just a lap), but I use Linux on all my computers and I've never really noticed the problem. Maybe I'm just very patient, or maybe I don't work with large data sets and slow devices.
>
> However I don't think data-loss is really a related issue. Any process that cares about data safety *must* use fsync at appropriate places. This has always been true.

=> May I ask a question: with a filesystem like ext4, if an app modifies files, it creates dirty data. If a power blackout happens while metadata is being written to the journal, will serious data be lost and will the file be damaged?

--047d7b5d98e9a21ba204ea641f13 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

--047d7b5d98e9a21ba204ea641f13--

From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Lang Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Mon, 4 Nov 2013 17:47:34 -0800 (PST) Message-ID: References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <20131026074349.0adc9646@notabene.brown> <476525596.14731.1382735024280.JavaMail.mail@webmail11> <20131026091112.241da260@notabene.brown> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII Cc: NeilBrown , "Artem S. Tashkinov" , lkml , Linus Torvalds , linux-fsdevel@vger.kernel.org, axboe@kernel.dk, Linux-MM To: "Figo.zhang" Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org

On Tue, 5 Nov 2013, Figo.zhang wrote:
>>> Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also not everyone in this world has a UPS - which means such a huge buffer can lead to a serious data loss in case of a power blackout.
>>
>> I don't have a desk (just a lap), but I use Linux on all my computers and I've never really noticed the problem. Maybe I'm just very patient, or maybe I don't work with large data sets and slow devices.
>>
>> However I don't think data-loss is really a related issue. Any process that cares about data safety *must* use fsync at appropriate places. This has always been true.
>
> => May I ask a question: with a filesystem like ext4, if an app modifies files, it creates dirty data. If a power blackout happens while metadata is being written to the journal, will serious data be lost and will the file be damaged?
>

With any filesystem and any OS, if you create dirty data but do not f*sync() the data, there is a possibility that the system can go down between the time the application creates the dirty data and the time the OS actually gets it on disk. If the system goes down in this timeframe, the data will be lost and it may corrupt the file if only some of the data got written.

David Lang

From mboxrd@z Thu Jan 1 00:00:00 1970 From: NeilBrown Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Tue, 5 Nov 2013 13:08:14 +1100 Message-ID: <20131105130814.7127298d@notabene.brown> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <20131026074349.0adc9646@notabene.brown> <476525596.14731.1382735024280.JavaMail.mail@webmail11> <20131026091112.241da260@notabene.brown> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/gVWr30a.5_RNZoFmDoCa/kl"; protocol="application/pgp-signature" Cc: "Artem S. Tashkinov" , david@lang.hm, lkml , Linus Torvalds , linux-fsdevel@vger.kernel.org, axboe@kernel.dk, Linux-MM To: "Figo.zhang" Return-path: In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org --Sig_/gVWr30a.5_RNZoFmDoCa/kl Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable

On Tue, 5 Nov 2013 09:40:55 +0800 "Figo.zhang" wrote:
> > > Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also not everyone in this world has a UPS - which means such a huge buffer can lead to a serious data loss in case of a power blackout.
> >
> > I don't have a desk (just a lap), but I use Linux on all my computers and I've never really noticed the problem. Maybe I'm just very patient, or maybe I don't work with large data sets and slow devices.
> >
> > However I don't think data-loss is really a related issue. Any process that cares about data safety *must* use fsync at appropriate places. This has always been true.
>
> => May I ask a question: with a filesystem like ext4, if an app modifies files, it creates dirty data. If a power blackout happens while metadata is being written to the journal, will serious data be lost and will the file be damaged?

If you modify a file, then you must take care that you can recover from a crash at any point in the process.

If the file is small, the usual approach is to create a copy of the file with the appropriate changes made, then 'fsync' the file and rename the new file over the old file.

If the file is large you might need some sort of update log (in a small file) so you can replay recent updates after a crash.

The journalling that the filesystem provides only protects the filesystem metadata. It does not protect the consistency of the data in your file.

I hope that helps.
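
A minimal sketch of the small-file approach described above (write a modified copy, fsync it, rename it over the old file), in Python for illustration only. The helper name atomic_replace is hypothetical, and the final fsync of the containing directory is an assumption that goes beyond what the mail says; it is a common extra step on Linux to make the rename itself durable.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the file at `path` with `data` (bytes).

    After a crash the file should hold either the old or the new contents,
    never a partially written mix.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the new copy next to the old file so the rename below
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # userspace buffer -> page cache
            os.fsync(tmp.fileno())  # page cache -> storage device
        os.replace(tmp_path, path)  # rename the new file over the old one
    except BaseException:
        # Something failed before the rename: drop the temporary copy.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Assumption beyond the mail: also fsync the directory so the rename
    # itself reaches the disk (works on Linux, not on every platform).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that the temporary file is created with restrictive permissions, so a real tool would probably also copy the mode and ownership of the old file before the rename.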
NeilBrown --Sig_/gVWr30a.5_RNZoFmDoCa/kl Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIVAwUBUnhTDjnsnt1WYoG5AQL8UhAAsymCSPXOccBun3EqBLddwOqwUyyiZq2l WxRF5e8qbj1r48FPPDVzVPxDwu91n0Er+QC1D/tC2NUxwC4rx0LNqigGAcI2l3Ic JUeNfZfaO3Gm1KcNqqdk25qOa+7mJoMakkIuQ6GQX5DtefeMiUEW6svTXsKt0nGW 3qOudkFCf3hyux/NQBNKvlsk4ljbfKyaVrOCIoxmT4js/BzxHOlkB7Vj7cnRM/Q0 DasihAIzIWKFTCqQCKhB0xMwD53XjurYGKIdMfPhmjUYOh4c42wF/Hy2h9vFm9Px 6jK+LS/XCxHt/+EiAj4LEBEeyCbfKCgOabV+qsgH+qP8yR89I/k5iGTaq4+I2rib lko5VSqUdnGvUt/GbubbCAf5DvH/dcZM1sddT+/iqI9XyA9+vvVTFOHJUW1E2ZSX jYpuZiTabCcSNZQeBFrwMzxtjj0m102mLW1jbyesIGtBbR8ozDqxplZqeyMKblMH 2yLTkv7hjANpayAiBHWB1bHHrH2GjxAf/iYToeBqB4gt45+FQIjwkcUzdUFMU/bP iPnvtflafvHGaQWI99rkrN5Kaoi9UcPlKxUd+xA9EJOpFgyZGAwFvge8QzlxH5up Pxtk2RCYVWaTznzRT40qVe/2CBzwPbH2XyAWcTdQBGLFj7TzqKEp38bnakG3mB9e nq8sQ97WdMs= =PdUs -----END PGP SIGNATURE----- --Sig_/gVWr30a.5_RNZoFmDoCa/kl-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Tue, 5 Nov 2013 15:12:45 +1100 Message-ID: <20131105041245.GY6188@dastard> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: "Artem S. Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm To: Andreas Dilger Return-path: Content-Disposition: inline In-Reply-To: <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: >=20 > On Oct 25, 2013, at 2:18 AM, Linus Torvalds wrote: > > On Fri, Oct 25, 2013 at 8:25 AM, Artem S. 
Tashkinov wrote:
> >>
> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel built for the i686 (with PAE) and x86-64 architectures. What's really troubling me is that the x86-64 kernel has the following problem:
> >>
> >> When I copy large files to any storage device, be it my HDD with ext4 partitions or flash drive with FAT32 partitions, the kernel first caches them in memory entirely then flushes them some time later (quite unpredictably though) or immediately upon invoking "sync".
> >
> > Yeah, I think we default to a 10% "dirty background memory" (and allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB of dirty memory for writeout before we even start writing, and twice that before we start *waiting* for it.
> >
> > On 32-bit x86, we only count the memory in the low 1GB (really actually up to about 890MB), so "10% dirty" really means just about 90MB of buffering (and a "hard limit" of ~180MB of dirty).
> >
> > And that "up to 3.2GB of dirty memory" is just crazy. Our defaults come from the old days of less memory (and perhaps servers that don't much care), and the fact that x86-32 ends up having much lower limits even if you end up having more memory.
>
> I think the "delay writes for a long time" is a holdover from the days when e.g. /tmp was on a disk and compilers had lousy IO patterns, then they deleted the file. Today, /tmp is always in RAM, and IMHO the "write and delete" workload tested by dbench is not worthwhile optimizing for.
>
> With Lustre, we've long taken the approach that if there is enough dirty data on a file to make a decent write (which is around 8MB today even for very fast storage) then there isn't much point to hold back for more data before starting the IO.

Agreed - write-through caching is much better for high throughput streaming data environments than write back caching that can leave the devices unnecessarily idle. However, most systems are not running in high-throughput streaming data environments... :/

> Any decent allocator will be able to grow allocated extents to handle following data, or allocate a new extent. At 4-8MB extents, even very seek-impaired media could do 400-800MB/s (likely much faster than the underlying storage anyway).

True, but this makes the assumption that the filesystem you are using is optimising purely for write throughput and your storage is not seek limited on reads. That's simply not an assumption we can allow the generic writeback code to make.

In more detail, if we simply implement "we have 8 MB of dirty pages on a single file, write it" we can maximise write throughput by allocating sequentially on disk for each subsequent write. The problem with this comes when you are writing multiple files at a time, and that leads to this pattern on disk:

        ABC...ABC....ABC....ABC....

And the result is a) fragmented files b) a large number of seeks during sequential read operations and c) filesystems that age and degrade rapidly under workloads that concurrently write files with different life times (i.e. due to free space fragmentation).

In some situations this is acceptable, but the performance degradation as the filesystem ages that this sort of allocation causes in most environments is not. I'd say that >90% of filesystems out there would suffer accelerated aging as a result of doing writeback in this manner by default.

> This also avoids wasting (tens of?) seconds of idle disk bandwidth. If the disk is already busy, then the IO will be delayed anyway. If it is not busy, then why aggregate GB of dirty data in memory before flushing it?

There are plenty of workloads out there where delaying IO for a few seconds can result in writeback that is an order of magnitude faster.
Similarly, I've seen other workloads where the writeback delay results in files that can be *read* orders of magnitude faster....

> Something simple like "start writing at 16MB dirty on a single file" would probably avoid a lot of complexity at little real-world cost. That shouldn't throttle dirtying memory above 16MB, but just start writeout much earlier than it does today.

That doesn't solve the "slow device, large file" problem. We can write data into the page cache at rates of over a GB/s, so it's irrelevant to a device that can write at 5MB/s whether we start writeback immediately or a second later when there is 500MB of dirty pages in memory. AFAIK, the only way to avoid that problem is to use write-through caching for such devices - where they throttle to the IO rate at very low levels of cached data.

Realistically, there is no "one right answer" for all combinations of applications, filesystems and hardware, but writeback caching is the best *general solution* we've got right now.

However, IMO users should not need to care about tuning BDI dirty ratios or even have to understand what a BDI dirty ratio is to select the right caching method for their devices and/or workload. The difference between writeback and write through caching is easy to explain and AFAICT those two modes suffice to solve the problems being discussed here. Further, if two modes suffice to solve the problems, then we should be able to easily define a trigger to automatically switch modes.

/me notes that if we look at random vs sequential IO and the impact that has on writeback duration, then it's very similar to suddenly having a very slow device. IOWs, fadvise(RANDOM) could be used to switch an *inode* to write through mode rather than writeback mode to solve the problem of aggregating massive amounts of random write IO in the page cache...
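
As a rough illustration of what "throttle to the IO rate at very low levels of cached data" means for a writer, the sketch below approximates write-through behaviour from userspace by forcing writeback every few megabytes during a copy. This is an application-level workaround, not the kernel mechanism proposed in the mail; the function name copy_with_bounded_dirty and the 8 MiB default are assumptions chosen for illustration.

```python
import os


def copy_with_bounded_dirty(src, dst, flush_every=8 * 1024 * 1024,
                            chunk=1024 * 1024):
    """Copy `src` to `dst`, forcing writeback every `flush_every` bytes.

    Returns the number of bytes copied. Because the writer waits for the
    device after each batch, it never runs far ahead of the storage.
    """
    copied = 0
    unflushed = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
            unflushed += len(buf)
            if unflushed >= flush_every:
                fout.flush()              # userspace buffer -> page cache
                os.fsync(fout.fileno())   # wait until the device has the data
                unflushed = 0
        # Flush whatever is left before reporting completion.
        fout.flush()
        os.fsync(fout.fileno())
    return copied
```

With the figures from the mail, 500MB of dirty pages takes about 100 seconds to drain at 5MB/s, while an 8 MiB batch drains in under two seconds, so a progress counter built on `copied` would track the device rather than the page cache. The cost is the one discussed above: the filesystem gets less freedom to batch and lay out the writes.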

So rather than treating this as a "one size fits all" type of problem, let's step back and:

        a) define 2-3 different caching behaviours we consider optimal for the majority of workloads/hardware we care about.
        b) determine optimal workloads for each caching behaviour.
        c) develop reliable triggers to detect when we should switch between caching behaviours.

e.g:

        a) write back caching
                - what we have now
           write through caching
                - extremely low dirty threshold before writeback starts, enough to optimise for, say, stripe width of the underlying storage.

        b) write back caching:
                - general purpose workload
           write through caching:
                - slow device, write large file, sync
                - extremely high bandwidth devices, multi-stream sequential IO
                - random IO.

        c) write back caching:
                - default
                - fadvise(NORMAL, SEQUENTIAL, WILLNEED)
           write through caching:
                - fadvise(NOREUSE, DONTNEED, RANDOM)
                - random IO
                - sequential IO, BDI write bandwidth <<< dirty threshold
                - sequential IO, BDI write bandwidth >>> dirty threshold

I think that covers most of the issues and use cases that have been discussed in this thread. IMO, this is the level at which we need to solve the problem (i.e. architectural), not at the level of "let's add sysfs variables so we can tweak bdi ratios".

Indeed, the above implies that we need the caching behaviour to be a property of the address space, not just a property of the backing device.

IOWs, the implementation needs to trickle down from a coherent high level design - that will define the knobs that we need to expose to userspace. We should not be adding new writeback behaviours by adding knobs to sysfs without first having some clue about whether we are solving the right problem and solving it in a sane manner...

Cheers,

Dave.
-- 
Dave Chinner david@fromorbit.com

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Figo.zhang" Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Mon, 4 Nov 2013 22:32:37 -0800 Message-ID: References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary=001a11c25700dc144704ea68326d Cc: "Artem S. Tashkinov" , Wu Fengguang , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm To: Linus Torvalds Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org --001a11c25700dc144704ea68326d Content-Type: text/plain; charset=ISO-8859-1 > Yeah, I think we default to a 10% "dirty background memory" (and > allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB > of dirty memory for writeout before we even start writing, and twice > that before we start *waiting* for it. > > On 32-bit x86, we only count the memory in the low 1GB (really > actually up to about 890MB), so "10% dirty" really means just about > 90MB of buffering (and a "hard limit" of ~180MB of dirty). > => On 32-bit system, the page cache also can use the high memory, so the size of 10% "dirty background memory" maybe 1.6GB for this case. > > --001a11c25700dc144704ea68326d Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable




--001a11c25700dc144704ea68326d--

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Thu, 7 Nov 2013 14:48:06 +0100 Message-ID: <20131107134806.GB30832@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> <20131105041245.GY6188@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Andreas Dilger , "Artem S. Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm To: Dave Chinner Return-path: Received: from cantor2.suse.de ([195.135.220.15]:53255 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753386Ab3KGNsK (ORCPT ); Thu, 7 Nov 2013 08:48:10 -0500 Content-Disposition: inline In-Reply-To: <20131105041245.GY6188@dastard> Sender: linux-fsdevel-owner@vger.kernel.org List-ID:

On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > Something simple like "start writing at 16MB dirty on a single file" would probably avoid a lot of complexity at little real-world cost. That shouldn't throttle dirtying memory above 16MB, but just start writeout much earlier than it does today.
>
> That doesn't solve the "slow device, large file" problem. We can write data into the page cache at rates of over a GB/s, so it's irrelevant to a device that can write at 5MB/s whether we start writeback immediately or a second later when there is 500MB of dirty pages in memory.
> AFAIK, the only way to avoid that problem is to use write-through caching for such devices - where they throttle to the IO rate at very low levels of cached data.

Agreed.

> Realistically, there is no "one right answer" for all combinations of applications, filesystems and hardware, but writeback caching is the best *general solution* we've got right now.
>
> However, IMO users should not need to care about tuning BDI dirty ratios or even have to understand what a BDI dirty ratio is to select the right caching method for their devices and/or workload. The difference between writeback and write through caching is easy to explain and AFAICT those two modes suffice to solve the problems being discussed here. Further, if two modes suffice to solve the problems, then we should be able to easily define a trigger to automatically switch modes.
>
> /me notes that if we look at random vs sequential IO and the impact that has on writeback duration, then it's very similar to suddenly having a very slow device. IOWs, fadvise(RANDOM) could be used to switch an *inode* to write through mode rather than writeback mode to solve the problem aggregating massive amounts of random write IO in the page cache...

I disagree here. Writeback cache is also useful for aggregating random writes and making semi-sequential writes out of them. There are quite some applications which rely on the fact that they can write a file in a rather random manner (Berkeley DB, linker, ...) but the files are written out in one large linear sweep. That is actually the reason why SLES (and I believe RHEL as well) tune dirty_limit even higher than what's the default value.

So I think it's rather the other way around: If you can detect the file is being written in a streaming manner, there's not much point in caching too much data for it.
And I agree with you that we also have to be careful not to cache too little because otherwise two streaming writes would be interleaved too much. Currently, we have writeback_chunk_size() which determines how much we ask to write from a single inode. So streaming writers are going to be interleaved at this chunk size anyway (currently that number is "measured bandwidth / 2"). So it would make sense to also limit the amount of dirty cache for each file with a streaming pattern at this number.

> So rather than treating this as a "one size fits all" type of problem, let's step back and:
>
>       a) define 2-3 different caching behaviours we consider optimal for the majority of workloads/hardware we care about.
>       b) determine optimal workloads for each caching behaviour.
>       c) develop reliable triggers to detect when we should switch between caching behaviours.
>
> e.g:
>
>       a) write back caching
>               - what we have now
>          write through caching
>               - extremely low dirty threshold before writeback starts, enough to optimise for, say, stripe width of the underlying storage.
>
>       b) write back caching:
>               - general purpose workload
>          write through caching:
>               - slow device, write large file, sync
>               - extremely high bandwidth devices, multi-stream sequential IO
>               - random IO.
>
>       c) write back caching:
>               - default
>               - fadvise(NORMAL, SEQUENTIAL, WILLNEED)
>          write through caching:
>               - fadvise(NOREUSE, DONTNEED, RANDOM)
>               - random IO
>               - sequential IO, BDI write bandwidth <<< dirty threshold
>               - sequential IO, BDI write bandwidth >>> dirty threshold
>
> I think that covers most of the issues and use cases that have been discussed in this thread. IMO, this is the level at which we need to solve the problem (i.e. architectural), not at the level of "let's add sysfs variables so we can tweak bdi ratios".
>
> Indeed, the above implies that we need the caching behaviour to be a
> property of the address space, not just a property of the backing
> device.

Yes, and that would be interesting to implement and not make a mess out of the whole writeback logic because the way we currently do writeback is inherently BDI based. When we introduce some special per-inode limits, flusher threads would have to pick more carefully what to write and what not. We might be forced to go that way eventually anyway because of memcg aware writeback but it's not a simple step.

> IOWs, the implementation needs to trickle down from a coherent high
> level design - that will define the knobs that we need to expose to
> userspace. We should not be adding new writeback behaviours by
> adding knobs to sysfs without first having some clue about whether
> we are solving the right problem and solving it in a sane manner...

Agreed. But the ability to limit amount of dirty pages outstanding against a particular BDI seems as a sane one to me. It's not as flexible and automatic as the approach you suggested but it's much simpler and solves most of problems we currently have.

The biggest objection against the sysfs-tunable approach is that most people won't have a clue meaning that the tunable is useless for them. But I wonder if something like:
1) turn on strictlimit by default
2) don't allow dirty cache of BDI to grow over 5s of measured writeback speed

won't go a long way into solving our current problems without too much complication...
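For illustration only, point 2) above can be read as a simple clamp: the per-device dirty limit becomes the smaller of the global limit and five seconds' worth of the device's measured writeback bandwidth. The sketch below is not kernel code; the function name, its parameters and the sample bandwidths are hypothetical, and the 3200 MiB global limit is only a round stand-in for the "up to 3.2GB" figure mentioned earlier in the thread.

```python
MiB = 1024 * 1024


def capped_bdi_dirty_limit(global_dirty_limit_bytes,
                           measured_write_bw_bytes_per_s,
                           window_s=5.0):
    """Return a per-device dirty limit in bytes.

    Hypothetical helper: caps the limit at `window_s` seconds of the
    measured writeback bandwidth, never exceeding the global limit.
    """
    if measured_write_bw_bytes_per_s < 0 or window_s < 0:
        raise ValueError("bandwidth and window must be non-negative")
    time_based_cap = int(measured_write_bw_bytes_per_s * window_s)
    return min(global_dirty_limit_bytes, time_based_cap)


# Assumed sample numbers: a slow flash drive at ~5 MiB/s and a fast
# device at ~1000 MiB/s, both under a 3200 MiB global dirty limit.
slow = capped_bdi_dirty_limit(3200 * MiB, 5 * MiB)
fast = capped_bdi_dirty_limit(3200 * MiB, 1000 * MiB)
print(slow // MiB, fast // MiB)
```

Under these assumed numbers the slow device would be held to about 25 MiB of dirty data, while the fast one is still bounded by the global limit.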
Honza
--
Jan Kara
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Chinner
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Mon, 11 Nov 2013 14:22:11 +1100
Message-ID: <20131111032211.GT6188@dastard>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> <20131105041245.GY6188@dastard> <20131107134806.GB30832@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Andreas Dilger, "Artem S. Tashkinov", Wu Fengguang, Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm
To: Jan Kara
Return-path:
Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:46137 "EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751016Ab3KKDWh (ORCPT ); Sun, 10 Nov 2013 22:22:37 -0500
Content-Disposition: inline
In-Reply-To: <20131107134806.GB30832@quack.suse.cz>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Something simple like “start writing at 16MB dirty on a single file”
> > > would probably avoid a lot of complexity at little real-world cost.
> > > That shouldn’t throttle dirtying memory above 16MB, but just start
> > > writeout much earlier than it does today.
> >
> > That doesn't solve the "slow device, large file" problem.
We can
> > write data into the page cache at rates of over a GB/s, so it's
> > irrelevant to a device that can write at 5MB/s whether we start
> > writeback immediately or a second later when there is 500MB of dirty
> > pages in memory. AFAIK, the only way to avoid that problem is to
> > use write-through caching for such devices - where they throttle to
> > the IO rate at very low levels of cached data.
> Agreed.
>
> > Realistically, there is no "one right answer" for all combinations
> > of applications, filesystems and hardware, but writeback caching is
> > the best *general solution* we've got right now.
> >
> > However, IMO users should not need to care about tuning BDI dirty
> > ratios or even have to understand what a BDI dirty ratio is to
> > select the right caching method for their devices and/or workload.
> > The difference between writeback and write through caching is easy
> > to explain and AFAICT those two modes suffice to solve the problems
> > being discussed here. Further, if two modes suffice to solve the
> > problems, then we should be able to easily define a trigger to
> > automatically switch modes.
> >
> > /me notes that if we look at random vs sequential IO and the impact
> > that has on writeback duration, then it's very similar to suddenly
> > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > switch an *inode* to write through mode rather than writeback mode
> > to solve the problem aggregating massive amounts of random write IO
> > in the page cache...
> I disagree here. Writeback cache is also useful for aggregating random
> writes and making semi-sequential writes out of them. There are quite some
> applications which rely on the fact that they can write a file in a rather
> random manner (Berkeley DB, linker, ...) but the files are written out in
> one large linear sweep.
That is actually the reason why SLES (and I believe
> RHEL as well) tune dirty_limit even higher than what's the default value.

Right - but the correct behaviour really depends on the pattern of randomness. The common case we get into trouble with is when no clustering occurs and we end up with small, random IO for gigabytes of cached data. That's the case where write-through caching for random data is better.

It's also questionable whether writeback caching for aggregation is faster for random IO on high-IOPS devices or not. Again, I think it would depend very much on how random the patterns are...

> So I think it's rather the other way around: If you can detect the file is
> being written in a streaming manner, there's not much point in caching too
> much data for it.

But we're not talking about how much data we cache here - we are considering how much data we allow to get dirty before writing it back. It doesn't matter if we use writeback or write through caching, the page cache footprint for a given workload is likely to be similar, but without any data we can't draw any conclusions here.

> And I agree with you that we also have to be careful not
> to cache too few because otherwise two streaming writes would be
> interleaved too much. Currently, we have writeback_chunk_size() which
> determines how much we ask to write from a single inode. So streaming
> writers are going to be interleaved at this chunk size anyway (currently
> that number is "measured bandwidth / 2"). So it would make sense to also
> limit amount of dirty cache for each file with streaming pattern at this
> number.

My experience says that for streaming IO we typically need at least 5s of cached *dirty* data to even out delays and latencies in the writeback IO pipeline. Hence limiting a file to what we can write in a second given we might only write a file once a second is likely going to result in pipeline stalls...
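A rough worked example of the two quantities being traded off here may help; it is a back-of-the-envelope sketch, not a model of the writeback code. It assumes that "measured bandwidth / 2" can be read as roughly half a second of writeback, uses the 5MB/s device and 500MB of dirty pages quoted earlier in this message, and adds a made-up 100MB/s disk for comparison. The function names are hypothetical.

```python
MB = 10 ** 6  # decimal megabytes, to match the round figures in the thread


def drain_time_s(dirty_bytes, write_bw_bytes_per_s):
    """Seconds needed to flush `dirty_bytes` at a steady write bandwidth."""
    if write_bw_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return dirty_bytes / write_bw_bytes_per_s


def per_file_dirty_budget(write_bw_bytes_per_s, seconds_of_writeback):
    """Dirty bytes allowed for one streaming file, expressed as a number
    of seconds of writeback at the measured bandwidth."""
    return int(write_bw_bytes_per_s * seconds_of_writeback)


# "Slow device, large file": 500MB dirty on a 5MB/s device.
print(drain_time_s(500 * MB, 5 * MB))  # seconds a sync would have to wait

# Assumed 100MB/s disk: half a second of writeback versus five seconds.
print(per_file_dirty_budget(100 * MB, 0.5) // MB)
print(per_file_dirty_budget(100 * MB, 5) // MB)
```

With these inputs the slow device needs 100 seconds to drain, and the two per-file budgets on the assumed 100MB/s disk differ by a factor of ten (50MB versus 500MB), which is the gap in constants under discussion.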
Remember, writeback caching is about maximising throughput, not minimising latency. The "sync latency" problem with caching too much dirty data on slow block devices is really a corner case behaviour and should not compromise the common case for bulk writeback throughput.

> > Indeed, the above implies that we need the caching behaviour to be a
> > property of the address space, not just a property of the backing
> > device.
> Yes, and that would be interesting to implement and not make a mess out
> of the whole writeback logic because the way we currently do writeback is
> inherently BDI based. When we introduce some special per-inode limits,
> flusher threads would have to pick more carefully what to write and what
> not. We might be forced to go that way eventually anyway because of memcg
> aware writeback but it's not a simple step.

Agreed, it's not simple, and that's why we need to start working from the architectural level....

> > IOWs, the implementation needs to trickle down from a coherent high
> > level design - that will define the knobs that we need to expose to
> > userspace. We should not be adding new writeback behaviours by
> > adding knobs to sysfs without first having some clue about whether
> > we are solving the right problem and solving it in a sane manner...
> Agreed. But the ability to limit amount of dirty pages outstanding
> against a particular BDI seems as a sane one to me. It's not as flexible
> and automatic as the approach you suggested but it's much simpler and
> solves most of problems we currently have.

That's true, but....

> The biggest objection against the sysfs-tunable approach is that most
> people won't have a clue meaning that the tunable is useless for them.

.... that's the big problem I see - nobody is going to know how to use it, when to use it, or be able to tell if it's the root cause of some weird performance problem they are seeing.
> But I
> wonder if something like:
> 1) turn on strictlimit by default
> 2) don't allow dirty cache of BDI to grow over 5s of measured writeback
> speed
>
> won't go a long way into solving our current problems without too much
> complication...

Turning on strict limit by default is going to change behaviour quite markedly. Again, it's not something I'd want to see done without a bunch of data showing that it doesn't cause regressions for common workloads...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kara
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Mon, 11 Nov 2013 20:31:47 +0100
Message-ID: <20131111193147.GC24867@quack.suse.cz>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> <20131105041245.GY6188@dastard> <20131107134806.GB30832@quack.suse.cz> <20131111032211.GT6188@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Jan Kara, Andreas Dilger, "Artem S. Tashkinov", Wu Fengguang, Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm
To: Dave Chinner
Return-path:
Content-Disposition: inline
In-Reply-To: <20131111032211.GT6188@dastard>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Mon 11-11-13 14:22:11, Dave Chinner wrote:
> On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote:
> > On Tue 05-11-13 15:12:45, Dave Chinner wrote:
> > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
> > > Realistically, there is no "one right answer" for all combinations
> > > of applications, filesystems and hardware, but writeback caching is
> > > the best *general solution* we've got right now.
> > >
> > > However, IMO users should not need to care about tuning BDI dirty
> > > ratios or even have to understand what a BDI dirty ratio is to
> > > select the right caching method for their devices and/or workload.
> > > The difference between writeback and write through caching is easy
> > > to explain and AFAICT those two modes suffice to solve the problems
> > > being discussed here. Further, if two modes suffice to solve the
> > > problems, then we should be able to easily define a trigger to
> > > automatically switch modes.
> > >
> > > /me notes that if we look at random vs sequential IO and the impact
> > > that has on writeback duration, then it's very similar to suddenly
> > > having a very slow device. IOWs, fadvise(RANDOM) could be used to
> > > switch an *inode* to write through mode rather than writeback mode
> > > to solve the problem aggregating massive amounts of random write IO
> > > in the page cache...
> > I disagree here. Writeback cache is also useful for aggregating random
> > writes and making semi-sequential writes out of them. There are quite some
> > applications which rely on the fact that they can write a file in a rather
> > random manner (Berkeley DB, linker, ...) but the files are written out in
> > one large linear sweep. That is actually the reason why SLES (and I believe
> > RHEL as well) tune dirty_limit even higher than what's the default value.
>
> Right - but the correct behaviour really depends on the pattern of
> randomness. The common case we get into trouble with is when no
> clustering occurs and we end up with small, random IO for gigabytes
> of cached data. That's the case where write-through caching for
> random data is better.
>
> It's also questionable whether writeback caching for aggregation is
> faster for random IO on high-IOPS devices or not. Again, I think it
> would depend very much on how random the patterns are...
I agree that the usefulness of writeback caching for random IO very much depends on the working set size vs cache size, how random the accesses really are, and HW characteristics. I just wanted to point out there are fairly common workloads & setups where writeback caching for semi-random IO really helps (because you seemed to suggest that random IO implies we should disable writeback cache).

> > So I think it's rather the other way around: If you can detect the file is
> > being written in a streaming manner, there's not much point in caching too
> > much data for it.
>
> But we're not talking about how much data we cache here - we are
> considering how much data we allow to get dirty before writing it
> back.

Sorry, I was imprecise here. I really meant that IMO it doesn't make sense to allow too much dirty data for sequentially written files.

> It doesn't matter if we use writeback or write through
> caching, the page cache footprint for a given workload is likely to
> be similar, but without any data we can't draw any conclusions here.
>
> > And I agree with you that we also have to be careful not
> > to cache too few because otherwise two streaming writes would be
> > interleaved too much. Currently, we have writeback_chunk_size() which
> > determines how much we ask to write from a single inode. So streaming
> > writers are going to be interleaved at this chunk size anyway (currently
> > that number is "measured bandwidth / 2"). So it would make sense to also
> > limit amount of dirty cache for each file with streaming pattern at this
> > number.
>
> My experience says that for streaming IO we typically need at least
> 5s of cached *dirty* data to even out delays and latencies in the
> writeback IO pipeline. Hence limiting a file to what we can write in
> a second given we might only write a file once a second is likely
> going to result in pipeline stalls...

I guess this begs for real data. We agree in principle but differ in constants :).
> Remember, writeback caching is about maximising throughput, not
> minimising latency. The "sync latency" problem with caching too much
> dirty data on slow block devices is really a corner case behaviour
> and should not compromise the common case for bulk writeback
> throughput.

Agreed. As a primary goal we want to maximise throughput. But we want to maintain sane latency as well (e.g. because we have a "promise" of "dirty_writeback_centisecs" we have to cycle through dirty inodes reasonably frequently).

> > Agreed. But the ability to limit amount of dirty pages outstanding
> > against a particular BDI seems as a sane one to me. It's not as flexible
> > and automatic as the approach you suggested but it's much simpler and
> > solves most of problems we currently have.
>
> That's true, but....
>
> > The biggest objection against the sysfs-tunable approach is that most
> > people won't have a clue meaning that the tunable is useless for them.
>
> .... that's the big problem I see - nobody is going to know how to
> use it, when to use it, or be able to tell if it's the root cause of
> some weird performance problem they are seeing.
>
> > But I
> > wonder if something like:
> > 1) turn on strictlimit by default
> > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback
> > speed
> >
> > won't go a long way into solving our current problems without too much
> > complication...
>
> Turning on strict limit by default is going to change behaviour
> quite markedly. Again, it's not something I'd want to see done
> without a bunch of data showing that it doesn't cause regressions
> for common workloads...

Agreed.

Honza
--
Jan Kara
SUSE Labs, CR

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Diego Calleja
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Date: Fri, 15 Nov 2013 16:48:13 +0100
Message-ID: <3934111.dEm1hrGs4E@diego-arch>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1999200.Zdacx0scmY@diego-arch> <20131025233225.GA32051@localhost>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Cc: "Artem S. Tashkinov", david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org
To: Fengguang Wu
Return-path:
In-Reply-To: <20131025233225.GA32051@localhost>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On Saturday, 26 October 2013 00:32:25, Fengguang Wu wrote:
> What's the kernel you are running? And it's writing to a hard disk?
> The stalls are most likely caused by either one of
>
> 1) write IO starves read IO
> 2) direct page reclaim blocked when
>    - trying to writeout PG_dirty pages
>    - trying to lock PG_writeback pages
>
> Which may be confirmed by running
>
>         ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
> or
>         echo w > /proc/sysrq-trigger    # and check dmesg
>
> during the stalls. The latter command works more reliably.
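As a rough userspace counterpart to the ps invocation quoted above, the sketch below lists processes currently in uninterruptible sleep (state D) by reading /proc/<pid>/stat directly. It is only an illustration: the function names are made up here, it shows neither wchan nor kernel stacks, and it is no substitute for the sysrq-w dump, which the quoted text describes as the more reliable option.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so it is located via the last ')' on the line.
    """
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in state 'D'."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unparsable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Running it repeatedly during a stall gives a coarse view of which processes are stuck waiting on IO; thread-level detail would need /proc/<pid>/task/ as well, which this sketch leaves out.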
Sorry for the delay (background: rsync'ing large files from/to a hard disk in a desktop with 16GB of RAM makes the whole desktop unresponsive)

I just triggered it today (running 3.12), and ran sysrq-w:

[ 5547.001505] SysRq : Show Blocked State
[ 5547.001509] task PC stack pid father
[ 5547.001516] btrfs-transacti D ffff880425d7a8a0 0 193 2 0x00000000
[ 5547.001519] ffff880425eede10 0000000000000002 ffff880425eedfd8 0000000000012e40
[ 5547.001521] ffff880425eedfd8 0000000000012e40 ffff880425d7a8a0 ffffea00104baa80
[ 5547.001523] ffff880425eedd90 ffff880425eedd68 ffff880425eedd70 ffffffff81080edd
[ 5547.001525] Call Trace:
[ 5547.001530] [] ? get_parent_ip+0xd/0x50
[ 5547.001533] [] ? sub_preempt_count+0x49/0x50
[ 5547.001535] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.001552] [] ? btrfs_run_ordered_operations+0x212/0x2c0 [btrfs]
[ 5547.001554] [] ? get_parent_ip+0xd/0x50
[ 5547.001556] [] ? sub_preempt_count+0x49/0x50
[ 5547.001557] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001559] [] schedule+0x29/0x70
[ 5547.001566] [] btrfs_commit_transaction+0x265/0x9d0 [btrfs]
[ 5547.001569] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.001575] [] transaction_kthread+0x19d/0x220 [btrfs]
[ 5547.001581] [] ? free_fs_root+0xc0/0xc0 [btrfs]
[ 5547.001583] [] kthread+0xc0/0xd0
[ 5547.001585] [] ? kthread_create_on_node+0x120/0x120
[ 5547.001587] [] ret_from_fork+0x7c/0xb0
[ 5547.001588] [] ? kthread_create_on_node+0x120/0x120
[ 5547.001590] systemd-journal D ffff880426e19860 0 234 1 0x00000000
[ 5547.001592] ffff880426d77d90 0000000000000002 ffff880426d77fd8 0000000000012e40
[ 5547.001593] ffff880426d77fd8 0000000000012e40 ffff880426e19860 ffffffff8155d7cd
[ 5547.001595] 0000000000000001 0000000000000001 0000000000000000 ffffffff81572560
[ 5547.001596] Call Trace:
[ 5547.001598] [] ? retint_restore_args+0xe/0xe
[ 5547.001601] [] ? queue_unplugged+0x3b/0xe0
[ 5547.001602] [] ?
blk_flush_plug_list+0x1eb/0x230 [ 5547.001604] [] schedule+0x29/0x70 [ 5547.001606] [] schedule_preempt_disabled+0x18/0x3= 0 [ 5547.001607] [] __mutex_lock_slowpath+0x124/0x1f0 [ 5547.001613] [] ? btrfs_write_marked_extents+0xbb/= 0xe0 [btrfs] [ 5547.001615] [] mutex_lock+0x17/0x30 [ 5547.001623] [] btrfs_sync_log+0x22a/0x690 [btrfs]= [ 5547.001630] [] btrfs_sync_file+0x287/0x2e0 [btrfs= ] [ 5547.001632] [] do_fsync+0x56/0x80 [ 5547.001634] [] SyS_fsync+0x10/0x20 [ 5547.001635] [] tracesys+0xdd/0xe2 [ 5547.001644] mysqld D ffff8803f0901860 0 643 579 0x= 00000000 [ 5547.001645] ffff8803f090de18 0000000000000002 ffff8803f090dfd8 0000= 000000012e40 [ 5547.001647] ffff8803f090dfd8 0000000000012e40 ffff8803f0901860 ffff= 88016d038000 [ 5547.001648] ffff880426908d00 0000000024119d80 0000000000000000 0000= 000000000000 [ 5547.001650] Call Trace: [ 5547.001657] [] ? btrfs_submit_bio_hook+0x84/0x1f0= [btrfs] [ 5547.001659] [] ? get_parent_ip+0xd/0x50 [ 5547.001660] [] ? sub_preempt_count+0x49/0x50 [ 5547.001662] [] ? _raw_spin_unlock_irqrestore+0x26= /0x60 [ 5547.001663] [] schedule+0x29/0x70 [ 5547.001669] [] wait_current_trans.isra.17+0xbf/0x= 120 [btrfs] [ 5547.001671] [] ? wake_up_atomic_t+0x30/0x30 [ 5547.001677] [] start_transaction+0x37f/0x570 [btr= fs] [ 5547.001680] [] ? do_writepages+0x1e/0x40 [ 5547.001686] [] btrfs_start_transaction+0x1b/0x20 = [btrfs] [ 5547.001693] [] btrfs_sync_file+0x17f/0x2e0 [btrfs= ] [ 5547.001694] [] do_fsync+0x56/0x80 [ 5547.001696] [] SyS_fdatasync+0x13/0x20 [ 5547.001697] [] tracesys+0xdd/0xe2 [ 5547.001701] virtuoso-t D ffff88000310b0c0 0 617 609 0x= 00000000 [ 5547.001702] ffff8803f4867c20 0000000000000002 ffff8803f4867fd8 0000= 000000012e40 [ 5547.001704] ffff8803f4867fd8 0000000000012e40 ffff88000310b0c0 ffff= ffff813ce4af [ 5547.001705] ffffffff81860520 ffff8802d8ad8a00 ffff8803f4867ba0 ffff= ffff81231a0e [ 5547.001707] Call Trace: [ 5547.001709] [] ? scsi_pool_alloc_command+0x3f/0x8= 0 [ 5547.001712] [] ? 
__blk_segment_map_sg+0x4e/0x120 [ 5547.001713] [] ? blk_rq_map_sg+0x8b/0x1f0 [ 5547.001716] [] ? cfq_dispatch_requests+0xba/0xc40= [ 5547.001718] [] ? get_parent_ip+0xd/0x50 [ 5547.001721] [] ? filemap_fdatawait+0x30/0x30 [ 5547.001722] [] schedule+0x29/0x70 [ 5547.001723] [] io_schedule+0x8f/0xe0 [ 5547.001725] [] sleep_on_page+0xe/0x20 [ 5547.001727] [] __wait_on_bit+0x62/0x90 [ 5547.001728] [] wait_on_page_bit+0x7f/0x90 [ 5547.001730] [] ? wake_atomic_t_function+0x40/0x40= [ 5547.001732] [] filemap_fdatawait_range+0x11b/0x1a= 0 [ 5547.001734] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.001740] [] btrfs_wait_marked_extents+0x87/0xe= 0 [btrfs] [ 5547.001747] [] btrfs_sync_log+0x4e8/0x690 [btrfs]= [ 5547.001754] [] btrfs_sync_file+0x287/0x2e0 [btrfs= ] [ 5547.001756] [] do_fsync+0x56/0x80 [ 5547.001758] [] SyS_fsync+0x10/0x20 [ 5547.001759] [] tracesys+0xdd/0xe2 [ 5547.001761] pool D ffff88040db1c100 0 657 477 0x= 00000000 [ 5547.001763] ffff8803ee809ba0 0000000000000002 ffff8803ee809fd8 0000= 000000012e40 [ 5547.001764] ffff8803ee809fd8 0000000000012e40 ffff88040db1c100 0000= 000000000004 [ 5547.001766] ffff8803ee809ae8 ffffffff8155cc86 ffff8803ee809bd0 ffff= ffffa005ada4 [ 5547.001767] Call Trace: [ 5547.001769] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.001775] [] ? reserve_metadata_bytes+0x184/0x9= 30 [btrfs] [ 5547.001776] [] ? get_parent_ip+0xd/0x50 [ 5547.001778] [] ? sub_preempt_count+0x49/0x50 [ 5547.001779] [] ? get_parent_ip+0xd/0x50 [ 5547.001781] [] ? sub_preempt_count+0x49/0x50 [ 5547.001783] [] ? _raw_spin_unlock_irqrestore+0x26= /0x60 [ 5547.001784] [] schedule+0x29/0x70 [ 5547.001790] [] wait_current_trans.isra.17+0xbf/0x= 120 [btrfs] [ 5547.001792] [] ? wake_up_atomic_t+0x30/0x30 [ 5547.001798] [] start_transaction+0x37f/0x570 [btr= fs] [ 5547.001804] [] btrfs_start_transaction+0x1b/0x20 = [btrfs] [ 5547.001810] [] btrfs_create+0x3b/0x200 [btrfs] [ 5547.001813] [] ? 
security_inode_permission+0x1c/0= x30 [ 5547.001815] [] vfs_create+0xb4/0x120 [ 5547.001817] [] do_last+0x904/0xea0 [ 5547.001818] [] ? link_path_walk+0x70/0x930 [ 5547.001820] [] ? get_parent_ip+0xd/0x50 [ 5547.001822] [] ? security_file_alloc+0x16/0x20 [ 5547.001824] [] path_openat+0xbb/0x6b0 [ 5547.001827] [] ? __acct_update_integrals+0x7f/0x1= 00 [ 5547.001829] [] ? account_system_time+0xa2/0x180 [ 5547.001831] [] ? get_parent_ip+0xd/0x50 [ 5547.001833] [] do_filp_open+0x3a/0x90 [ 5547.001834] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.001836] [] ? __alloc_fd+0xa7/0x130 [ 5547.001839] [] do_sys_open+0x129/0x220 [ 5547.001842] [] ? syscall_trace_enter+0x135/0x230 [ 5547.001844] [] SyS_open+0x1e/0x20 [ 5547.001845] [] tracesys+0xdd/0xe2 [ 5547.001850] akregator D ffff8803ed1d4100 0 875 1 0x= 00000000 [ 5547.001851] ffff8803c7f1bba0 0000000000000002 ffff8803c7f1bfd8 0000= 000000012e40 [ 5547.001853] ffff8803c7f1bfd8 0000000000012e40 ffff8803ed1d4100 0000= 000000000004 [ 5547.001854] ffff8803c7f1bae8 ffffffff8155cc86 ffff8803c7f1bbd0 ffff= ffffa005ada4 [ 5547.001856] Call Trace: [ 5547.001858] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.001863] [] ? reserve_metadata_bytes+0x184/0x9= 30 [btrfs] [ 5547.001865] [] ? get_parent_ip+0xd/0x50 [ 5547.001866] [] ? sub_preempt_count+0x49/0x50 [ 5547.001868] [] ? get_parent_ip+0xd/0x50 [ 5547.001870] [] ? sub_preempt_count+0x49/0x50 [ 5547.001871] [] ? _raw_spin_unlock_irqrestore+0x26= /0x60 [ 5547.001873] [] schedule+0x29/0x70 [ 5547.001879] [] wait_current_trans.isra.17+0xbf/0x= 120 [btrfs] [ 5547.001881] [] ? wake_up_atomic_t+0x30/0x30 [ 5547.001886] [] start_transaction+0x37f/0x570 [btr= fs] [ 5547.001888] [] ? get_parent_ip+0xd/0x50 [ 5547.001894] [] btrfs_start_transaction+0x1b/0x20 = [btrfs] [ 5547.001900] [] btrfs_create+0x3b/0x200 [btrfs] [ 5547.001902] [] ? security_inode_permission+0x1c/0= x30 [ 5547.001904] [] vfs_create+0xb4/0x120 [ 5547.001906] [] do_last+0x904/0xea0 [ 5547.001907] [] ? 
link_path_walk+0x70/0x930 [ 5547.001909] [] ? get_parent_ip+0xd/0x50 [ 5547.001911] [] ? security_file_alloc+0x16/0x20 [ 5547.001912] [] path_openat+0xbb/0x6b0 [ 5547.001914] [] ? __acct_update_integrals+0x7f/0x1= 00 [ 5547.001916] [] ? account_system_time+0xa2/0x180 [ 5547.001918] [] ? get_parent_ip+0xd/0x50 [ 5547.001920] [] do_filp_open+0x3a/0x90 [ 5547.001921] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.001923] [] ? __alloc_fd+0xa7/0x130 [ 5547.001925] [] do_sys_open+0x129/0x220 [ 5547.001927] [] ? syscall_trace_enter+0x135/0x230 [ 5547.001928] [] SyS_open+0x1e/0x20 [ 5547.001930] [] tracesys+0xdd/0xe2 [ 5547.001931] mpegaudioparse3 D ffff880341d10820 0 5917 1 0x= 00000000 [ 5547.001933] ffff88030f779ce0 0000000000000002 ffff88030f779fd8 0000= 000000012e40 [ 5547.001934] ffff88030f779fd8 0000000000012e40 ffff880341d10820 ffff= ffff81122a28 [ 5547.001936] ffff88043e5ddc00 ffff880400000002 ffff88043e2138d0 0000= 000000000000 [ 5547.001938] Call Trace: [ 5547.001939] [] ? __alloc_pages_nodemask+0x158/0xb= 00 [ 5547.001941] [] ? native_send_call_func_single_ipi= +0x35/0x40 [ 5547.001943] [] ? generic_exec_single+0x98/0xa0 [ 5547.001945] [] ? __enqueue_entity+0x78/0x80 [ 5547.001947] [] ? enqueue_entity+0x197/0x780 [ 5547.001948] [] ? get_parent_ip+0xd/0x50 [ 5547.001950] [] ? sleep_on_page+0x20/0x20 [ 5547.001951] [] schedule+0x29/0x70 [ 5547.001953] [] io_schedule+0x8f/0xe0 [ 5547.001954] [] sleep_on_page_killable+0xe/0x40 [ 5547.001956] [] __wait_on_bit_lock+0x5d/0xc0 [ 5547.001958] [] __lock_page_killable+0x6a/0x70 [ 5547.001960] [] ? 
wake_atomic_t_function+0x40/0x40= [ 5547.001961] [] generic_file_aio_read+0x435/0x700 [ 5547.001963] [] do_sync_read+0x5a/0x90 [ 5547.001965] [] vfs_read+0x9a/0x170 [ 5547.001967] [] SyS_read+0x49/0xa0 [ 5547.001968] [] tracesys+0xdd/0xe2 [ 5547.001970] mozStorage #2 D ffff8803b7aa1860 0 920 477 0x= 00000000 [ 5547.001972] ffff8803b1473d80 0000000000000002 ffff8803b1473fd8 0000= 000000012e40 [ 5547.001974] ffff8803b1473fd8 0000000000012e40 ffff8803b7aa1860 0000= 000000000004 [ 5547.001975] ffff8803b1473cc8 ffffffff8155cc86 ffff8803b1473db0 ffff= ffffa005ada4 [ 5547.001977] Call Trace: [ 5547.001978] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.001984] [] ? reserve_metadata_bytes+0x184/0x9= 30 [btrfs] [ 5547.001990] [] ? __btrfs_buffered_write+0x3d9/0x4= 90 [btrfs] [ 5547.001992] [] ? get_parent_ip+0xd/0x50 [ 5547.001994] [] ? sub_preempt_count+0x49/0x50 [ 5547.001995] [] ? _raw_spin_unlock_irqrestore+0x26= /0x60 [ 5547.001997] [] schedule+0x29/0x70 [ 5547.002003] [] wait_current_trans.isra.17+0xbf/0x= 120 [btrfs] [ 5547.002004] [] ? wake_up_atomic_t+0x30/0x30 [ 5547.002010] [] start_transaction+0x37f/0x570 [btr= fs] [ 5547.002016] [] btrfs_start_transaction+0x1b/0x20 = [btrfs] [ 5547.002023] [] btrfs_setattr+0x101/0x290 [btrfs] [ 5547.002025] [] ? rcu_eqs_enter+0x5c/0xa0 [ 5547.002027] [] notify_change+0x1dc/0x360 [ 5547.002029] [] ? sub_preempt_count+0x49/0x50 [ 5547.002030] [] do_truncate+0x6b/0xa0 [ 5547.002032] [] ? __sb_start_write+0x49/0x100 [ 5547.002033] [] SyS_ftruncate+0x10b/0x160 [ 5547.002035] [] tracesys+0xdd/0xe2 [ 5547.002036] Cache I/O D ffff8803b7aa28a0 0 922 477 0x= 00000000 [ 5547.002038] ffff8803b1495e18 0000000000000002 ffff8803b1495fd8 0000= 000000012e40 [ 5547.002039] ffff8803b1495fd8 0000000000012e40 ffff8803b7aa28a0 ffff= 8803b1495e08 [ 5547.002041] ffff8803b1495db0 ffffffff8111a25a ffff8803b1495e40 ffff= 8803b1495df0 [ 5547.002043] Call Trace: [ 5547.002045] [] ? find_get_pages_tag+0xea/0x180 [ 5547.002047] [] ? 
get_parent_ip+0xd/0x50
[ 5547.002048] [] ? sub_preempt_count+0x49/0x50
[ 5547.002050] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002051] [] schedule+0x29/0x70
[ 5547.002057] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002059] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.002065] [] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002071] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002077] [] btrfs_sync_file+0x17f/0x2e0 [btrfs]
[ 5547.002079] [] do_fsync+0x56/0x80
[ 5547.002080] [] SyS_fsync+0x10/0x20
[ 5547.002081] [] tracesys+0xdd/0xe2
[ 5547.002083] mozStorage #6 D ffff8803c0cfa8a0 0 982 477 0x00000000
[ 5547.002085] ffff8803a10f5ba0 0000000000000002 ffff8803a10f5fd8 0000000000012e40
[ 5547.002086] ffff8803a10f5fd8 0000000000012e40 ffff8803c0cfa8a0 0000000000000004
[ 5547.002088] ffff8803a10f5ae8 ffffffff8155cc86 ffff8803a10f5bd0 ffffffffa005ada4
[ 5547.002089] Call Trace:
[ 5547.002091] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.002096] [] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.002098] [] ? native_smp_send_reschedule+0x47/0x60
[ 5547.002100] [] ? resched_task+0x5c/0x60
[ 5547.002101] [] ? get_parent_ip+0xd/0x50
[ 5547.002103] [] ? sub_preempt_count+0x49/0x50
[ 5547.002104] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002106] [] schedule+0x29/0x70
[ 5547.002112] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002113] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.002119] [] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002125] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002131] [] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.002133] [] ? security_inode_permission+0x1c/0x30
[ 5547.002134] [] vfs_create+0xb4/0x120
[ 5547.002136] [] do_last+0x904/0xea0
[ 5547.002138] [] ? link_path_walk+0x70/0x930
[ 5547.002139] [] ? get_parent_ip+0xd/0x50
[ 5547.002141] [] ? security_file_alloc+0x16/0x20
[ 5547.002143] [] path_openat+0xbb/0x6b0
[ 5547.002145] [] ? __acct_update_integrals+0x7f/0x100
[ 5547.002147] [] ? account_system_time+0xa2/0x180
[ 5547.002148] [] ? get_parent_ip+0xd/0x50
[ 5547.002150] [] do_filp_open+0x3a/0x90
[ 5547.002152] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.002153] [] ? __alloc_fd+0xa7/0x130
[ 5547.002155] [] do_sys_open+0x129/0x220
[ 5547.002157] [] ? syscall_trace_enter+0x135/0x230
[ 5547.002159] [] SyS_open+0x1e/0x20
[ 5547.002160] [] tracesys+0xdd/0xe2
[ 5547.002164] rsync D ffff8802dcde0820 0 5803 5802 0x00000000
[ 5547.002165] ffff8802daeb1a90 0000000000000002 ffff8802daeb1fd8 0000000000012e40
[ 5547.002167] ffff8802daeb1fd8 0000000000012e40 ffff8802dcde0820 ffff880100000002
[ 5547.002169] ffff8802daeb19e0 ffffffff81080edd ffff880308b337e0 0000000000000000
[ 5547.002170] Call Trace:
[ 5547.002172] [] ? get_parent_ip+0xd/0x50
[ 5547.002173] [] ? get_parent_ip+0xd/0x50
[ 5547.002175] [] ? sub_preempt_count+0x49/0x50
[ 5547.002177] [] ? get_parent_ip+0xd/0x50
[ 5547.002178] [] ? add_preempt_count+0x3d/0x40
[ 5547.002180] [] ? get_parent_ip+0xd/0x50
[ 5547.002181] [] schedule+0x29/0x70
[ 5547.002182] [] schedule_timeout+0x11a/0x230
[ 5547.002185] [] ? detach_if_pending+0x120/0x120
[ 5547.002187] [] ? ktime_get_ts+0x48/0xe0
[ 5547.002189] [] io_schedule_timeout+0x9b/0xf0
[ 5547.002191] [] balance_dirty_pages_ratelimited+0x3d9/0xa10
[ 5547.002198] [] ? ext4_dirty_inode+0x54/0x60 [ext4]
[ 5547.002200] [] generic_file_buffered_write+0x1b8/0x290
[ 5547.002202] [] __generic_file_aio_write+0x1a9/0x3b0
[ 5547.002203] [] generic_file_aio_write+0x58/0xa0
[ 5547.002208] [] ext4_file_write+0x99/0x3e0 [ext4]
[ 5547.002210] [] ? acct_account_cputime+0x1c/0x20
[ 5547.002212] [] ? account_system_time+0xa2/0x180
[ 5547.002213] [] ? get_parent_ip+0xd/0x50
[ 5547.002215] [] ? get_parent_ip+0xd/0x50
[ 5547.002216] [] do_sync_write+0x5a/0x90
[ 5547.002218] [] vfs_write+0xbd/0x1e0
[ 5547.002220] [] SyS_write+0x49/0xa0
[ 5547.002221] [] tracesys+0xdd/0xe2
[ 5547.002223] ktorrent D ffff8802e7680820 0 5806 1 0x00000000
[ 5547.002224] ffff8802daf7fba0 0000000000000002 ffff8802daf7ffd8 0000000000012e40
[ 5547.002226] ffff8802daf7ffd8 0000000000012e40 ffff8802e7680820 0000000000000004
[ 5547.002227] ffff8802daf7fae8 ffffffff8155cc86 ffff8802daf7fbd0 ffffffffa005ada4
[ 5547.002229] Call Trace:
[ 5547.002230] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.002236] [] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.002241] [] ? btrfs_set_path_blocking+0x39/0x80 [btrfs]
[ 5547.002246] [] ? btrfs_search_slot+0x498/0x970 [btrfs]
[ 5547.002247] [] ? get_parent_ip+0xd/0x50
[ 5547.002249] [] ? sub_preempt_count+0x49/0x50
[ 5547.002251] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.002252] [] schedule+0x29/0x70
[ 5547.002258] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.002260] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.002266] [] start_transaction+0x37f/0x570 [btrfs]
[ 5547.002268] [] ? sub_preempt_count+0x49/0x50
[ 5547.002273] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.002280] [] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.002281] [] ? security_inode_permission+0x1c/0x30
[ 5547.002283] [] vfs_create+0xb4/0x120
[ 5547.002285] [] do_last+0x904/0xea0
[ 5547.002287] [] ? link_path_walk+0x70/0x930
[ 5547.002288] [] ? get_parent_ip+0xd/0x50
[ 5547.002290] [] ? security_file_alloc+0x16/0x20
[ 5547.002292] [] path_openat+0xbb/0x6b0
[ 5547.002293] [] ? __acct_update_integrals+0x7f/0x100
[ 5547.002295] [] ? account_system_time+0xa2/0x180
[ 5547.002297] [] ? get_parent_ip+0xd/0x50
[ 5547.002299] [] do_filp_open+0x3a/0x90
[ 5547.002300] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.002302] [] ? __alloc_fd+0xa7/0x130
[ 5547.002304] [] do_sys_open+0x129/0x220
[ 5547.002306] [] ?
syscall_trace_enter+0x135/0x230
[ 5547.002307] [] SyS_open+0x1e/0x20
[ 5547.002309] [] tracesys+0xdd/0xe2
[ 5547.002311] kworker/u16:0 D ffff88035c5ac920 0 6043 2 0x00000000
[ 5547.002313] Workqueue: writeback bdi_writeback_workfn (flush-8:32)
[ 5547.002315] ffff88036c9cb898 0000000000000002 ffff88036c9cbfd8 0000000000012e40
[ 5547.002316] ffff88036c9cbfd8 0000000000012e40 ffff88035c5ac920 ffff8804281de048
[ 5547.002318] ffff88036c9cb7e8 ffffffff81080edd 0000000000000001 ffff88036c9cb800
[ 5547.002319] Call Trace:
[ 5547.002321] [] ? get_parent_ip+0xd/0x50
[ 5547.002323] [] ? sub_preempt_count+0x49/0x50
[ 5547.002324] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.002326] [] ? queue_unplugged+0x3b/0xe0
[ 5547.002328] [] schedule+0x29/0x70
[ 5547.002329] [] io_schedule+0x8f/0xe0
[ 5547.002331] [] get_request+0x1aa/0x780
[ 5547.002332] [] ? ioc_lookup_icq+0x4e/0x80
[ 5547.002334] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.002336] [] blk_queue_bio+0x78/0x3e0
[ 5547.002337] [] generic_make_request+0xc2/0x110
[ 5547.002338] [] submit_bio+0x73/0x160
[ 5547.002344] [] ext4_io_submit+0x25/0x50 [ext4]
[ 5547.002348] [] ext4_writepages+0x823/0xe00 [ext4]
[ 5547.002350] [] do_writepages+0x1e/0x40
[ 5547.002352] [] __writeback_single_inode+0x40/0x330
[ 5547.002353] [] writeback_sb_inodes+0x262/0x450
[ 5547.002355] [] __writeback_inodes_wb+0x9f/0xd0
[ 5547.002357] [] wb_writeback+0x32b/0x360
[ 5547.002358] [] bdi_writeback_workfn+0x221/0x510
[ 5547.002361] [] process_one_work+0x167/0x450
[ 5547.002362] [] worker_thread+0x121/0x3a0
[ 5547.002364] [] ? sub_preempt_count+0x49/0x50
[ 5547.002366] [] ? manage_workers.isra.25+0x2a0/0x2a0
[ 5547.002367] [] kthread+0xc0/0xd0
[ 5547.002369] [] ? kthread_create_on_node+0x120/0x120
[ 5547.002371] [] ret_from_fork+0x7c/0xb0
[ 5547.002372] [] ? kthread_create_on_node+0x120/0x120

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f43.google.com (mail-pb0-f43.google.com [209.85.160.43]) by kanga.kvack.org (Postfix) with ESMTP id E6A536B00DD for ; Fri, 25 Oct 2013 19:32:35 -0400 (EDT)
Received: by mail-pb0-f43.google.com with SMTP id md12so4760573pbc.2 for ; Fri, 25 Oct 2013 16:32:35 -0700 (PDT)
Received: from psmtp.com ([74.125.245.137]) by mx.google.com with SMTP id ei3si5398912pbc.350.2013.10.25.16.32.34 for ; Fri, 25 Oct 2013 16:32:35 -0700 (PDT)
Date: Sat, 26 Oct 2013 00:32:25 +0100
From: Fengguang Wu
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131025233225.GA32051@localhost>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <1999200.Zdacx0scmY@diego-arch>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1999200.Zdacx0scmY@diego-arch>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Diego Calleja
Cc: "Artem S. Tashkinov" , david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org

On Fri, Oct 25, 2013 at 09:40:13PM +0200, Diego Calleja wrote:
> On Friday, 25 October 2013 18:26:23, Artem S. Tashkinov wrote:
> > Oct 25, 2013 05:26:45 PM, david wrote:
> > >actually, I think the problem is more the impact of the huge write later
> > >on.
> > Exactly. And not being able to use applications which show you IO
> > performance like Midnight Commander. You might prefer to use "cp -a" but I
> > cannot imagine my life without being able to see the progress of a copying
> > operation. With the current dirty cache there's no way to understand how
> > your storage media actually behaves.
>
>
> This is a problem I also have been suffering for a long time.
It's not so much
> how much and when the system syncs dirty data, but how unresponsive the
> desktop becomes when it happens (usually, with rsync + large files). Most
> programs become completely unresponsive, especially if they have a large memory
> consumption (i.e. the browser). I need to pause rsync and wait until the
> system writes out all dirty data if I want to do simple things like scrolling
> or do any action that uses I/O, otherwise I need to wait minutes.

That's a problem. And it's kind of independent of the dirty threshold
-- if you are doing large file copies in the background, it will lead
to continuous disk writes and stalls anyway -- the large dirty threshold
merely delays the write IO time.

> I have 16 GB of RAM and excluding the browser (which usually uses about half
> of a GB) and KDE itself, there are no memory hogs, so it seems like it's
> something that shouldn't happen. I can understand that I/O operations are
> laggy when there is some other intensive I/O ongoing, but right now the system
> becomes completely unresponsive. If I am unlucky and Konsole also becomes
> unresponsive, I need to switch to a VT (which also takes time).
>
> I haven't reported it before in part because I didn't know how to do it, "my
> browser stalls" is not a very useful description and I didn't know what kind
> of data I'm supposed to report.

What's the kernel you are running? And it's writing to a hard disk?
The stalls are most likely caused by either one of

1) write IO starves read IO
2) direct page reclaim blocked when
   - trying to writeout PG_dirty pages
   - trying to lock PG_writeback pages

Which may be confirmed by running

        ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
or
        echo w > /proc/sysrq-trigger    # and check dmesg

during the stalls. The latter command works more reliably.

Thanks,
Fengguang

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f50.google.com (mail-pb0-f50.google.com [209.85.160.50]) by kanga.kvack.org (Postfix) with ESMTP id 606F16B0036 for ; Fri, 1 Nov 2013 10:31:52 -0400 (EDT)
Received: by mail-pb0-f50.google.com with SMTP id uo5so4329358pbc.23 for ; Fri, 01 Nov 2013 07:31:52 -0700 (PDT)
Received: from psmtp.com ([74.125.245.112]) by mx.google.com with SMTP id hj4si5018658pac.242.2013.11.01.07.31.48 for ; Fri, 01 Nov 2013 07:31:48 -0700 (PDT)
Subject: [PATCH] mm: add strictlimit knob
From: Maxim Patlasov
Date: Fri, 01 Nov 2013 18:31:40 +0400
Message-ID: <20131101142941.1161.40314.stgit@dhcp-10-30-17-2.sw.ru>
In-Reply-To: <20131031142612.GA28003@kipc2.localdomain>
References: <20131031142612.GA28003@kipc2.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: karl.kiniger@med.ge.com
Cc: jack@suse.cz, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, tytso@mit.edu, akpm@linux-foundation.org, fengguang.wu@intel.com, torvalds@linux-foundation.org, mpatlasov@parallels.com

"strictlimit" feature was introduced to enforce per-bdi dirty limits for
FUSE which sets bdi max_ratio to 1% by default:

http://article.gmane.org/gmane.linux.kernel.mm/105809

However the feature can be useful for other relatively slow or untrusted
BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
feature:

echo 1 > /sys/class/bdi/X:Y/strictlimit

Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
dirty limit is not reached. Of course, the effect is not visible until
max_ratio is decreased to some reasonable value.
Signed-off-by: Maxim Patlasov
---
 mm/backing-dev.c |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7..4ee1d64 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(stable_pages_required);
 
+static ssize_t strictlimit_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned int val;
+	ssize_t ret;
+
+	ret = kstrtouint(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	switch (val) {
+	case 0:
+		bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
+		break;
+	case 1:
+		bdi->capabilities |= BDI_CAP_STRICTLIMIT;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return count;
+}
+static ssize_t strictlimit_show(struct device *dev,
+		struct device_attribute *attr, char *page)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE-1, "%d\n",
+			!!(bdi->capabilities & BDI_CAP_STRICTLIMIT));
+}
+static DEVICE_ATTR_RW(strictlimit);
+
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
+	&dev_attr_strictlimit.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(bdi_dev);

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f50.google.com (mail-pa0-f50.google.com [209.85.220.50]) by kanga.kvack.org (Postfix) with ESMTP id BF7AB6B0035 for ; Mon, 4 Nov 2013 17:01:08 -0500 (EST)
Received: by mail-pa0-f50.google.com with SMTP id fb1so7519828pad.37 for ; Mon, 04 Nov 2013 14:01:08 -0800 (PST)
Received: from psmtp.com ([74.125.245.180]) by mx.google.com with SMTP id tu7si9759174pab.162.2013.11.04.14.01.07 for ; Mon, 04 Nov 2013 14:01:07 -0800 (PST)
Date: Mon, 4 Nov 2013 14:01:04 -0800
From: Andrew Morton
Subject: Re: [PATCH] mm: add strictlimit knob
Message-Id: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
In-Reply-To: <20131101142941.1161.40314.stgit@dhcp-10-30-17-2.sw.ru>
References: <20131031142612.GA28003@kipc2.localdomain> <20131101142941.1161.40314.stgit@dhcp-10-30-17-2.sw.ru>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Maxim Patlasov
Cc: karl.kiniger@med.ge.com, jack@suse.cz, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, tytso@mit.edu, fengguang.wu@intel.com, torvalds@linux-foundation.org, mpatlasov@parallels.com

On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov wrote:

> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
> FUSE which sets bdi max_ratio to 1% by default:
>
> http://article.gmane.org/gmane.linux.kernel.mm/105809
>
> However the feature can be useful for other relatively slow or untrusted
> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
> feature:
>
> echo 1 > /sys/class/bdi/X:Y/strictlimit
>
> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
> dirty limit is not reached. Of course, the effect is not visible until
> max_ratio is decreased to some reasonable value.
I suggest replacing "max_ratio" here with the much more informative
"/sys/class/bdi/X:Y/max_ratio".

Also, Documentation/ABI/testing/sysfs-class-bdi will need an update
please.

> mm/backing-dev.c | 35 +++++++++++++++++++++++++++++++++++
> 1 file changed, 35 insertions(+)
>

I'm not really sure what to make of the patch. I assume you tested it
and observed some effect. Could you please describe the test setup and
the effects in some detail?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pd0-f169.google.com (mail-pd0-f169.google.com [209.85.192.169]) by kanga.kvack.org (Postfix) with ESMTP id BC1B66B0035 for ; Mon, 4 Nov 2013 23:12:54 -0500 (EST)
Received: by mail-pd0-f169.google.com with SMTP id q10so7740250pdj.0 for ; Mon, 04 Nov 2013 20:12:54 -0800 (PST)
Received: from psmtp.com ([74.125.245.173]) by mx.google.com with SMTP id qj1si6755567pbc.174.2013.11.04.20.12.51 for ; Mon, 04 Nov 2013 20:12:52 -0800 (PST)
Date: Tue, 5 Nov 2013 15:12:45 +1100
From: Dave Chinner
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131105041245.GY6188@dastard>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andreas Dilger
Cc: "Artem S. Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm

On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
>
> On Oct 25, 2013, at 2:18 AM, Linus Torvalds wrote:
> > On Fri, Oct 25, 2013 at 8:25 AM, Artem S.
Tashkinov wrote:
> >>
> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11
> >> kernel built for the i686 (with PAE) and x86-64 architectures. What's
> >> really troubling me is that the x86-64 kernel has the following problem:
> >>
> >> When I copy large files to any storage device, be it my HDD with ext4
> >> partitions or flash drive with FAT32 partitions, the kernel first
> >> caches them in memory entirely then flushes them some time later
> >> (quite unpredictably though) or immediately upon invoking "sync".
> >
> > Yeah, I think we default to a 10% "dirty background memory" (and
> > allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> > of dirty memory for writeout before we even start writing, and twice
> > that before we start *waiting* for it.
> >
> > On 32-bit x86, we only count the memory in the low 1GB (really
> > actually up to about 890MB), so "10% dirty" really means just about
> > 90MB of buffering (and a "hard limit" of ~180MB of dirty).
> >
> > And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
> > come from the old days of less memory (and perhaps servers that don't
> > much care), and the fact that x86-32 ends up having much lower limits
> > even if you end up having more memory.
>
> I think the "delay writes for a long time" is a holdover from the
> days when e.g. /tmp was on a disk and compilers had lousy IO
> patterns, then they deleted the file. Today, /tmp is always in
> RAM, and IMHO the "write and delete" workload tested by dbench
> is not worthwhile optimizing for.
>
> With Lustre, we've long taken the approach that if there is enough
> dirty data on a file to make a decent write (which is around 8MB
> today even for very fast storage) then there isn't much point to
> hold back for more data before starting the IO.

Agreed - write-through caching is much better for high throughput
streaming data environments than write back caching that can leave
the devices unnecessarily idle.

However, most systems are not running in high-throughput streaming
data environments... :/

> Any decent allocator will be able to grow allocated extents to
> handle following data, or allocate a new extent. At 4-8MB extents,
> even very seek-impaired media could do 400-800MB/s (likely much
> faster than the underlying storage anyway).

True, but this makes the assumption that the filesystem you are
using is optimising purely for write throughput and your storage is
not seek limited on reads. That's simply not an assumption we can
allow the generic writeback code to make.

In more detail, if we simply implement "we have 8 MB of dirty pages
on a single file, write it" we can maximise write throughput by
allocating sequentially on disk for each subsequent write. The
problem with this comes when you are writing multiple files at a
time, and that leads to this pattern on disk:

 ABC...ABC....ABC....ABC....

And the result is a) fragmented files b) a large number of seeks
during sequential read operations and c) filesystems that age and
degrade rapidly under workloads that concurrently write files with
different life times (i.e. due to free space fragmentation).

In some situations this is acceptable, but the performance
degradation as the filesystem ages that this sort of allocation
causes in most environments is not. I'd say that >90% of filesystems
out there would suffer accelerated aging as a result of doing
writeback in this manner by default.

> This also avoids wasting (tens of?) seconds of idle disk bandwidth.
> If the disk is already busy, then the IO will be delayed anyway.
> If it is not busy, then why aggregate GB of dirty data in memory
> before flushing it?

There are plenty of workloads out there where delaying IO for a few
seconds can result in writeback that is an order of magnitude
faster.
Similarly, I've seen other workloads where the writeback
delay results in files that can be *read* orders of magnitude
faster....

> Something simple like "start writing at 16MB dirty on a single file"
> would probably avoid a lot of complexity at little real-world cost.
> That shouldn't throttle dirtying memory above 16MB, but just start
> writeout much earlier than it does today.

That doesn't solve the "slow device, large file" problem. We can
write data into the page cache at rates of over a GB/s, so it's
irrelevant to a device that can write at 5MB/s whether we start
writeback immediately or a second later when there is 500MB of dirty
pages in memory. AFAIK, the only way to avoid that problem is to
use write-through caching for such devices - where they throttle to
the IO rate at very low levels of cached data.

Realistically, there is no "one right answer" for all combinations
of applications, filesystems and hardware, but writeback caching is
the best *general solution* we've got right now.

However, IMO users should not need to care about tuning BDI dirty
ratios or even have to understand what a BDI dirty ratio is to
select the right caching method for their devices and/or workload.
The difference between writeback and write through caching is easy
to explain and AFAICT those two modes suffice to solve the problems
being discussed here. Further, if two modes suffice to solve the
problems, then we should be able to easily define a trigger to
automatically switch modes.

/me notes that if we look at random vs sequential IO and the impact
that has on writeback duration, then it's very similar to suddenly
having a very slow device. IOWs, fadvise(RANDOM) could be used to
switch an *inode* to write through mode rather than writeback mode
to solve the problem of aggregating massive amounts of random write
IO in the page cache...
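
To put rough numbers on the "slow device, large file" case: the time
needed to drain the page cache is just the dirty backlog divided by the
device's sustained write rate. A minimal sketch in C, using the 500MB
and 5MB/s figures from this mail; drain_seconds() is a made-up helper
for illustration, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): estimate how many seconds it
 * takes to write back `dirty_bytes` of page cache to a device that
 * sustains `dev_bytes_per_sec`. The result is rounded up to whole
 * seconds.
 */
uint64_t drain_seconds(uint64_t dirty_bytes, uint64_t dev_bytes_per_sec)
{
	/* A device with a zero write rate would never finish. */
	assert(dev_bytes_per_sec > 0);

	return (dirty_bytes + dev_bytes_per_sec - 1) / dev_bytes_per_sec;
}
```

With 500MB dirty and a 5MB/s device, drain_seconds(500 << 20, 5 << 20)
comes to 100 seconds. Starting writeback one second earlier only gets
about 5MB out of the way, which is the sense in which the start
threshold hardly matters for such a device and throttling the writer
does.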

So rather than treating this as a "one size fits all" type of
problem, let's step back and:

	a) define 2-3 different caching behaviours we consider
	   optimal for the majority of workloads/hardware we care
	   about.
	b) determine optimal workloads for each caching
	   behaviour.
	c) develop reliable triggers to detect when we
	   should switch between caching behaviours.

e.g:

	a) write back caching
		- what we have now
	   write through caching
		- extremely low dirty threshold before writeback
		  starts, enough to optimise for, say, stripe width
		  of the underlying storage.

	b) write back caching:
		- general purpose workload
	   write through caching:
		- slow device, write large file, sync
		- extremely high bandwidth devices, multi-stream
		  sequential IO
		- random IO.

	c) write back caching:
		- default
		- fadvise(NORMAL, SEQUENTIAL, WILLNEED)
	   write through caching:
		- fadvise(NOREUSE, DONTNEED, RANDOM)
		- random IO
		- sequential IO, BDI write bandwidth <<< dirty threshold
		- sequential IO, BDI write bandwidth >>> dirty threshold

I think that covers most of the issues and use cases that have been
discussed in this thread. IMO, this is the level at which we need to
solve the problem (i.e. architectural), not at the level of "let's
add sysfs variables so we can tweak bdi ratios".

Indeed, the above implies that we need the caching behaviour to be a
property of the address space, not just a property of the backing
device.

IOWs, the implementation needs to trickle down from a coherent high
level design - that will define the knobs that we need to expose to
userspace. We should not be adding new writeback behaviours by
adding knobs to sysfs without first having some clue about whether
we are solving the right problem and solving it in a sane manner...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pb0-f45.google.com (mail-pb0-f45.google.com [209.85.160.45]) by kanga.kvack.org (Postfix) with ESMTP id 5B6836B00DC for ; Wed, 6 Nov 2013 09:30:13 -0500 (EST)
Received: by mail-pb0-f45.google.com with SMTP id ma3so9026566pbc.4 for ; Wed, 06 Nov 2013 06:30:12 -0800 (PST)
Received: from psmtp.com ([74.125.245.172]) by mx.google.com with SMTP id pz2si17515271pac.202.2013.11.06.06.30.07 for ; Wed, 06 Nov 2013 06:30:09 -0800 (PST)
Message-ID: <527A5269.7040900@parallels.com>
Date: Wed, 6 Nov 2013 18:30:01 +0400
From: Maxim Patlasov
MIME-Version: 1.0
Subject: Re: [PATCH] mm: add strictlimit knob
References: <20131031142612.GA28003@kipc2.localdomain> <20131101142941.1161.40314.stgit@dhcp-10-30-17-2.sw.ru> <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
In-Reply-To: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton
Cc: karl.kiniger@med.ge.com, jack@suse.cz, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, tytso@mit.edu, fengguang.wu@intel.com, torvalds@linux-foundation.org

Hi Andrew,

On 11/05/2013 02:01 AM, Andrew Morton wrote:
> On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov wrote:
>
>> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
>> FUSE which sets bdi max_ratio to 1% by default:
>>
>> http://article.gmane.org/gmane.linux.kernel.mm/105809
>>
>> However the feature can be useful for other relatively slow or untrusted
>> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
>> feature:
>>
>> echo 1 > /sys/class/bdi/X:Y/strictlimit
>>
>> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
>> dirty limit is not reached.
Of course, the effect is not visible until
>> max_ratio is decreased to some reasonable value.
> I suggest replacing "max_ratio" here with the much more informative
> "/sys/class/bdi/X:Y/max_ratio".
>
> Also, Documentation/ABI/testing/sysfs-class-bdi will need an update
> please.

OK, I'll update it, fix the patch description and re-send the patch.

>
>> mm/backing-dev.c | 35 +++++++++++++++++++++++++++++++++++
>> 1 file changed, 35 insertions(+)
>>
> I'm not really sure what to make of the patch. I assume you tested it
> and observed some effect. Could you please describe the test setup and
> the effects in some detail?

I plugged a 16GB USB flash drive into a node with 8GB RAM running 3.12.0-rc7
and started writing a huge file with "dd" (from /dev/zero to the USB flash
mount point). While writing I was observing the "Dirty" counter as reported
by /proc/meminfo. As expected, it stabilized at a level of about 1.2GB (15%
of total RAM). Immediately after dd completed, the "umount" command took
about 5 minutes. This corresponds to about 5MB/s write throughput of the
flash drive.

Then I repeated the experiment after setting tunables:

echo 1 > /sys/class/bdi/8\:16/max_ratio
echo 1 > /sys/class/bdi/8\:16/strictlimit

This time, the "Dirty" counter was 100 times smaller, about 12MB, and
"umount" took about a second.

Thanks,
Maxim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f43.google.com (mail-pa0-f43.google.com [209.85.220.43]) by kanga.kvack.org (Postfix) with ESMTP id 117306B00E2 for ; Wed, 6 Nov 2013 10:06:10 -0500 (EST)
Received: by mail-pa0-f43.google.com with SMTP id hz1so10604404pad.2 for ; Wed, 06 Nov 2013 07:06:10 -0800 (PST)
Received: from psmtp.com ([74.125.245.169]) by mx.google.com with SMTP id j10si17638718pac.54.2013.11.06.07.06.06 for ; Wed, 06 Nov 2013 07:06:08 -0800 (PST)
Subject: [PATCH] mm: add strictlimit knob -v2
From: Maxim Patlasov
Date: Wed, 06 Nov 2013 19:05:57 +0400
Message-ID: <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru>
In-Reply-To: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
References: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: akpm@linux-foundation.org
Cc: karl.kiniger@med.ge.com, tytso@mit.edu, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, jack@suse.cz, fengguang.wu@intel.com, torvalds@linux-foundation.org, mpatlasov@parallels.com

"strictlimit" feature was introduced to enforce per-bdi dirty limits for
FUSE which sets bdi max_ratio to 1% by default:

http://article.gmane.org/gmane.linux.kernel.mm/105809

However the feature can be useful for other relatively slow or untrusted
BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
feature:

echo 1 > /sys/class/bdi/X:Y/strictlimit

Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
dirty limit is not reached. Of course, the effect is not visible until
/sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value.
Changed in v2:
 - updated patch description and documentation

Signed-off-by: Maxim Patlasov
---
 Documentation/ABI/testing/sysfs-class-bdi |    8 +++++++
 mm/backing-dev.c                          |   35 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d56..3187a18 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -53,3 +53,11 @@ stable_pages_required (read-only)
 
 	If set, the backing device requires that all pages comprising a write
 	request must not be changed until writeout is complete.
+
+strictlimit (read-write)
+
+	Forces per-BDI checks for the share of given device in the write-back
+	cache even before the global background dirty limit is reached. This
+	is useful in situations where the global limit is much higher than
+	affordable for given relatively slow (or untrusted) device. Turning
+	strictlimit on has no visible effect if max_ratio is equal to 100%.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ce682f7..4ee1d64 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev, } static DEVICE_ATTR_RO(stable_pages_required); +static ssize_t strictlimit_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned int val; + ssize_t ret; + + ret = kstrtouint(buf, 10, &val); + if (ret < 0) + return ret; + + switch (val) { + case 0: + bdi->capabilities &= ~BDI_CAP_STRICTLIMIT; + break; + case 1: + bdi->capabilities |= BDI_CAP_STRICTLIMIT; + break; + default: + return -EINVAL; + } + + return count; +} +static ssize_t strictlimit_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + + return snprintf(page, PAGE_SIZE-1, "%d\n", + !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); +} +static DEVICE_ATTR_RW(strictlimit); + static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_max_ratio.attr, &dev_attr_stable_pages_required.attr, + &dev_attr_strictlimit.attr, NULL, }; ATTRIBUTE_GROUPS(bdi_dev); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f178.google.com (mail-pd0-f178.google.com [209.85.192.178]) by kanga.kvack.org (Postfix) with ESMTP id B9E186B0154 for ; Thu, 7 Nov 2013 07:27:08 -0500 (EST) Received: by mail-pd0-f178.google.com with SMTP id x10so535452pdj.9 for ; Thu, 07 Nov 2013 04:27:08 -0800 (PST) Received: from psmtp.com ([74.125.245.199]) by mx.google.com with SMTP id d2si2835121pac.213.2013.11.07.04.27.05 for ; Thu, 07 Nov 2013 04:27:06 -0800 (PST) Date: Thu, 7 Nov 2013 10:26:58 -0200 From: Henrique de Moraes Holschuh Subject: Re: [PATCH] mm: add strictlimit knob -v2 Message-ID: <20131107122658.GA3355@khazad-dum.debian.net> References: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org> <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru> Sender: owner-linux-mm@kvack.org List-ID: To: Maxim Patlasov Cc: akpm@linux-foundation.org, karl.kiniger@med.ge.com, tytso@mit.edu, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, jack@suse.cz, fengguang.wu@intel.com, torvalds@linux-foundation.org Is there a reason to not enforce strictlimit by default? -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52]) by kanga.kvack.org (Postfix) with ESMTP id A93856B015A for ; Thu, 7 Nov 2013 08:48:14 -0500 (EST) Received: by mail-pa0-f52.google.com with SMTP id bj1so626076pad.39 for ; Thu, 07 Nov 2013 05:48:14 -0800 (PST) Received: from psmtp.com ([74.125.245.132]) by mx.google.com with SMTP id dj3si2692052pbc.250.2013.11.07.05.48.11 for ; Thu, 07 Nov 2013 05:48:12 -0800 (PST) Date: Thu, 7 Nov 2013 14:48:06 +0100 From: Jan Kara Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131107134806.GB30832@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> <20131105041245.GY6188@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20131105041245.GY6188@dastard> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Andreas Dilger , "Artem S. Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm On Tue 05-11-13 15:12:45, Dave Chinner wrote: > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > Something simple like "start writing at 16MB dirty on a single file" > > would probably avoid a lot of complexity at little real-world cost. > > That shouldn't throttle dirtying memory above 16MB, but just start > > writeout much earlier than it does today. > > That doesn't solve the "slow device, large file" problem. We can > write data into the page cache at rates of over a GB/s, so it's > irrelevant to a device that can write at 5MB/s whether we start > writeback immediately or a second later when there is 500MB of dirty > pages in memory. 
AFAIK, the only way to avoid that problem is to > use write-through caching for such devices - where they throttle to > the IO rate at very low levels of cached data. Agreed. > Realistically, there is no "one right answer" for all combinations > of applications, filesystems and hardware, but writeback caching is > the best *general solution* we've got right now. > > However, IMO users should not need to care about tuning BDI dirty > ratios or even have to understand what a BDI dirty ratio is to > select the right caching method for their devices and/or workload. > The difference between writeback and write through caching is easy > to explain and AFAICT those two modes suffice to solve the problems > being discussed here. Further, if two modes suffice to solve the > problems, then we should be able to easily define a trigger to > automatically switch modes. > > /me notes that if we look at random vs sequential IO and the impact > that has on writeback duration, then it's very similar to suddenly > having a very slow device. IOWs, fadvise(RANDOM) could be used to > switch an *inode* to write through mode rather than writeback mode > to solve the problem aggregating massive amounts of random write IO > in the page cache... I disagree here. Writeback cache is also useful for aggregating random writes and making semi-sequential writes out of them. There are quite some applications which rely on the fact that they can write a file in a rather random manner (Berkeley DB, linker, ...) but the files are written out in one large linear sweep. That is actually the reason why SLES (and I believe RHEL as well) tune dirty_limit even higher than what's the default value. So I think it's rather the other way around: If you can detect the file is being written in a streaming manner, there's not much point in caching too much data for it. And I agree with you that we also have to be careful not to cache too few because otherwise two streaming writes would be interleaved too much. 
Currently, we have writeback_chunk_size() which determines how much we ask to write from a single inode. So streaming writers are going to be interleaved at this chunk size anyway (currently that number is "measured bandwidth / 2"). So it would make sense to also limit amount of dirty cache for each file with streaming pattern at this number. > So rather than treating this as a "one size fits all" type of > problem, let's step back and: > > a) define 2-3 different caching behaviours we consider > optimal for the majority of workloads/hardware we care > about. > b) determine optimal workloads for each caching > behaviour. > c) develop reliable triggers to detect when we > should switch between caching behaviours. > > e.g: > > a) write back caching > - what we have now > write through caching > - extremely low dirty threshold before writeback > starts, enough to optimise for, say, stripe width > of the underlying storage. > > b) write back caching: > - general purpose workload > write through caching: > - slow device, write large file, sync > - extremely high bandwidth devices, multi-stream > sequential IO > - random IO. > > c) write back caching: > - default > - fadvise(NORMAL, SEQUENTIAL, WILLNEED) > write through caching: > - fadvise(NOREUSE, DONTNEED, RANDOM) > - random IO > - sequential IO, BDI write bandwidth <<< dirty threshold > - sequential IO, BDI write bandwidth >>> dirty threshold > > I think that covers most of the issues and use cases that have been > discussed in this thread. IMO, this is the level at which we need to > solve the problem (i.e. architectural), not at the level of "let's > add sysfs variables so we can tweak bdi ratios". > > Indeed, the above implies that we need the caching behaviour to be a > property of the address space, not just a property of the backing > device. Yes, and that would be interesting to implement and not make a mess out of the whole writeback logic because the way we currently do writeback is inherently BDI based. 
When we introduce some special per-inode limits, flusher threads would have to pick more carefully what to write and what not. We might be forced to go that way eventually anyway because of memcg aware writeback but it's not a simple step. > IOWs, the implementation needs to trickle down from a coherent high > level design - that will define the knobs that we need to expose to > userspace. We should not be adding new writeback behaviours by > adding knobs to sysfs without first having some clue about whether > we are solving the right problem and solving it in a sane manner... Agreed. But the ability to limit amount of dirty pages outstanding against a particular BDI seems as a sane one to me. It's not as flexible and automatic as the approach you suggested but it's much simpler and solves most of problems we currently have. The biggest objection against the sysfs-tunable approach is that most people won't have a clue meaning that the tunable is useless for them. But I wonder if something like: 1) turn on strictlimit by default 2) don't allow dirty cache of BDI to grow over 5s of measured writeback speed won't go a long way into solving our current problems without too much complication... Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f53.google.com (mail-pb0-f53.google.com [209.85.160.53]) by kanga.kvack.org (Postfix) with ESMTP id 834F56B00B4 for ; Sun, 10 Nov 2013 22:22:39 -0500 (EST) Received: by mail-pb0-f53.google.com with SMTP id up7so4578111pbc.40 for ; Sun, 10 Nov 2013 19:22:39 -0800 (PST) Received: from psmtp.com ([74.125.245.188]) by mx.google.com with SMTP id gj2si14698398pac.312.2013.11.10.19.22.36 for ; Sun, 10 Nov 2013 19:22:38 -0800 (PST) Date: Mon, 11 Nov 2013 14:22:11 +1100 From: Dave Chinner Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131111032211.GT6188@dastard> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> <20131105041245.GY6188@dastard> <20131107134806.GB30832@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20131107134806.GB30832@quack.suse.cz> Sender: owner-linux-mm@kvack.org List-ID: To: Jan Kara Cc: Andreas Dilger , "Artem S. Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote: > On Tue 05-11-13 15:12:45, Dave Chinner wrote: > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > > Something simple like "start writing at 16MB dirty on a single file" > > > would probably avoid a lot of complexity at little real-world cost. > > > That shouldn't throttle dirtying memory above 16MB, but just start > > > writeout much earlier than it does today. > > > > That doesn't solve the "slow device, large file" problem. 
We can > > write data into the page cache at rates of over a GB/s, so it's > > irrelevant to a device that can write at 5MB/s whether we start > > writeback immediately or a second later when there is 500MB of dirty > > pages in memory. AFAIK, the only way to avoid that problem is to > > use write-through caching for such devices - where they throttle to > > the IO rate at very low levels of cached data. > Agreed. > > > Realistically, there is no "one right answer" for all combinations > > of applications, filesystems and hardware, but writeback caching is > > the best *general solution* we've got right now. > > > > However, IMO users should not need to care about tuning BDI dirty > > ratios or even have to understand what a BDI dirty ratio is to > > select the right caching method for their devices and/or workload. > > The difference between writeback and write through caching is easy > > to explain and AFAICT those two modes suffice to solve the problems > > being discussed here. Further, if two modes suffice to solve the > > problems, then we should be able to easily define a trigger to > > automatically switch modes. > > > > /me notes that if we look at random vs sequential IO and the impact > > that has on writeback duration, then it's very similar to suddenly > > having a very slow device. IOWs, fadvise(RANDOM) could be used to > > switch an *inode* to write through mode rather than writeback mode > > to solve the problem aggregating massive amounts of random write IO > > in the page cache... > I disagree here. Writeback cache is also useful for aggregating random > writes and making semi-sequential writes out of them. There are quite some > applications which rely on the fact that they can write a file in a rather > random manner (Berkeley DB, linker, ...) but the files are written out in > one large linear sweep. That is actually the reason why SLES (and I believe > RHEL as well) tune dirty_limit even higher than what's the default value. 
Right - but the correct behaviour really depends on the pattern of randomness. The common case we get into trouble with is when no clustering occurs and we end up with small, random IO for gigabytes of cached data. That's the case where write-through caching for random data is better. It's also questionable whether writeback caching for aggregation is faster for random IO on high-IOPS devices or not. Again, I think it would depend very much on how random the patterns are... > So I think it's rather the other way around: If you can detect the file is > being written in a streaming manner, there's not much point in caching too > much data for it. But we're not talking about how much data we cache here - we are considering how much data we allow to get dirty before writing it back. It doesn't matter if we use writeback or write through caching, the page cache footprint for a given workload is likely to be similar, but without any data we can't draw any conclusions here. > And I agree with you that we also have to be careful not > to cache too few because otherwise two streaming writes would be > interleaved too much. Currently, we have writeback_chunk_size() which > determines how much we ask to write from a single inode. So streaming > writers are going to be interleaved at this chunk size anyway (currently > that number is "measured bandwidth / 2"). So it would make sense to also > limit amount of dirty cache for each file with streaming pattern at this > number. My experience says that for streaming IO we typically need at least 5s of cached *dirty* data to even out delays and latencies in the writeback IO pipeline. Hence limiting a file to what we can write in a second given we might only write a file once a second is likely going to result in pipeline stalls... Remember, writeback caching is about maximising throughput, not minimising latency. 
The "sync latency" problem with caching too much dirty data on slow block devices is really a corner case behaviour and should not compromise the common case for bulk writeback throughput. > > Indeed, the above implies that we need the caching behaviour to be a > > property of the address space, not just a property of the backing > > device. > Yes, and that would be interesting to implement and not make a mess out > of the whole writeback logic because the way we currently do writeback is > inherently BDI based. When we introduce some special per-inode limits, > flusher threads would have to pick more carefully what to write and what > not. We might be forced to go that way eventually anyway because of memcg > aware writeback but it's not a simple step. Agreed, it's not simple, and that's why we need to start working from the architectural level.... > > IOWs, the implementation needs to trickle down from a coherent high > > level design - that will define the knobs that we need to expose to > > userspace. We should not be adding new writeback behaviours by > > adding knobs to sysfs without first having some clue about whether > > we are solving the right problem and solving it in a sane manner... > Agreed. But the ability to limit amount of dirty pages outstanding > against a particular BDI seems as a sane one to me. It's not as flexible > and automatic as the approach you suggested but it's much simpler and > solves most of problems we currently have. That's true, but.... > The biggest objection against the sysfs-tunable approach is that most > people won't have a clue meaning that the tunable is useless for them. .... that's the big problem I see - nobody is going to know how to use it, when to use it, or be able to tell if it's the root cause of some weird performance problem they are seeing. 
> But I > wonder if something like: > 1) turn on strictlimit by default > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback > speed > > won't go a long way into solving our current problems without too much > complication... Turning on strict limit by default is going to change behaviour quite markedly. Again, it's not something I'd want to see done without a bunch of data showing that it doesn't cause regressions for common workloads... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f170.google.com (mail-pd0-f170.google.com [209.85.192.170]) by kanga.kvack.org (Postfix) with ESMTP id 7B2716B0035 for ; Fri, 22 Nov 2013 18:45:08 -0500 (EST) Received: by mail-pd0-f170.google.com with SMTP id g10so1906644pdj.15 for ; Fri, 22 Nov 2013 15:45:08 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. 
[140.211.169.12]) by mx.google.com with ESMTP id hb3si20978549pac.7.2013.11.22.15.45.06 for ; Fri, 22 Nov 2013 15:45:07 -0800 (PST) Date: Fri, 22 Nov 2013 15:45:05 -0800 From: Andrew Morton Subject: Re: [PATCH] mm: add strictlimit knob -v2 Message-Id: <20131122154505.3e686fcfc584534d555399e5@linux-foundation.org> In-Reply-To: <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru> References: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org> <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Maxim Patlasov Cc: karl.kiniger@med.ge.com, tytso@mit.edu, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, jack@suse.cz, fengguang.wu@intel.com, torvalds@linux-foundation.org, mpatlasov@parallels.com On Wed, 06 Nov 2013 19:05:57 +0400 Maxim Patlasov wrote: > "strictlimit" feature was introduced to enforce per-bdi dirty limits for > FUSE which sets bdi max_ratio to 1% by default: > > http://article.gmane.org/gmane.linux.kernel.mm/105809 > > However the feature can be useful for other relatively slow or untrusted > BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the > feature: > > echo 1 > /sys/class/bdi/X:Y/strictlimit > > Being enabled, the feature enforces bdi max_ratio limit even if global (10%) > dirty limit is not reached. Of course, the effect is not visible until > /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value. > > ... > > --- a/Documentation/ABI/testing/sysfs-class-bdi > +++ b/Documentation/ABI/testing/sysfs-class-bdi > @@ -53,3 +53,11 @@ stable_pages_required (read-only) > > If set, the backing device requires that all pages comprising a write > request must not be changed until writeout is complete. 
> + > +strictlimit (read-write) > + > + Forces per-BDI checks for the share of given device in the write-back > + cache even before the global background dirty limit is reached. This > + is useful in situations where the global limit is much higher than > + affordable for given relatively slow (or untrusted) device. Turning > + strictlimit on has no visible effect if max_ratio is equal to 100%. > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index ce682f7..4ee1d64 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev, > } > static DEVICE_ATTR_RO(stable_pages_required); > > +static ssize_t strictlimit_store(struct device *dev, > + struct device_attribute *attr, const char *buf, size_t count) > +{ > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > + unsigned int val; > + ssize_t ret; > + > + ret = kstrtouint(buf, 10, &val); > + if (ret < 0) > + return ret; > + > + switch (val) { > + case 0: > + bdi->capabilities &= ~BDI_CAP_STRICTLIMIT; > + break; > + case 1: > + bdi->capabilities |= BDI_CAP_STRICTLIMIT; > + break; > + default: > + return -EINVAL; > + } > + > + return count; > +} > +static ssize_t strictlimit_show(struct device *dev, > + struct device_attribute *attr, char *page) > +{ > + struct backing_dev_info *bdi = dev_get_drvdata(dev); > + > + return snprintf(page, PAGE_SIZE-1, "%d\n", > + !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); > +} > +static DEVICE_ATTR_RW(strictlimit); > + > static struct attribute *bdi_dev_attrs[] = { > &dev_attr_read_ahead_kb.attr, > &dev_attr_min_ratio.attr, > &dev_attr_max_ratio.attr, > &dev_attr_stable_pages_required.attr, > + &dev_attr_strictlimit.attr, > NULL, Well the patch is certainly simple and straightforward enough and *seems* like it will be useful. The main (and large!) downside is that it adds to the user interface so we'll have to maintain this feature and its functionality for ever. 
Given this, my concern is that while potentially useful, the feature might not be *sufficiently* useful to justify its inclusion. So we'll end up addressing these issues by other means, then we're left maintaining this obsolete legacy feature. So I'm thinking that unless someone can show that this is good and complete and sufficient for a "large enough" set of issues, I'll take a pass on the patch[1]. What do people think? [1] Actually, I'll stick it in -mm and maintain it, so next time someone reports an issue I can say "hey, try this". -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751616Ab3JYHZR (ORCPT ); Fri, 25 Oct 2013 03:25:17 -0400 Received: from smtprelay0046.b.hostedemail.com ([64.98.42.46]:41950 "EHLO smtprelay.b.hostedemail.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751182Ab3JYHZQ (ORCPT ); Fri, 25 Oct 2013 03:25:16 -0400 X-Session-Marker: 742E617274656D406C79636F732E636F6D X-Spam-Summary: 50,0,0,,d41d8cd98f00b204,t.artem@lycos.com,:::::::::,RULES_HIT:41:152:355:379:582:966:967:973:988:989:1152:1260:1277:1311:1313:1314:1345:1437:1515:1516:1518:1534:1541:1593:1594:1711:1730:1747:1777:1792:2194:2196:2199:2200:2393:2525:2560:2563:2682:2685:2859:2933:2937:2939:2942:2945:2947:2951:2954:3022:3138:3139:3140:3141:3142:3308:3352:3421:3865:3866:3867:3868:3870:3871:3872:3874:3934:3936:3938:3941:3944:3947:3950:3953:3956:3959:4250:4361:4385:5007:6119:6261:6630:6691:7875:7903:8603:9025:9040:9108:10004:10400:10450:10455:10848:11658:11914:12043:12517:12519:12555:12663:12698:12737:13069:13071:13160:13161:13166:13229:13311:13357:19904:19999,0,RBL:none,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0 X-HE-Tag: 
story63_85045af207225 X-Filterd-Recvd-Size: 2073 Date: Fri, 25 Oct 2013 07:25:13 +0000 (UTC) From: "Artem S. Tashkinov" To: linux-kernel@vger.kernel.org Cc: torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org Message-ID: <160824051.3072.1382685914055.JavaMail.mail@webmail07> Subject: Disabling in-memory write cache for x86-64 in Linux II MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [46.147.29.47] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello! On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel built for the i686 (with PAE) and x86-64 architectures. What's really troubling me is that the x86-64 kernel has the following problem: When I copy large files to any storage device, be it my HDD with ext4 partitions or flash drive with FAT32 partitions, the kernel first caches them in memory entirely then flushes them some time later (quite unpredictably though) or immediately upon invoking "sync". How can I disable this memory cache altogether (or at least minimize caching)? When running the i686 kernel with the same configuration I don't observe this effect - files get written out almost immediately (for instance "sync" takes less than a second, whereas on x86-64 it can take a dozen of _minutes_ depending on a file size and storage performance). I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX) - firstly this command is detrimental to the performance of my PC, secondly, it won't help in this instance. Swap is totally disabled, usually my memory is entirely free. My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531 Please, advise. 
Best regards, Artem From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752057Ab3JYISx (ORCPT ); Fri, 25 Oct 2013 04:18:53 -0400 Received: from mail-vb0-f47.google.com ([209.85.212.47]:40277 "EHLO mail-vb0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751780Ab3JYISu (ORCPT ); Fri, 25 Oct 2013 04:18:50 -0400 MIME-Version: 1.0 In-Reply-To: <160824051.3072.1382685914055.JavaMail.mail@webmail07> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> Date: Fri, 25 Oct 2013 09:18:49 +0100 X-Google-Sender-Auth: udpq2D-v27I7c40sU0WFfHjjI1w Message-ID: Subject: Re: Disabling in-memory write cache for x86-64 in Linux II From: Linus Torvalds To: "Artem S. Tashkinov" , Wu Fengguang , Andrew Morton Cc: Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote: > > On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel > built for the i686 (with PAE) and x86-64 architectures. What's really troubling me > is that the x86-64 kernel has the following problem: > > When I copy large files to any storage device, be it my HDD with ext4 partitions > or flash drive with FAT32 partitions, the kernel first caches them in memory entirely > then flushes them some time later (quite unpredictably though) or immediately upon > invoking "sync". Yeah, I think we default to a 10% "dirty background memory" (and allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB of dirty memory for writeout before we even start writing, and twice that before we start *waiting* for it. 
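For reference, the arithmetic behind these figures can be sketched in shell. This is only an illustration under stated assumptions: the helper name `ratio_to_bytes` is hypothetical, and it applies the percentage to the memory size it is given, whereas the kernel applies the ratio to dirtyable (free plus reclaimable) memory, so real thresholds come out somewhat lower.

```shell
#!/bin/sh
# ratio_to_bytes MEM_MB PERCENT
# Hypothetical helper: prints PERCENT% of MEM_MB megabytes, in bytes.
# Simplification: uses the given memory size directly; the kernel applies
# the ratio to dirtyable (free + reclaimable) memory instead.
ratio_to_bytes() {
    echo $(( $1 * 1024 * 1024 * $2 / 100 ))
}

# 16GB x86-64 machine with the default 10% background / 20% hard ratios
bg_64=$(ratio_to_bytes 16384 10)
hard_64=$(ratio_to_bytes 16384 20)

# 32-bit x86: only about 890MB of low memory is counted
bg_32=$(ratio_to_bytes 890 10)
hard_32=$(ratio_to_bytes 890 20)

echo "x86-64: background $((bg_64 / 1024 / 1024)) MB, hard limit $((hard_64 / 1024 / 1024)) MB"
echo "x86-32: background $((bg_32 / 1024 / 1024)) MB, hard limit $((hard_32 / 1024 / 1024)) MB"
```

The results land close to the 1.6GB and 3.2GB figures quoted for the 64-bit case, and to roughly 90MB and 180MB for the 32-bit case.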
On 32-bit x86, we only count the memory in the low 1GB (really actually up to about 890MB), so "10% dirty" really means just about 90MB of buffering (and a "hard limit" of ~180MB of dirty). And that "up to 3.2GB of dirty memory" is just crazy. Our defaults come from the old days of less memory (and perhaps servers that don't much care), and the fact that x86-32 ends up having much lower limits even if you end up having more memory. You can easily tune it: echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes or similar. But you're right, we need to make the defaults much saner. Wu? Andrew? Comments? Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752294Ab3JYIa5 (ORCPT ); Fri, 25 Oct 2013 04:30:57 -0400 Received: from smtprelay0079.b.hostedemail.com ([64.98.42.79]:56648 "EHLO smtprelay.b.hostedemail.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751464Ab3JYIaz (ORCPT ); Fri, 25 Oct 2013 04:30:55 -0400 X-Session-Marker: 742E617274656D406C79636F732E636F6D X-Spam-Summary: 2,0,0,,d41d8cd98f00b204,t.artem@lycos.com,:::::::,RULES_HIT:41:152:355:379:421:467:582:599:973:988:989:1152:1260:1277:1311:1313:1314:1345:1373:1437:1515:1516:1518:1534:1542:1593:1594:1711:1730:1747:1777:1792:2393:2553:2559:2562:3138:3139:3140:3141:3142:3354:3622:3865:3866:3867:3868:3870:3871:3872:3873:3874:4250:4321:5007:6119:6261:6630:6691:7903:10004:10226:10400:10848:11026:11232:11658:11914:12043:12050:12438:12517:12519:12740:13160:13229,0,RBL:none,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0 X-HE-Tag: wheel45_7c40367d11360 X-Filterd-Recvd-Size: 3140 Date: Fri, 25 Oct 2013 08:30:53 +0000 (UTC) From: "Artem S. 
Tashkinov" To: torvalds@linux-foundation.org Cc: fengguang.wu@intel.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org Message-ID: <1814253454.3449.1382689853825.JavaMail.mail@webmail07> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> Subject: Re: Disabling in-memory write cache for x86-64 in Linux II MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [46.147.29.47] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Oct 25, 2013 02:18:50 PM, Linus Torvalds wrote: On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote: >> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel >> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me >> is that the x86-64 kernel has the following problem: >> >> When I copy large files to any storage device, be it my HDD with ext4 partitions >> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely >> then flushes them some time later (quite unpredictably though) or immediately upon >> invoking "sync". > >Yeah, I think we default to a 10% "dirty background memory" (and >allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB >of dirty memory for writeout before we even start writing, and twice >that before we start *waiting* for it. > >On 32-bit x86, we only count the memory in the low 1GB (really >actually up to about 890MB), so "10% dirty" really means just about >90MB of buffering (and a "hard limit" of ~180MB of dirty). > >And that "up to 3.2GB of dirty memory" is just crazy. Our defaults >come from the old days of less memory (and perhaps servers that don't >much care), and the fact that x86-32 ends up having much lower limits >even if you end up having more memory. 
> >You can easily tune it: > > echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes > echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes > >or similar. But you're right, we need to make the defaults much saner. > >Wu? Andrew? Comments? > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or more) this value becomes unrealistic (13GB) and I've already had some unpleasant effects due to it. I.e. when I dump a large MySQL database (its dump weighs around 10GB) - it appears on the disk almost immediately, but then, later, when the kernel decides to flush it to the disk, the server almost stalls and other IO requests take a lot more time to complete even though mysqldump is run with ionice -c3, so the use of ionice has no real effect. Artem From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752305Ab3JYIni (ORCPT ); Fri, 25 Oct 2013 04:43:38 -0400 Received: from mail-vb0-f45.google.com ([209.85.212.45]:55649 "EHLO mail-vb0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751464Ab3JYIng (ORCPT ); Fri, 25 Oct 2013 04:43:36 -0400 MIME-Version: 1.0 In-Reply-To: <1814253454.3449.1382689853825.JavaMail.mail@webmail07> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> Date: Fri, 25 Oct 2013 09:43:36 +0100 X-Google-Sender-Auth: hf4643Yk54gwMwzkFtnmMGOjVVc Message-ID: Subject: Re: Disabling in-memory write cache for x86-64 in Linux II From: Linus Torvalds To: "Artem S. Tashkinov" Cc: Wu Fengguang , Andrew Morton , Linux Kernel Mailing List Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 25, 2013 at 9:30 AM, Artem S. 
Tashkinov wrote: > > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or > more) this value becomes unrealistic (13GB) and I've already had some > unpleasant effects due to it. Right. The percentage notion really goes back to the days when we typically had 8-64 *megabytes* of memory. So if you had an 8MB machine you wouldn't want to have more than one megabyte of dirty data, but if you were "Mr Moneybags" and could afford 64MB, you might want to have up to 8MB dirty!! Things have changed. So I would suggest we change the defaults. Or perhaps make the rule be that "the ratio numbers are 'ratio of memory up to 1GB'", to make the semantics similar across 32-bit HIGHMEM machines and 64-bit machines. The modern way of expressing the dirty limits is to give the actual absolute byte amounts, but we default to the legacy ratio mode. Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752595Ab3JYJP0 (ORCPT ); Fri, 25 Oct 2013 05:15:26 -0400 Received: from exprod5og105.obsmtp.com ([64.18.0.180]:32824 "EHLO exprod5og105.obsmtp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751610Ab3JYJPZ (ORCPT ); Fri, 25 Oct 2013 05:15:25 -0400 Date: Fri, 25 Oct 2013 11:15:55 +0200 From: Karl Kiniger To: Linus Torvalds Cc: "Artem S. 
Tashkinov" , Wu Fengguang , Andrew Morton , Linux Kernel Mailing List Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131025091555.GA30895@kipc2.localdomain> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-GEHealthcare-MailScanner: Found to be clean X-GEHealthcare-MailScanner-From: karl.kiniger@med.ge.com Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 131025, Linus Torvalds wrote: > On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov wrote: > > > > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be > > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or > > more) this value becomes unrealistic (13GB) and I've already had some > > unpleasant effects due to it. > > Right. The percentage notion really goes back to the days when we > typically had 8-64 *megabytes* of memory So if you had a 8MB machine > you wouldn't want to have more than one megabyte of dirty data, but if > you were "Mr Moneybags" and could afford 64MB, you might want to have > up to 8MB dirty!! > > Things have changed. > > So I would suggest we change the defaults. Or pwehaps make the rule be > that "the ratio numbers are 'ratio of memory up to 1GB'", to make the > semantics similar across 32-bit HIGHMEM machines and 64-bit machines. > > The modern way of expressing the dirty limits are to give the actual > absolute byte amounts, but we default to the legacy ratio mode.. > > Linus Is it currently possible to somehow set above values per block device? I want default behaviour for almost everything but DVD drives in DVD+RW packet writing mode may easily take several minutes in case of a sync. 
Karl From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752736Ab3JYJSv (ORCPT ); Fri, 25 Oct 2013 05:18:51 -0400 Received: from imap.thunk.org ([74.207.234.97]:50532 "EHLO imap.thunk.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751855Ab3JYJSu (ORCPT ); Fri, 25 Oct 2013 05:18:50 -0400 Date: Fri, 25 Oct 2013 05:18:42 -0400 From: "Theodore Ts'o" To: "Artem S. Tashkinov" Cc: torvalds@linux-foundation.org, fengguang.wu@intel.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131025091842.GA28681@thunk.org> Mail-Followup-To: Theodore Ts'o , "Artem S. Tashkinov" , torvalds@linux-foundation.org, fengguang.wu@intel.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1814253454.3449.1382689853825.JavaMail.mail@webmail07> User-Agent: Mutt/1.5.21 (2010-09-15) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on imap.thunk.org); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 25, 2013 at 08:30:53AM +0000, Artem S. Tashkinov wrote: > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or > more) this value becomes unrealistic (13GB) and I've already had some > unpleasant effects due to it. What I think would make sense is to dynamically measure the speed of writeback, so that we can set these limits as a function of the device speed. 
It's already the case that the writeback limits don't make sense on a slow USB 2.0 storage stick; I suspect that for really huge RAID arrays or very fast flash devices, it doesn't make much sense either. The problem is that if you have a system that has *both* a USB stick _and_ a fast flash/RAID storage array both needing writeback, this doesn't work well --- but what we have right now doesn't work all that well anyway. - Ted From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752658Ab3JYJ2l (ORCPT ); Fri, 25 Oct 2013 05:28:41 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:49964 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751967Ab3JYJ2k (ORCPT ); Fri, 25 Oct 2013 05:28:40 -0400 Date: Fri, 25 Oct 2013 02:29:37 -0700 From: Andrew Morton To: "Theodore Ts'o" Cc: "Artem S. Tashkinov" , torvalds@linux-foundation.org, fengguang.wu@intel.com, linux-kernel@vger.kernel.org Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-Id: <20131025022937.12623dcd.akpm@linux-foundation.org> In-Reply-To: <20131025091842.GA28681@thunk.org> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-redhat-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Oct 2013 05:18:42 -0400 "Theodore Ts'o" wrote: > What I think would make sense is to dynamically measure the speed of > writeback, so that we can set these limits as a function of the device > speed. We attempt to do this now - have a look through struct backing_dev_info. Apparently all this stuff isn't working as desired (and perhaps as designed) in this case. 
Will take a look after a return to normalcy ;) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752658Ab3JYJcS (ORCPT ); Fri, 25 Oct 2013 05:32:18 -0400 Received: from mail-vb0-f51.google.com ([209.85.212.51]:36246 "EHLO mail-vb0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751377Ab3JYJcR (ORCPT ); Fri, 25 Oct 2013 05:32:17 -0400 MIME-Version: 1.0 In-Reply-To: <20131025022937.12623dcd.akpm@linux-foundation.org> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org> Date: Fri, 25 Oct 2013 10:32:16 +0100 X-Google-Sender-Auth: Opy72cm7zev4qzYO7PtIp8EuQQU Message-ID: Subject: Re: Disabling in-memory write cache for x86-64 in Linux II From: Linus Torvalds To: Andrew Morton Cc: "Theodore Ts'o" , "Artem S. Tashkinov" , Wu Fengguang , Linux Kernel Mailing List Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton wrote: > > Apparently all this stuff isn't working as desired (and perhaps as designed) > in this case. Will take a look after a return to normalcy ;) It definitely doesn't work. I can trivially reproduce problems by just having a cheap (==slow) USB key with an ext3 filesystem, and doing a git clone to it. The end result is not pretty, and that's actually not even a huge amount of data.
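[Note added to the archive, not part of the message above.] To put the slow-USB-key case in rough numbers, here is a minimal sketch that estimates how much dirty data a percentage ratio allows and how long a slow device needs to drain it. It approximates the kernel's "dirtyable memory" by total RAM, as the messages in this thread do; the function names and the 3 MiB/s device speed in the example are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not kernel interfaces.

# dirty_limit_bytes MEM_BYTES PERCENT
# Dirty limit implied by a percentage ratio, using total RAM as a stand-in
# for the kernel's "dirtyable memory" (an approximation).
dirty_limit_bytes() {
    echo $(( $1 * $2 / 100 ))
}

# drain_seconds DIRTY_BYTES BYTES_PER_SEC
# Whole seconds needed to write DIRTY_BYTES at a sustained device speed.
drain_seconds() {
    echo $(( $1 / $2 ))
}

# Example: 16 GiB of RAM, the 20% hard limit, a 3 MiB/s USB stick.
#   limit=$(dirty_limit_bytes $((16 * 1024 * 1024 * 1024)) 20)
#   drain_seconds "$limit" $((3 * 1024 * 1024))
```

With those example inputs the limit comes to about 3.2 GiB and the drain time to roughly 18 minutes, which is in line with the multi-minute "sync" times reported at the start of the thread.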
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753490Ab3JYL2c (ORCPT ); Fri, 25 Oct 2013 07:28:32 -0400 Received: from mail.lang.hm ([64.81.33.126]:38450 "EHLO bifrost.lang.hm" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752305Ab3JYL2b (ORCPT ); Fri, 25 Oct 2013 07:28:31 -0400 Date: Fri, 25 Oct 2013 04:28:27 -0700 (PDT) From: David Lang X-X-Sender: dlang@asgard.lang.hm To: Linus Torvalds cc: "Artem S. Tashkinov" , Wu Fengguang , Andrew Morton , Linux Kernel Mailing List Subject: Re: Disabling in-memory write cache for x86-64 in Linux II In-Reply-To: Message-ID: References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Oct 2013, Linus Torvalds wrote: > On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov wrote: >> >> My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be >> percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or >> more) this value becomes unrealistic (13GB) and I've already had some >> unpleasant effects due to it. > > Right. The percentage notion really goes back to the days when we > typically had 8-64 *megabytes* of memory So if you had a 8MB machine > you wouldn't want to have more than one megabyte of dirty data, but if > you were "Mr Moneybags" and could afford 64MB, you might want to have > up to 8MB dirty!! > > Things have changed. > > So I would suggest we change the defaults. Or pwehaps make the rule be > that "the ratio numbers are 'ratio of memory up to 1GB'", to make the > semantics similar across 32-bit HIGHMEM machines and 64-bit machines. 
If you go this direction, allow ratios larger than 100%, some people may be willing to have huge amounts of dirty data on large memory machines (if the load is extremely bursty, they don't have other needs for I/O, or they have a very fast storage system, as a few examples) David Lang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753336Ab3JYL0w (ORCPT ); Fri, 25 Oct 2013 07:26:52 -0400 Received: from mail.lang.hm ([64.81.33.126]:39973 "EHLO bifrost.lang.hm" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752573Ab3JYL0u (ORCPT ); Fri, 25 Oct 2013 07:26:50 -0400 Date: Fri, 25 Oct 2013 04:26:37 -0700 (PDT) From: David Lang X-X-Sender: dlang@asgard.lang.hm To: NeilBrown cc: "Artem S. Tashkinov" , linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org Subject: Re: Disabling in-memory write cache for x86-64 in Linux II In-Reply-To: <20131025214952.3eb41201@notabene.brown> Message-ID: References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 25 Oct 2013, NeilBrown wrote: > On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov" > wrote: > >> Hello! >> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel >> built for the i686 (with PAE) and x86-64 architectures. 
What's really troubling me >> is that the x86-64 kernel has the following problem: >> >> When I copy large files to any storage device, be it my HDD with ext4 partitions >> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely >> then flushes them some time later (quite unpredictably though) or immediately upon >> invoking "sync". >> >> How can I disable this memory cache altogether (or at least minimize caching)? When >> running the i686 kernel with the same configuration I don't observe this effect - files get >> written out almost immediately (for instance "sync" takes less than a second, whereas >> on x86-64 it can take a dozen of _minutes_ depending on a file size and storage >> performance). > > What exactly is bothering you about this? The amount of memory used or the > time until data is flushed? Actually, I think the problem is more the impact of the huge write later on. David Lang > If the latter, then /proc/sys/vm/dirty_expire_centisecs is where you want to > look. > This defaults to 30 seconds (3000 centisecs). > You could make it smaller (providing you also shrink > dirty_writeback_centisecs in a similar ratio) and the VM will flush out data > more quickly. > > NeilBrown > > >> >> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX) >> - firstly this command is detrimental to the performance of my PC, secondly, it won't help >> in this instance. >> >> Swap is totally disabled, usually my memory is entirely free. >> >> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531 >> >> Please, advise.
>> >> Best regards, >> >> Artem >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754948Ab3JYS02 (ORCPT ); Fri, 25 Oct 2013 14:26:28 -0400 Received: from smtprelay0123.b.hostedemail.com ([64.98.42.123]:56249 "EHLO smtprelay.b.hostedemail.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752703Ab3JYS01 (ORCPT ); Fri, 25 Oct 2013 14:26:27 -0400 X-Session-Marker: 742E617274656D406C79636F732E636F6D X-Spam-Summary: 2,0,0,,d41d8cd98f00b204,t.artem@lycos.com,:::::::::::::,RULES_HIT:41:152:355:379:582:599:973:988:989:1152:1260:1277:1311:1313:1314:1345:1437:1515:1516:1518:1534:1541:1593:1594:1711:1730:1747:1777:1792:2393:2553:2559:2562:2690:2692:2693:3138:3139:3140:3141:3142:3353:3622:3865:3866:3867:3868:3870:3871:3872:3873:3874:4250:4361:5007:6261:7875:7903:8526:8957:10004:10400:10450:10455:10848:11232:11658:11914:12517:12519:12663:12740:13069:13160:13229:13311:13357:13869:19904:19999,0,RBL:none,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0 X-HE-Tag: jar43_54f05e920e923 X-Filterd-Recvd-Size: 2404 Date: Fri, 25 Oct 2013 18:26:23 +0000 (UTC) From: "Artem S. 
Tashkinov" To: david@lang.hm Cc: neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org Message-ID: <154617470.12445.1382725583671.JavaMail.mail@webmail11> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> Subject: Re: Disabling in-memory write cache for x86-64 in Linux II MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [46.147.29.47] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Oct 25, 2013 05:26:45 PM, david wrote: On Fri, 25 Oct 2013, NeilBrown wrote: > >> >> What exactly is bothering you about this? The amount of memory used or the >> time until data is flushed? > >actually, I think the problem is more the impact of the huge write later on. Exactly. And not being able to use applications which show you IO performance like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine my life without being able to see the progress of a copying operation. With the current dirty cache there's no way to understand how your storage media actually behaves. Hopefully this issue won't dissolve into obscurity and someone will actually make up a plan (and a patch) for how to make the dirty write cache behave in a sane manner, considering the fact that there are devices with very different write speeds and requirements. It'd be even better if I could specify dirty cache as a mount option (though sane defaults or semi-automatic values based on runtime estimates won't hurt). Per device dirty cache seems like a nice idea. I, for one, would like to disable it altogether or make it an absolute minimum for things like USB flash drives - because I don't care about multithreaded performance or delayed allocation on such devices - I'm interested in my data reaching my USB stick ASAP - because it's how most people use them.
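[Note added to the archive, not part of the message above.] Mainline kernels expose a per-device knob, max_ratio, under /sys/class/bdi/<major:minor>/, which caps the share of the global dirty limit that one backing device may use. It is not a mount option and it does not switch caching off, so it only partly covers the wish expressed above. A minimal sketch, assuming a whole-disk device named "sdb"; the helper name bdi_max_ratio_path and its optional second argument are made up for the example.

```shell
#!/bin/sh
# bdi_max_ratio_path DEVNAME [SYSFS_ROOT]
# Print the sysfs path of the max_ratio knob for a whole-disk block device
# such as "sdb". SYSFS_ROOT defaults to /sys and exists only so the helper
# can be tried against a scratch directory.
bdi_max_ratio_path() {
    root=${2:-/sys}
    # <root>/block/<dev>/dev holds the device number as "major:minor".
    devno=$(cat "$root/block/$1/dev") || return 1
    echo "$root/class/bdi/$devno/max_ratio"
}

# Example (as root): let sdb use at most 1% of the dirty limit.
#   echo 1 > "$(bdi_max_ratio_path sdb)"
```

The value is a percentage of the global limit, so the effective cap still scales with dirty_ratio or dirty_bytes; lowering those as suggested earlier in the thread remains the first step.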
Regards, Artem From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754831Ab3JYTnM (ORCPT ); Fri, 25 Oct 2013 15:43:12 -0400 Received: from mail-wi0-f182.google.com ([209.85.212.182]:35134 "EHLO mail-wi0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753020Ab3JYTnK convert rfc822-to-8bit (ORCPT ); Fri, 25 Oct 2013 15:43:10 -0400 From: Diego Calleja To: "Artem S. Tashkinov" Cc: david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Fri, 25 Oct 2013 21:40:13 +0200 Message-ID: <1999200.Zdacx0scmY@diego-arch> User-Agent: KMail/4.11.2 (Linux/3.12.0-rc5; KDE/4.11.2; x86_64; ; ) In-Reply-To: <154617470.12445.1382725583671.JavaMail.mail@webmail11> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <154617470.12445.1382725583671.JavaMail.mail@webmail11> MIME-Version: 1.0 Content-Transfer-Encoding: 8BIT Content-Type: text/plain; charset="iso-8859-1" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Friday, 25 October 2013 18:26:23 Artem S. Tashkinov wrote: > Oct 25, 2013 05:26:45 PM, david wrote: > >actually, I think the problem is more the impact of the huge write later > >on. > Exactly. And not being able to use applications which show you IO > performance like Midnight Commander. You might prefer to use "cp -a" but I > cannot imagine my life without being able to see the progress of a copying > operation. With the current dirty cache there's no way to understand how > your storage media actually behaves. This is a problem I have also been suffering from for a long time. It's not so much how much and when the system syncs dirty data, but how unresponsive the desktop becomes when it happens (usually, with rsync + large files).
Most programs become completely unresponsive, especially if they have large memory consumption (e.g. the browser). I need to pause rsync and wait until the system writes out all dirty data if I want to do simple things like scrolling or do any action that uses I/O, otherwise I need to wait minutes. I have 16 GB of RAM and excluding the browser (which usually uses about half of a GB) and KDE itself, there are no memory hogs, so it seems like it's something that shouldn't happen. I can understand that I/O operations are laggy when there is some other intensive I/O ongoing, but right now the system becomes completely unresponsive. If I am unlucky and Konsole also becomes unresponsive, I need to switch to a VT (which also takes time). I haven't reported it before in part because I didn't know how to do it: "my browser stalls" is not a very useful description and I didn't know what kind of data I'm supposed to report. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755505Ab3JYVDr (ORCPT ); Fri, 25 Oct 2013 17:03:47 -0400 Received: from smtprelay0170.b.hostedemail.com ([64.98.42.170]:49509 "EHLO smtprelay.b.hostedemail.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1755364Ab3JYVDq (ORCPT ); Fri, 25 Oct 2013 17:03:46 -0400 X-Session-Marker: 742E617274656D406C79636F732E636F6D X-Spam-Summary: 2,0,0,,d41d8cd98f00b204,t.artem@lycos.com,:::::::::::::,RULES_HIT:41:152:355:379:582:599:973:988:989:1152:1260:1277:1311:1313:1314:1345:1437:1515:1516:1518:1534:1542:1593:1594:1711:1730:1747:1777:1792:2393:2553:2559:2562:2692:2693:3138:3139:3140:3141:3142:3355:3622:3865:3866:3867:3868:3870:3871:3872:3873:3874:4250:4361:5007:6119:6261:6691:7875:7903:8526:8660:10004:10400:10848:10967:11232:11658:11914:12517:12519:12663:12740:13148:13230:13869,0,RBL:none,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0 X-HE-Tag:
star49_3896e2947c58 X-Filterd-Recvd-Size: 3418 Date: Fri, 25 Oct 2013 21:03:44 +0000 (UTC) From: "Artem S. Tashkinov" To: neilb@suse.de Cc: david@lang.hm, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org Message-ID: <476525596.14731.1382735024280.JavaMail.mail@webmail11> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11><20131026074349.0adc9646@notabene.brown> Subject: Re: Disabling in-memory write cache for x86-64 in Linux II MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [46.147.29.47] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Oct 26, 2013 02:44:07 AM, neil wrote: On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" >> >> Exactly. And not being able to use applications which show you IO performance >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine >> my life without being able to see the progress of a copying operation. With the current >> dirty cache there's no way to understand how your storage media actually behaves. > >So fix Midnight Commander. If you want the copy to be actually finished when >it says it is finished, then it needs to call 'fsync()' at the end. This sounds like a very bad joke. How are applications supposed to show and calculate an _average_ write speed if there are no kernel calls/ioctls to actually make the kernel flush dirty buffers _during_ copying? Actually it's a good way to solve this problem in user space - alas, even if such calls are implemented, user space will start using them only in 2018, if not later.
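[Note added to the archive, not part of the message above.] One user-space way to get a progress figure that follows the device rather than the page cache is to flush after every chunk instead of once at the end. A minimal sketch, assuming GNU dd (for conv=fsync and iflag=fullblock); the function name copy_synced and the 16 MiB default chunk size are arbitrary choices for illustration, and this is not how Midnight Commander or cp behave.

```shell
#!/bin/sh
# copy_synced SRC DST [CHUNK_MIB]
# Copy SRC to DST in CHUNK_MIB-sized pieces. Each piece is written with
# dd conv=fsync, so dd does not return until that piece has been flushed
# to the device; the progress line on stderr therefore tracks the device.
copy_synced() {
    src=$1
    dst=$2
    chunk_bytes=$(( ${3:-16} * 1024 * 1024 ))
    size=$(wc -c < "$src") || return 1
    nchunks=$(( (size + chunk_bytes - 1) / chunk_bytes ))
    : > "$dst" || return 1          # create or truncate the destination
    i=0
    while [ "$i" -lt "$nchunks" ]; do
        # skip/seek are counted in units of bs; notrunc keeps earlier chunks.
        dd if="$src" of="$dst" bs="$chunk_bytes" count=1 \
           skip="$i" seek="$i" iflag=fullblock conv=notrunc,fsync \
           2>/dev/null || return 1
        i=$((i + 1))
        printf '%d/%d chunks on disk\n' "$i" "$nchunks" >&2
    done
}

# Example:
#   copy_synced big.iso /mnt/usb/big.iso 8
```

Flushing per chunk trades some throughput for an honest progress display; a larger chunk size reduces that cost at the price of coarser updates.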
>> >> Per device dirty cache seems like a nice idea, I, for one, would like to disable it >> altogether or make it an absolute minimum for things like USB flash drives - because >> I don't care about multithreaded performance or delayed allocation on such devices - >> I'm interested in my data reaching my USB stick ASAP - because it's how most people >> use them. >> > >As has already been said, you can substantially disable the cache by tuning >down various values in /proc/sys/vm/. >Have you tried? I don't understand who you are replying to. I asked about per device settings, you are again referring me to system wide settings - they don't look that good if we're talking about a 3MB/sec flash drive and 500MB/sec SSD drive. Besides it makes no sense to allocate 20% of physical RAM for things which don't belong to it in the first place. I don't know any other OS which has a similar behaviour. And like people (including me) have already mentioned, such a huge dirty cache can stall their PCs/servers for a considerable amount of time. Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also not everyone in this world has an UPS - which means such a huge buffer can lead to a serious data loss in case of a power blackout. Regards, Artem From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752736Ab3JYWhu (ORCPT ); Fri, 25 Oct 2013 18:37:50 -0400 Received: from mga03.intel.com ([143.182.124.21]:39360 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751935Ab3JYWht (ORCPT ); Fri, 25 Oct 2013 18:37:49 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.93,573,1378882800"; d="scan'208";a="417367322" Date: Fri, 25 Oct 2013 23:37:42 +0100 From: Fengguang Wu To: Andrew Morton Cc: "Theodore Ts'o" , "Artem S. 
Tashkinov" , torvalds@linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131025223742.GA31280@localhost> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131025022937.12623dcd.akpm@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 25, 2013 at 02:29:37AM -0700, Andrew Morton wrote: > On Fri, 25 Oct 2013 05:18:42 -0400 "Theodore Ts'o" wrote: > > > What I think would make sense is to dynamically measure the speed of > > writeback, so that we can set these limits as a function of the device > > speed. > > We attempt to do this now - have a look through struct backing_dev_info. To be exact, it's backing_dev_info.write_bandwidth which is estimated in bdi_update_write_bandwidth() and exported as "BdiWriteBandwidth" in debugfs file bdi.stats. > Apparently all this stuff isn't working as desired (and perhaps as designed) > in this case. Will take a look after a return to normalcy ;) Right. The write bandwidth is only estimated and used when the background dirty threshold is reached and hence the disk is actively doing writeback IO -- which is the case in which we can make a reasonable estimate of the writeback bandwidth. Note that this estimated BdiWriteBandwidth might better be named "writeback" bandwidth because it may change depending on the workload at the time -- e.g. sequential vs. random writes; whether there are parallel reads or direct IO competing for disk time. BdiWriteBandwidth is only designed for use by the dirty throttling logic and is not generally useful/reliable for other purposes.
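[Note added to the archive, not part of the message above.] The estimate described above can be inspected from user space. A minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and that the per-device file is bdi/<major:minor>/stats with a line of the form "BdiWriteBandwidth: <n> kBps"; the helper name is made up for the example.

```shell
#!/bin/sh
# bdi_write_bandwidth_kbps STATS_FILE
# Print the numeric value of the BdiWriteBandwidth line (in kBps).
# Exits non-zero if the file has no such line.
bdi_write_bandwidth_kbps() {
    awk '$1 == "BdiWriteBandwidth:" { print $2; found = 1 }
         END { exit !found }' "$1"
}

# Example (as root), for the device with number 8:16:
#   bdi_write_bandwidth_kbps /sys/kernel/debug/bdi/8:16/stats
```

As the message says, the figure is only refreshed while writeback is active and depends on the workload, so it is best read during a large copy rather than on an idle system.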
It's a bit late and I'd like to carry the original question as exercises in tomorrow's airplanes. :) Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753348Ab3JYXFy (ORCPT ); Fri, 25 Oct 2013 19:05:54 -0400 Received: from mga11.intel.com ([192.55.52.93]:15080 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752364Ab3JYXFx (ORCPT ); Fri, 25 Oct 2013 19:05:53 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.93,573,1378882800"; d="scan'208";a="417377553" Date: Sat, 26 Oct 2013 00:05:45 +0100 From: Fengguang Wu To: "Theodore Ts'o" , "Artem S. Tashkinov" , torvalds@linux-foundation.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org Cc: Diego Calleja , David Lang , NeilBrown Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131025230545.GB31280@localhost> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131025091842.GA28681@thunk.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 25, 2013 at 05:18:42AM -0400, Theodore Ts'o wrote: > On Fri, Oct 25, 2013 at 08:30:53AM +0000, Artem S. Tashkinov wrote: > > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be > > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or > > more) this value becomes unrealistic (13GB) and I've already had some > > unpleasant effects due to it. > > What I think would make sense is to dynamically measure the speed of > writeback, so that we can set these limits as a function of the device > speed. 
It's already the case that the writeback limits don't make > sense on a slow USB 2.0 storage stick; I suspect that for really huge > RAID arrays or very fast flash devices, it doesn't make much sense > either. > > The problem is that if you have a system that has *both* a USB stick > _and_ a fast flash/RAID storage array both needing writeback, this > doesn't work well --- but what we have right now doesn't work all that > well anyway. Ted, when trying to follow up your email, I got a crazy idea and it'd be better to throw it out rather than carrying it to bed. :) We could do per-bdi dirty thresholds - which have been proposed 1-2 times before by different people. The per-bdi dirty thresholds could be auto set by the kernel this way: start it with an initial value of 100MB. When reached, put all the 100MB dirty data to IO and get an estimation of the write bandwidth. From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth, where N is the number of seconds of dirty data we'd like to cache in memory. Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753186Ab3JYXiH (ORCPT ); Fri, 25 Oct 2013 19:38:07 -0400 Received: from imap.thunk.org ([74.207.234.97]:50878 "EHLO imap.thunk.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751514Ab3JYXiF (ORCPT ); Fri, 25 Oct 2013 19:38:05 -0400 Date: Fri, 25 Oct 2013 19:37:53 -0400 From: "Theodore Ts'o" To: Fengguang Wu Cc: "Artem S. Tashkinov" , torvalds@linux-foundation.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, Diego Calleja , David Lang , NeilBrown Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131025233753.GD19823@thunk.org> Mail-Followup-To: Theodore Ts'o , Fengguang Wu , "Artem S.
Tashkinov" , torvalds@linux-foundation.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, Diego Calleja , David Lang , NeilBrown References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025230545.GB31280@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131025230545.GB31280@localhost> User-Agent: Mutt/1.5.21 (2010-09-15) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on imap.thunk.org); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Oct 26, 2013 at 12:05:45AM +0100, Fengguang Wu wrote: > > Ted, when trying to follow up your email, I got a crazy idea and it'd > be better throw it out rather than carrying it to bed. :) > > We could do per-bdi dirty thresholds - which has been proposed 1-2 > times before by different people. > > The per-bdi dirty thresholds could be auto set by the kernel this way: > start it with an initial value of 100MB. When reached, put all the > 100MB dirty data to IO and get an estimation of the write bandwidth. > From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth, > where N is the seconds of dirty data we'd like to cache in memory. Sure, although I wonder if it would be worth it to calculate some kind of rolling average of the write bandwidth while we are doing writeback, so if it turns out we got unlucky with the contents of the first 100MB of dirty data (it could be either highly random or highly sequential) then we'll eventually correct to the right level. This means that the VM would have to keep dirty page counters for each BDI --- which I thought we weren't doing right now, which is why we have a global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I have cause and effect reversed?
:-) - Ted From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753230Ab3JYXcf (ORCPT ); Fri, 25 Oct 2013 19:32:35 -0400 Received: from mga14.intel.com ([143.182.124.37]:14622 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751514Ab3JYXce (ORCPT ); Fri, 25 Oct 2013 19:32:34 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.93,573,1378882800"; d="scan'208";a="417387682" Date: Sat, 26 Oct 2013 00:32:25 +0100 From: Fengguang Wu To: Diego Calleja Cc: "Artem S. Tashkinov" , david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131025233225.GA32051@localhost> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <1999200.Zdacx0scmY@diego-arch> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1999200.Zdacx0scmY@diego-arch> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 25, 2013 at 09:40:13PM +0200, Diego Calleja wrote: > El Viernes, 25 de octubre de 2013 18:26:23 Artem S. Tashkinov escribió: > > Oct 25, 2013 05:26:45 PM, david wrote: > > >actually, I think the problem is more the impact of the huge write later > > >on. > > Exactly. And not being able to use applications which show you IO > > performance like Midnight Commander. You might prefer to use "cp -a" but I > > cannot imagine my life without being able to see the progress of a copying > > operation. With the current dirty cache there's no way to understand how > > you storage media actually behaves. > > > This is a problem I also have been suffering for a long time. 
It's not so much
> how much and when the system syncs dirty data, but how unresponsive the
> desktop becomes when it happens (usually, with rsync + large files). Most
> programs become completely unresponsive, especially if they have a large memory
> consumption (i.e. the browser). I need to pause rsync and wait until the
> system writes out all dirty data if I want to do simple things like scrolling
> or do any action that uses I/O, otherwise I need to wait minutes.

That's a problem. And it's kind of independent of the dirty threshold -- if you are doing large file copies in the background, it will lead to continuous disk writes and stalls anyway -- the large dirty threshold merely delays the write IO time.

> I have 16 GB of RAM and excluding the browser (which usually uses about half
> of a GB) and KDE itself, there are no memory hogs, so it seems like it's
> something that shouldn't happen. I can understand that I/O operations are
> laggy when there is some other intensive I/O ongoing, but right now the system
> becomes completely unresponsive. If I am unlucky and Konsole also becomes
> unresponsive, I need to switch to a VT (which also takes time).
>
> I haven't reported it before in part because I didn't know how to do it, "my
> browser stalls" is not a very useful description and I didn't know what kind
> of data I'm supposed to report.

What's the kernel you are running? And it's writing to a hard disk?

The stalls are most likely caused by either one of

1) write IO starves read IO
2) direct page reclaim blocked when
   - trying to writeout PG_dirty pages
   - trying to lock PG_writeback pages

Which may be confirmed by running

        ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
or
        echo w > /proc/sysrq-trigger    # and check dmesg

during the stalls. The latter command works more reliably.
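A minimal sketch of how the ps output above might be narrowed down to the tasks that are actually stuck. It assumes procps-style ps and that tasks stalled on IO appear in uninterruptible sleep, i.e. with a STAT value starting with "D". The helper name blocked_tasks is hypothetical and not something from this thread.

```shell
#!/bin/sh
# blocked_tasks: hypothetical helper, not part of the thread.
# Reads ps output on stdin and prints the header line plus every row
# whose STAT column (the 4th field with the column list used above)
# starts with "D" (uninterruptible sleep).
blocked_tasks() {
    awk 'NR == 1 || $4 ~ /^D/'
}

# Intended use while the desktop is stalling:
#   ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | blocked_tasks
```

The WCHAN column of the surviving rows then hints at where each task is waiting, which is the information the sysrq dump in dmesg gives in more detail.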
Thanks,
Fengguang

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752317Ab3JZLco (ORCPT ); Sat, 26 Oct 2013 07:32:44 -0400
Received: from atrey.karlin.mff.cuni.cz ([195.113.26.193]:59674 "EHLO atrey.karlin.mff.cuni.cz" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751558Ab3JZLcn (ORCPT ); Sat, 26 Oct 2013 07:32:43 -0400
Date: Sat, 26 Oct 2013 13:32:38 +0200
From: Pavel Machek
To: Linus Torvalds
Cc: Andrew Morton , "Theodore Ts'o" , "Artem S. Tashkinov" , Wu Fengguang , Linux Kernel Mailing List
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131026113238.GC1792@Nokia-N900>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri 2013-10-25 10:32:16, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton wrote:
> >
> > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > in this case. Will take a look after a return to normalcy ;)
>
> It definitely doesn't work. I can trivially reproduce problems by just
> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> git clone to it. The end result is not pretty, and that's actually not
> even a huge amount of data.

Hmm, I'd expect the result to be "dead USB key". Putting ext3 on a cheap flash device normally just kills the device :-(.
-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753842Ab3JZUD6 (ORCPT ); Sat, 26 Oct 2013 16:03:58 -0400 Received: from mail-vb0-f43.google.com ([209.85.212.43]:53824 "EHLO mail-vb0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753624Ab3JZUD5 (ORCPT ); Sat, 26 Oct 2013 16:03:57 -0400 MIME-Version: 1.0 In-Reply-To: <20131026113238.GC1792@Nokia-N900> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org> <20131026113238.GC1792@Nokia-N900> Date: Sat, 26 Oct 2013 13:03:56 -0700 X-Google-Sender-Auth: 0RSMrceYxz6tqTovNYao_JP4uQQ Message-ID: Subject: Re: Disabling in-memory write cache for x86-64 in Linux II From: Linus Torvalds To: Pavel Machek Cc: Andrew Morton , "Theodore Ts'o" , "Artem S. Tashkinov" , Wu Fengguang , Linux Kernel Mailing List Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Oct 26, 2013 at 4:32 AM, Pavel Machek wrote: > > Hmm, I'd expect the result to be "dead USB key". Putting > ext3 on cheap flash device normally just kills the devic :-(. Not my experience. It may be true for some really cheap devices, but normal USB keys seem to just get really slow, probably due to having had their flash rewrite algorithm tuned for FAT accesses. I *do* suspect that to see the really bad behavior, you don't write just one large file to it, but many smaller ones. "git clone" will check out all the kernel tree files, obviously. 
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753176Ab3J2Ua5 (ORCPT ); Tue, 29 Oct 2013 16:30:57 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49623 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752101Ab3J2Ua4 (ORCPT ); Tue, 29 Oct 2013 16:30:56 -0400 Date: Tue, 29 Oct 2013 21:30:50 +0100 From: Jan Kara To: Karl Kiniger Cc: Linus Torvalds , "Artem S. Tashkinov" , Wu Fengguang , Andrew Morton , Linux Kernel Mailing List Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131029203050.GE9568@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091555.GA30895@kipc2.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131025091555.GA30895@kipc2.localdomain> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 25-10-13 11:15:55, Karl Kiniger wrote: > On Fri 131025, Linus Torvalds wrote: > > On Fri, Oct 25, 2013 at 9:30 AM, Artem S. Tashkinov wrote: > > > > > > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be > > > percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or > > > more) this value becomes unrealistic (13GB) and I've already had some > > > unpleasant effects due to it. > > > > Right. The percentage notion really goes back to the days when we > > typically had 8-64 *megabytes* of memory So if you had a 8MB machine > > you wouldn't want to have more than one megabyte of dirty data, but if > > you were "Mr Moneybags" and could afford 64MB, you might want to have > > up to 8MB dirty!! > > > > Things have changed. > > > > So I would suggest we change the defaults. 
Or pwehaps make the rule be > > that "the ratio numbers are 'ratio of memory up to 1GB'", to make the > > semantics similar across 32-bit HIGHMEM machines and 64-bit machines. > > > > The modern way of expressing the dirty limits are to give the actual > > absolute byte amounts, but we default to the legacy ratio mode.. > > > > Linus > > Is it currently possible to somehow set above values per block device? Yes, to some extent. You can set /sys/block//bdi/max_ratio to the maximum proportion the device's dirty data can take from the total amount. The caveat currently is that this setting only takes effect after we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in total because that is an amount of dirty data when we start to throttle processes. So if the device you'd like to limit is the only one which is currently written to, the limiting doesn't have a big effect. Andrew has queued up a patch series from Maxim Patlasov which removes this caveat but currently we don't have a way admin can switch that from userspace. But I'd like to have that tunable from userspace exactly for the cases as you describe below. > I want default behaviour for almost everything but DVD drives in DVD+RW > packet writing mode may easily take several minutes in case of a sync. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752635Ab3J2UlD (ORCPT ); Tue, 29 Oct 2013 16:41:03 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49893 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751615Ab3J2UlA (ORCPT ); Tue, 29 Oct 2013 16:41:00 -0400 Date: Tue, 29 Oct 2013 21:40:52 +0100 From: Jan Kara To: "Theodore Ts'o" Cc: Fengguang Wu , "Artem S. 
Tashkinov" , torvalds@linux-foundation.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, Diego Calleja , David Lang , NeilBrown
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131029204052.GF9568@quack.suse.cz>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025230545.GB31280@localhost> <20131025233753.GD19823@thunk.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20131025233753.GD19823@thunk.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri 25-10-13 19:37:53, Ted Tso wrote:
> On Sat, Oct 26, 2013 at 12:05:45AM +0100, Fengguang Wu wrote:
> >
> > Ted, when trying to follow up your email, I got a crazy idea and it'd
> > be better throw it out rather than carrying it to bed. :)
> >
> > We could do per-bdi dirty thresholds - which has been proposed 1-2
> > times before by different people.
> >
> > The per-bdi dirty thresholds could be auto set by the kernel this way:
> > start it with an initial value of 100MB. When reached, put all the
> > 100MB dirty data to IO and get an estimation of the write bandwidth.
> > From then on, set the bdi's dirty threshold to N * bdi_write_bandwidth,
> > where N is the seconds of dirty data we'd like to cache in memory.
>
> Sure, although I wonder if it would be worth it to calculate some kind of
> rolling average of the write bandwidth while we are doing writeback,
> so if it turns out we got unlucky with the contents of the first 100MB
> of dirty data (it could be either highly random or highly sequential)
> then we'll eventually correct to the right level.

We already average the measured throughput over a longer time window, using a kind of rolling-average algorithm.
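To make the "N * bdi_write_bandwidth" idea quoted above concrete, here is a rough sketch of a threshold driven by a rolling average. It is not the kernel's actual estimator: the function names, the 1/8 weight, the KB/s units and the sample values are all assumptions chosen for illustration.

```shell
#!/bin/sh
# ewma_update OLD SAMPLE
# Blend a new bandwidth sample (KB/s) into the running average with
# weight 1/8, using integer arithmetic only. Hypothetical helper.
ewma_update() {
    echo $(( ($1 * 7 + $2) / 8 ))
}

# bdi_threshold_kb BANDWIDTH SECONDS
# Dirty threshold in KB that holds SECONDS worth of writeback at
# BANDWIDTH KB/s, i.e. N * bdi_write_bandwidth. Hypothetical helper.
bdi_threshold_kb() {
    echo $(( $1 * $2 ))
}

avg=10240                            # assumed initial estimate: 10 MB/s
for sample in 2048 2048 2048; do     # device turns out to be a slow USB key
    avg=$(ewma_update "$avg" "$sample")
done
echo "avg=${avg} KB/s threshold=$(bdi_threshold_kb "$avg" 3) KB"
```

With a small weight the estimate moves only gradually toward the new samples, so one unlucky measurement on the first 100MB does not pin the threshold at the wrong level.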
> This means that VM would have to keep dirty page counters for each BDI > --- which I thought we weren't doing right now, which is why we have a > global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I > have cause and effect reversed? :-) And we do currently keep the number of dirty & under writeback pages per BDI. We have global limits because mm wants to limit the total number of dirty pages (as those are harder to free). It doesn't care as much to which device these pages belong (although it probably should care a bit more because there are huge differences between how quickly can different devices get rid of dirty pages). Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752763Ab3J2Unt (ORCPT ); Tue, 29 Oct 2013 16:43:49 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:38628 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752040Ab3J2Uns (ORCPT ); Tue, 29 Oct 2013 16:43:48 -0400 Date: Tue, 29 Oct 2013 13:43:46 -0700 From: Andrew Morton To: Jan Kara Cc: Karl Kiniger , Linus Torvalds , "Artem S. 
Tashkinov" , Wu Fengguang , Linux Kernel Mailing List Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-Id: <20131029134346.9e5873bae3630e9a69891773@linux-foundation.org> In-Reply-To: <20131029203050.GE9568@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091555.GA30895@kipc2.localdomain> <20131029203050.GE9568@quack.suse.cz> X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara wrote: > Andrew has queued up a patch series from Maxim Patlasov which removes this > caveat but currently we don't have a way admin can switch that from > userspace. But I'd like to have that tunable from userspace exactly for the > cases as you describe below. This? commit 5a53748568f79641eaf40e41081a2f4987f005c2 Author: Maxim Patlasov AuthorDate: Wed Sep 11 14:22:46 2013 -0700 Commit: Linus Torvalds CommitDate: Wed Sep 11 15:58:04 2013 -0700 mm/page-writeback.c: add strictlimit feature That's already in mainline, for 3.12. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752092Ab3J2U6A (ORCPT ); Tue, 29 Oct 2013 16:58:00 -0400 Received: from cantor2.suse.de ([195.135.220.15]:50470 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751380Ab3J2U57 (ORCPT ); Tue, 29 Oct 2013 16:57:59 -0400 Date: Tue, 29 Oct 2013 21:57:56 +0100 From: Jan Kara To: Linus Torvalds Cc: Andrew Morton , "Theodore Ts'o" , "Artem S. 
Tashkinov" , Wu Fengguang , Linux Kernel Mailing List , mgorman@suse.de
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131029205756.GH9568@quack.suse.cz>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri 25-10-13 10:32:16, Linus Torvalds wrote:
> On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton wrote:
> >
> > Apparently all this stuff isn't working as desired (and perhaps as designed)
> > in this case. Will take a look after a return to normalcy ;)
>
> It definitely doesn't work. I can trivially reproduce problems by just
> having a cheap (==slow) USB key with an ext3 filesystem, and doing a
> git clone to it. The end result is not pretty, and that's actually not
> even a huge amount of data.

I'll try to reproduce this tomorrow so that I can have a look at where exactly we are stuck. But in the last few releases problems like this were caused by problems in reclaim, which got fed up with seeing lots of dirty / under writeback pages and ended up stuck waiting for IO to finish. Mel has been tweaking the logic here and there but maybe it hasn't been fixed completely. Mel, do you know about any outstanding issues?
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752738Ab3J2Utl (ORCPT ); Tue, 29 Oct 2013 16:49:41 -0400 Received: from cantor2.suse.de ([195.135.220.15]:50070 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751396Ab3J2Utj (ORCPT ); Tue, 29 Oct 2013 16:49:39 -0400 Date: Tue, 29 Oct 2013 21:49:37 +0100 From: Jan Kara To: "Artem S. Tashkinov" Cc: david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131029204937.GG9568@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <154617470.12445.1382725583671.JavaMail.mail@webmail11> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri 25-10-13 18:26:23, Artem S. Tashkinov wrote: > Oct 25, 2013 05:26:45 PM, david wrote: > On Fri, 25 Oct 2013, NeilBrown wrote: > > > >> > >> What exactly is bothering you about this? The amount of memory used or the > >> time until data is flushed? > > > >actually, I think the problem is more the impact of the huge write later on. > > Exactly. And not being able to use applications which show you IO > performance like Midnight Commander. You might prefer to use "cp -a" but > I cannot imagine my life without being able to see the progress of a > copying operation. With the current dirty cache there's no way to > understand how you storage media actually behaves. Large writes shouldn't stall your desktop, that's certain and we must fix that. 
I don't find the problem with copy progress indicators that pressing... > Hopefully this issue won't dissolve into obscurity and someone will > actually make up a plan (and a patch) how to make dirty write cache > behave in a sane manner considering the fact that there are devices with > very different write speeds and requirements. It'd be ever better, if I > could specify dirty cache as a mount option (though sane defaults or > semi-automatic values based on runtime estimates won't hurt). > > Per device dirty cache seems like a nice idea, I, for one, would like to > disable it altogether or make it an absolute minimum for things like USB > flash drives - because I don't care about multithreaded performance or > delayed allocation on such devices - I'm interested in my data reaching > my USB stick ASAP - because it's how most people use them. See my other emails in this thread. There are ways to tune the amount of dirty data allowed per device. Currently the result isn't very satisfactory but we should have something usable after the next merge window. Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752803Ab3J2Vam (ORCPT ); Tue, 29 Oct 2013 17:30:42 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51334 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751517Ab3J2Val (ORCPT ); Tue, 29 Oct 2013 17:30:41 -0400 Date: Tue, 29 Oct 2013 22:30:37 +0100 From: Jan Kara To: Andrew Morton Cc: Jan Kara , Karl Kiniger , Linus Torvalds , "Artem S. 
Tashkinov" , Wu Fengguang , Linux Kernel Mailing List Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131029213037.GB12814@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091555.GA30895@kipc2.localdomain> <20131029203050.GE9568@quack.suse.cz> <20131029134346.9e5873bae3630e9a69891773@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131029134346.9e5873bae3630e9a69891773@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 29-10-13 13:43:46, Andrew Morton wrote: > On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara wrote: > > > Andrew has queued up a patch series from Maxim Patlasov which removes this > > caveat but currently we don't have a way admin can switch that from > > userspace. But I'd like to have that tunable from userspace exactly for the > > cases as you describe below. > > This? > > commit 5a53748568f79641eaf40e41081a2f4987f005c2 > Author: Maxim Patlasov > AuthorDate: Wed Sep 11 14:22:46 2013 -0700 > Commit: Linus Torvalds > CommitDate: Wed Sep 11 15:58:04 2013 -0700 > > mm/page-writeback.c: add strictlimit feature > > That's already in mainline, for 3.12. Yes, I should have checked the code... 
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752921Ab3J2Vd4 (ORCPT ); Tue, 29 Oct 2013 17:33:56 -0400 Received: from mail-ve0-f171.google.com ([209.85.128.171]:49892 "EHLO mail-ve0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751929Ab3J2Vdz (ORCPT ); Tue, 29 Oct 2013 17:33:55 -0400 MIME-Version: 1.0 In-Reply-To: <20131029205756.GH9568@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org> <20131029205756.GH9568@quack.suse.cz> Date: Tue, 29 Oct 2013 14:33:53 -0700 X-Google-Sender-Auth: IdzdQ7_x3O9C2oYJdoFcZF2gZV0 Message-ID: Subject: Re: Disabling in-memory write cache for x86-64 in Linux II From: Linus Torvalds To: Jan Kara Cc: Andrew Morton , "Theodore Ts'o" , "Artem S. Tashkinov" , Wu Fengguang , Linux Kernel Mailing List , Mel Gorman Content-Type: multipart/mixed; boundary=089e0122f0f22fad9404e9e7f948 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org --089e0122f0f22fad9404e9e7f948 Content-Type: text/plain; charset=UTF-8 On Tue, Oct 29, 2013 at 1:57 PM, Jan Kara wrote: > On Fri 25-10-13 10:32:16, Linus Torvalds wrote: >> >> It definitely doesn't work. I can trivially reproduce problems by just >> having a cheap (==slow) USB key with an ext3 filesystem, and going a >> git clone to it. The end result is not pretty, and that's actually not >> even a huge amount of data. > > I'll try to reproduce this tomorrow so that I can have a look where > exactly are we stuck. But in last few releases problems like this were > caused by problems in reclaim which got fed up by seeing lots of dirty > / under writeback pages and ended up stuck waiting for IO to finish. 
Mel > has been tweaking the logic here and there but maybe it haven't got fixed > completely. Mel, do you know about any outstanding issues? I'm not sure this has ever worked, and in the last few years the common desktop memory size has continued to grow. For servers and "serious" desktops, having tons of dirty data doesn't tend to be as much of a problem, because those environments are pretty much defined by also having fairly good IO subsystems, and people seldom use crappy USB devices for more than doing things like reading pictures off them etc. And you'd not even see the problem under any such load. But it's actually really easy to reproduce by just taking your average USB key and trying to write to it. I just did it with a random ISO image, and it's _painful_. And it's not that it's painful for doing most other things in the background, but if you just happen to run anything that does "sync" (and it happens in scripts), the thing just comes to a screeching halt. For minutes. Same obviously goes with trying to eject/unmount the media etc. We've had this problem before with the whole "ratio of dirty memory" thing. It was a mistake. It made sense (and came from) back in the days when people had 16MB or 32MB of RAM, and the concept of "let's limit dirty memory to x% of that" was actually fairly reasonable. But that "x%" doesn't make much sense any more. x% of 16GB (which is quite the reasonable amount of memory for any modern desktop) is a huge thing, and in the meantime the performance of disks have gone up a lot (largely thanks to SSD's), but the *minimum* performance of disks hasn't really improved all that much (largely thanks to USB ;). So how about we just admit that the whole "ratio" thing was a big mistake, and tell people that if they want to set a dirty limit, they should do so in bytes? Which we already really do, but we default to that ratio nevertheless. 
Which is why I'd suggest we just say "the ratio works fine up to a certain amount, and makes no sense past it".

Why not make that "the ratio works fine up to a certain amount, and makes no sense past it" be part of the calculations? We actually *have* exactly that on HIGHMEM machines, where we have this configuration option of "vm_highmem_is_dirtyable" that defaults to off. It just doesn't trigger on nonhighmem machines (today: "64-bit").

So I would suggest that we just expose that "vm_highmem_is_dirtyable" on 64-bit too, and just say that anything over 1GB is highmem. That means that 32-bit and 64-bit environments will basically act the same, and I think it makes the defaults a bit saner.

Limiting the amount of dirty memory to 100MB/200MB (for "start background writing" and "wait synchronously" respectively) even if you happen to have 16GB of memory sounds like a good idea. Sure, it might make some benchmarks a bit slower, but it will at least avoid the "wait forever" symptom. And if you really have a very studly IO subsystem, the fact that it starts writing out earlier won't really be a problem.

After all, there are two reasons to do delayed writes:

 - temp-files may not be written out at all.

   Quite frankly, if you have multi-hundred-megabyte tempfiles, you've got issues.

 - coalescing writes improves throughput

   There are very much diminishing returns, and the big return is to make sure that we write things out in a good order, which a 100MB buffer should make more than possible.

So I really think that it's insane to default to 1.6GB of dirty data before you even start writing it out if you happen to have 16GB of memory.

And again: if your benchmark is to create a kernel tree and then immediately delete it, and you used to do that without doing any actual IO, then yes, the attached patch will make that go much slower.
But for that benchmark, maybe you should just set the dirty limits (in bytes) by hand, rather than expect the default kernel values to prefer benchmarks over sanity? Suggested patch attached. Comments? Linus --089e0122f0f22fad9404e9e7f948 Content-Type: text/x-patch; charset=US-ASCII; name="patch.diff" Content-Disposition: attachment; filename="patch.diff" Content-Transfer-Encoding: base64 X-Attachment-Id: f_hndnkfpp0 IGtlcm5lbC9zeXNjdGwuYyAgICAgfCAyIC0tCiBtbS9wYWdlLXdyaXRlYmFjay5jIHwgNyArKysr KystCiAyIGZpbGVzIGNoYW5nZWQsIDYgaW5zZXJ0aW9ucygrKSwgMyBkZWxldGlvbnMoLSkKCmRp ZmYgLS1naXQgYS9rZXJuZWwvc3lzY3RsLmMgYi9rZXJuZWwvc3lzY3RsLmMKaW5kZXggYjJmMDZm M2M2YTNmLi40MTFkYTU2Y2Q3MzIgMTAwNjQ0Ci0tLSBhL2tlcm5lbC9zeXNjdGwuYworKysgYi9r ZXJuZWwvc3lzY3RsLmMKQEAgLTE0MDYsNyArMTQwNiw2IEBAIHN0YXRpYyBzdHJ1Y3QgY3RsX3Rh YmxlIHZtX3RhYmxlW10gPSB7CiAJCS5leHRyYTEJCT0gJnplcm8sCiAJfSwKICNlbmRpZgotI2lm ZGVmIENPTkZJR19ISUdITUVNCiAJewogCQkucHJvY25hbWUJPSAiaGlnaG1lbV9pc19kaXJ0eWFi bGUiLAogCQkuZGF0YQkJPSAmdm1faGlnaG1lbV9pc19kaXJ0eWFibGUsCkBAIC0xNDE2LDcgKzE0 MTUsNiBAQCBzdGF0aWMgc3RydWN0IGN0bF90YWJsZSB2bV90YWJsZVtdID0gewogCQkuZXh0cmEx CQk9ICZ6ZXJvLAogCQkuZXh0cmEyCQk9ICZvbmUsCiAJfSwKLSNlbmRpZgogCXsKIAkJLnByb2Nu YW1lCT0gInNjYW5fdW5ldmljdGFibGVfcGFnZXMiLAogCQkuZGF0YQkJPSAmc2Nhbl91bmV2aWN0 YWJsZV9wYWdlcywKZGlmZiAtLWdpdCBhL21tL3BhZ2Utd3JpdGViYWNrLmMgYi9tbS9wYWdlLXdy aXRlYmFjay5jCmluZGV4IDYzODA3NTgzZDhlOC4uYjNiY2UxY2Q1OWQ1IDEwMDY0NAotLS0gYS9t bS9wYWdlLXdyaXRlYmFjay5jCisrKyBiL21tL3BhZ2Utd3JpdGViYWNrLmMKQEAgLTI0MSw4ICsy NDEsMTMgQEAgc3RhdGljIHVuc2lnbmVkIGxvbmcgZ2xvYmFsX2RpcnR5YWJsZV9tZW1vcnkodm9p ZCkKIAl4ID0gZ2xvYmFsX3BhZ2Vfc3RhdGUoTlJfRlJFRV9QQUdFUykgKyBnbG9iYWxfcmVjbGFp bWFibGVfcGFnZXMoKTsKIAl4IC09IG1pbih4LCBkaXJ0eV9iYWxhbmNlX3Jlc2VydmUpOwogCi0J aWYgKCF2bV9oaWdobWVtX2lzX2RpcnR5YWJsZSkKKwlpZiAoIXZtX2hpZ2htZW1faXNfZGlydHlh YmxlKSB7CisJCWNvbnN0IHVuc2lnbmVkIGxvbmcgR0JfcGFnZXMgPSAxMDI0KjEwMjQqMTAyNCAv IFBBR0VfU0laRTsKKwogCQl4IC09IGhpZ2htZW1fZGlydHlhYmxlX21lbW9yeSh4KTsKKwkJaWYg 
KHggPiBHQl9wYWdlcykKKwkJCXggPSBHQl9wYWdlczsKKwl9CiAKIAlyZXR1cm4geCArIDE7CS8q IEVuc3VyZSB0aGF0IHdlIG5ldmVyIHJldHVybiAwICovCiB9Cg== --089e0122f0f22fad9404e9e7f948-- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753361Ab3J2Vgp (ORCPT ); Tue, 29 Oct 2013 17:36:45 -0400 Received: from mail-vc0-f182.google.com ([209.85.220.182]:56815 "EHLO mail-vc0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752039Ab3J2Vgo (ORCPT ); Tue, 29 Oct 2013 17:36:44 -0400 MIME-Version: 1.0 In-Reply-To: <20131029134346.9e5873bae3630e9a69891773@linux-foundation.org> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091555.GA30895@kipc2.localdomain> <20131029203050.GE9568@quack.suse.cz> <20131029134346.9e5873bae3630e9a69891773@linux-foundation.org> Date: Tue, 29 Oct 2013 14:36:43 -0700 X-Google-Sender-Auth: v3G6IltcXgij3bCjsEK_1LQVhjo Message-ID: Subject: Re: Disabling in-memory write cache for x86-64 in Linux II From: Linus Torvalds To: Andrew Morton Cc: Jan Kara , Karl Kiniger , "Artem S. Tashkinov" , Wu Fengguang , Linux Kernel Mailing List Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 29, 2013 at 1:43 PM, Andrew Morton wrote: > On Tue, 29 Oct 2013 21:30:50 +0100 Jan Kara wrote: > >> Andrew has queued up a patch series from Maxim Patlasov which removes this >> caveat but currently we don't have a way admin can switch that from >> userspace. But I'd like to have that tunable from userspace exactly for the >> cases as you describe below. > > This? > > mm/page-writeback.c: add strictlimit feature > > That's already in mainline, for 3.12. Nothing currently actually *sets* the BDI_CAP_STRICTLIMIT flag, though. So it's a potential fix, but it's certainly not a fix now. 
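Until something sets that flag, the only per-device knob mentioned in this subthread is the bdi max_ratio file in sysfs, with the caveat Jan describes. A minimal sketch of setting it follows. The helper name set_bdi_max_ratio, the device name sdX in the usage line, the 0..100 range check and the optional sysfs-root argument (there only so the function can be tried against a scratch directory) are assumptions, not something from the thread.

```shell
#!/bin/sh
# set_bdi_max_ratio DEV PERCENT [SYSFS_ROOT]
# Hypothetical helper: write PERCENT to <root>/block/DEV/bdi/max_ratio.
# SYSFS_ROOT defaults to /sys; pass a scratch directory to experiment.
set_bdi_max_ratio() {
    dev=$1
    pct=$2
    root=${3:-/sys}
    knob="$root/block/$dev/bdi/max_ratio"

    # Accept only plain non-negative integers.
    case $pct in
        ''|*[!0-9]*) echo "percent must be an integer" >&2; return 1 ;;
    esac
    # Range check chosen for this sketch.
    [ "$pct" -le 100 ] || { echo "percent must be 0..100" >&2; return 1; }
    [ -w "$knob" ] || { echo "cannot write $knob" >&2; return 1; }

    echo "$pct" > "$knob"
}

# Intended use as root, with sdX standing in for the slow device:
#   set_bdi_max_ratio sdX 1
```

As noted earlier in the thread, on kernels without strictlimit in effect this limit only starts to matter once the total amount of dirty data is already past the point where throttling begins.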
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751436Ab3J2WNa (ORCPT ); Tue, 29 Oct 2013 18:13:30 -0400 Received: from cantor2.suse.de ([195.135.220.15]:52423 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750974Ab3J2WN3 (ORCPT ); Tue, 29 Oct 2013 18:13:29 -0400 Date: Tue, 29 Oct 2013 23:13:24 +0100 From: Jan Kara To: Linus Torvalds Cc: Jan Kara , Andrew Morton , "Theodore Ts'o" , "Artem S. Tashkinov" , Wu Fengguang , Linux Kernel Mailing List , Mel Gorman Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131029221324.GC12814@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org> <20131029205756.GH9568@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 29-10-13 14:33:53, Linus Torvalds wrote: > On Tue, Oct 29, 2013 at 1:57 PM, Jan Kara wrote: > > On Fri 25-10-13 10:32:16, Linus Torvalds wrote: > >> > >> It definitely doesn't work. I can trivially reproduce problems by just > >> having a cheap (==slow) USB key with an ext3 filesystem, and going a > >> git clone to it. The end result is not pretty, and that's actually not > >> even a huge amount of data. > > > > I'll try to reproduce this tomorrow so that I can have a look where > > exactly are we stuck. But in last few releases problems like this were > > caused by problems in reclaim which got fed up by seeing lots of dirty > > / under writeback pages and ended up stuck waiting for IO to finish. Mel > > has been tweaking the logic here and there but maybe it haven't got fixed > > completely. 
Mel, do you know about any outstanding issues? > > I'm not sure this has ever worked, and in the last few years the > common desktop memory size has continued to grow. > > For servers and "serious" desktops, having tons of dirty data doesn't > tend to be as much of a problem, because those environments are pretty > much defined by also having fairly good IO subsystems, and people > seldom use crappy USB devices for more than doing things like reading > pictures off them etc. And you'd not even see the problem under any > such load. > > But it's actually really easy to reproduce by just taking your average > USB key and trying to write to it. I just did it with a random ISO > image, and it's _painful_. And it's not that it's painful for doing > most other things in the background, but if you just happen to run > anything that does "sync" (and it happens in scripts), the thing just > comes to a screeching halt. For minutes. Yes, I agree that caching more than couple of seconds worth of writeback for a device isn't good. > Same obviously goes with trying to eject/unmount the media etc. > > We've had this problem before with the whole "ratio of dirty memory" > thing. It was a mistake. It made sense (and came from) back in the > days when people had 16MB or 32MB of RAM, and the concept of "let's > limit dirty memory to x% of that" was actually fairly reasonable. But > that "x%" doesn't make much sense any more. x% of 16GB (which is quite > the reasonable amount of memory for any modern desktop) is a huge > thing, and in the meantime the performance of disks have gone up a lot > (largely thanks to SSD's), but the *minimum* performance of disks > hasn't really improved all that much (largely thanks to USB ;). > > So how about we just admit that the whole "ratio" thing was a big > mistake, and tell people that if they want to set a dirty limit, they > should do so in bytes? Which we already really do, but we default to > that ratio nevertheless. 
Which is why I'd suggest we just say "the > ratio works fine up to a certain amount, and makes no sense past it". > > Why not make that "the ratio works fine up to a certain amount, and > makes no sense past it" be part of the calculations. We actually > *have* exactly that on HIGHMEM machines, where we have this > configuration option of "vm_highmem_is_dirtyable" that defaults to > off. It just doesn't trigger on nonhighmem machines (today: "64-bit"). > > So I would suggest that we just expose that "vm_highmem_is_dirtyable" > on 64-bit too, and just say that anything over 1GB is highmem. That > means that 32-bit and 64-bit environments will basically act the same, > and I think it makes the defaults a bit saner. > > Limiting the amount of dirty memory to 100MB/200MB (for "start > background writing" and "wait synchronously" respectively) even if you > happen to have 16GB of memory sounds like a good idea. Sure, it might > make some benchmarks a bit slower, but it will at least avoid the > "wait forever" symptom. And if you really have a very studly IO > subsystem, the fact that it starts writing out earlier won't really be > a problem. So I think we both realize this is only about what the default should be. There will always be people who have loads which benefit from setting dirty limits high but I agree they are a minority. The reason why we left the limits at what they are now despite them making less and less sense is that we didn't want to break user expectations. If we cap the dirty limits as you suggest, I bet we'll get some user complaints and "don't break users" policy thus tells me we shouldn't do such changes ;) Also I'm not sure capping dirty limits at 200MB is the best spot. It may be but I think we should experiment with numbers a bit to check whether we didn't miss something. > After all, there are two reasons to do delayed writes: > > - temp-files may not be written out at all.
> > Quite frankly, if you have multi-hundred-megabyte temptiles, you've > got issues Actually people do stuff like this e.g. when generating ISO images before burning them. > - coalescing writes improves throughput > > There are very much diminishing returns, and the big return is to > make sure that we write things out in a good order, which a 100MB > buffer should make more than possible. True. There is one more aspect: - transforming random writes into mostly sequential writes Different userspace programs use simple memory mapped databases which do random writes into their data files. The less you writeback these the better (at least from throughput POV). I'm not sure how large are these files together on average user desktop though but my guess would be that 100 MB *should* be enough for them. Can anyone with GNOME / KDE desktop try running with limits set this low for some time? > so I really think that it's insane to default to 1.6GB of dirty data > before you even start writing it out if you happen to have 16GB of > memory. > > And again: if your benchmark is to create a kernel tree and then > immediately delete it, and you used to do that without doing any > actual IO, then yes, the attached patch will make that go much slower. > But for that benchmark, maybe you should just set the dirty limits (in > bytes) by hand, rather than expect the default kernel values to prefer > benchmarks over sanity? > > Suggested patch attached. Comments? 
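For reference, the arithmetic behind the quoted 1.6GB and 100MB/200MB figures can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name dirty_thresholds_mb is hypothetical, the ratios are taken as the 10%/20% defaults mentioned earlier in the thread, and total RAM stands in for the kernel's real "dirtyable memory" figure, which it does not equal exactly.

```shell
#!/bin/sh
# Hypothetical helper (not a kernel interface): prints the background and
# hard dirty thresholds, in MB, for a machine with $1 MB of RAM.
# Assumptions: dirty_background_ratio=10 and dirty_ratio=20, and total RAM
# is used as a stand-in for dirtyable memory (a simplification).
# Optional $2 caps dirtyable memory at that many MB (0 or unset = no cap),
# mirroring the "anything over 1GB is highmem" suggestion.
dirty_thresholds_mb() {
    ram_mb=$1
    cap_mb=${2:-0}
    dirtyable=$ram_mb
    if [ "$cap_mb" -gt 0 ] && [ "$dirtyable" -gt "$cap_mb" ]; then
        dirtyable=$cap_mb
    fi
    echo "$((dirtyable * 10 / 100)) $((dirtyable * 20 / 100))"
}

dirty_thresholds_mb 16384        # 16GB, no cap: prints "1638 3276"
dirty_thresholds_mb 16384 1024   # 16GB, 1GB cap: prints "102 204"
```

With the cap, a 16GB machine ends up at roughly the same limits as a 32-bit kernel would, which is the point of the suggestion being discussed.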
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751792Ab3J2WmL (ORCPT ); Tue, 29 Oct 2013 18:42:11 -0400 Received: from mail-ve0-f177.google.com ([209.85.128.177]:47421 "EHLO mail-ve0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751125Ab3J2WmJ (ORCPT ); Tue, 29 Oct 2013 18:42:09 -0400 MIME-Version: 1.0 In-Reply-To: <20131029221324.GC12814@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org> <20131029205756.GH9568@quack.suse.cz> <20131029221324.GC12814@quack.suse.cz> Date: Tue, 29 Oct 2013 15:42:08 -0700 X-Google-Sender-Auth: Eci0rvjNkDc35AtshZfxGjRHvl8 Message-ID: Subject: Re: Disabling in-memory write cache for x86-64 in Linux II From: Linus Torvalds To: Jan Kara Cc: Andrew Morton , "Theodore Ts'o" , "Artem S. Tashkinov" , Wu Fengguang , Linux Kernel Mailing List , Mel Gorman , Maxim Patlasov Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 29, 2013 at 3:13 PM, Jan Kara wrote: > > So I think we both realize this is only about what the default should be. Yes. Most people will use the defaults, but there will always be people who tune things for particular loads. In fact, I think we have gone much too far in saying "all policy in user space", because the fact is, user space isn't very good at policy. Especially not at reacting to complex situations with different devices. 
From what I've seen, "policy in user space" has resulted in exactly two modes: - user space does something stupid and wrong (example: "nice -19 X" to work around some scheduler oddities) - user space does nothing at all, and the kernel people say "hey, user space _could_ set this value Xyz, so it's not our problem, and it's policy, so we shouldn't touch it". I think we in the kernel should say "our defaults should be what everybody sane can use, and they should work fine on average". With "policy in user space" being for crazy people that do really odd things and can really spare the time to tune for their particular issue. So the "policy in user space" should be about *overriding* kernel policy choices, not about the kernel never having them. And this kind of "you can have many different devices and they act quite differently" is a good example of something complicated that user space really doesn't have a great model for. And we actually have much better possible information in the kernel than user space ever is likely to have. > Also I'm not sure capping dirty limits at 200MB is the best spot. It may be > but I think we should experiment with numbers a bit to check whether we > didn't miss something. Sure. That said, the patch I suggested basically makes the numbers be at least roughly comparable across different architectures. So it's been at least somewhat tested, even if 16GB x86-32 machines are hopefully pretty rare (but I hear about people installing 32-bit on modern machines much too often). >> - temp-files may not be written out at all. >> >> Quite frankly, if you have multi-hundred-megabyte temptiles, you've >> got issues > Actually people do stuff like this e.g. when generating ISO images before > burning them. Yes, but then the temp-file is long-lived enough that it *will* hit the disk anyway. So it's only the "create temporary file and pretty much immediately delete it" case that changes behavior (ie compiler assembly files etc). 
If the temp-file is for something like burning an ISO image, the burning part is slow enough that the temp-file will hit the disk regardless of when we start writing it. > There is one more aspect: > - transforming random writes into mostly sequential writes Sure. And I think that if you have a big database, that's when you do end up tweaking the dirty limits. That said, I'd certainly like it even *more* if the limits really were per-BDI, and the global limit was in addition to the per-bdi ones. Because when you have a USB device that gets maybe 10MB/s on contiguous writes, and 100kB/s on random 4k writes, I think it would make more sense to make the "start writeout" limits be 1MB/2MB, not 100MB/200MB. So my patch doesn't even take it far enough, it's just a "let's not be ridiculous". The per-BDI limits don't seem quite ready for prime time yet, though. Even the new "strict" limits seems to be more about "trusted filesystems" than about really sane writeback limits. Fengguang, comments? (And I added Maxim to the cc, since he's the author of the strict mode, and while it is currently limited to FUSE, he did mention USB storage in the commit message..). 
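To put rough numbers on the USB example above, here is a small sketch of how long a given amount of dirty data takes to drain at the quoted device speeds. The helper name drain_seconds is hypothetical, and the rates are the 10MB/s and 100kB/s figures from the text, assumed to be sustained; real devices will vary.

```shell
#!/bin/sh
# Hypothetical helper: seconds needed to write back $1 MB of dirty data
# at a sustained device rate of $2 kB/s (integer arithmetic, rounded down).
drain_seconds() {
    echo $(( $1 * 1024 / $2 ))
}

drain_seconds 200 10240   # 200MB limit at 10MB/s sequential: prints 20
drain_seconds 200 100     # 200MB limit at 100kB/s random 4k: prints 2048
drain_seconds 2 100       # 2MB limit at 100kB/s random 4k: prints 20
```

Under these assumptions a 200MB limit can still mean over half an hour of pending writeback in the random-write case, while a 2MB limit keeps it to about twenty seconds.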
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753751Ab3J3KHM (ORCPT ); Wed, 30 Oct 2013 06:07:12 -0400 Received: from smtprelay0030.b.hostedemail.com ([64.98.42.30]:60166 "EHLO smtprelay.b.hostedemail.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753721Ab3J3KHK (ORCPT ); Wed, 30 Oct 2013 06:07:10 -0400 X-Session-Marker: 742E617274656D406C79636F732E636F6D X-Spam-Summary: 2,0,0,,d41d8cd98f00b204,t.artem@lycos.com,:::::::::::::::::,RULES_HIT:41:152:355:379:582:599:966:968:988:989:1152:1260:1277:1311:1313:1314:1345:1373:1437:1515:1516:1518:1534:1541:1593:1594:1711:1730:1747:1777:1792:2196:2198:2199:2200:2393:2553:2559:2562:2693:3138:3139:3140:3141:3142:3353:3622:3865:3866:3867:3868:3870:3871:3872:3874:4250:4321:4385:5007:6119:6261:7875:7903:10004:10400:10450:10455:10848:11026:11232:11658:11914:12043:12438:12517:12519:12740:13069:13161:13229:13311:13357:13618:19904:19999,0,RBL:none,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0 X-HE-Tag: cart46_5b7963ac93e24 X-Filterd-Recvd-Size: 2801 Date: Wed, 30 Oct 2013 10:07:08 +0000 (UTC) From: "Artem S. 
Tashkinov" To: jack@suse.cz Cc: tytso@mit.edu, fengguang.wu@intel.com, torvalds@linux-foundation.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, diegocg@gmail.com, david@lang.hm, neilb@suse.de Message-ID: <1532891663.73423.1383127628582.JavaMail.mail@webmail14> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025230545.GB31280@localhost> <20131025233753.GD19823@thunk.org><20131029204052.GF9568@quack.suse.cz> Subject: Re: Disabling in-memory write cache for x86-64 in Linux II MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [46.146.117.87] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Oct 30, 2013 02:41:01 AM, Jack wrote: On Fri 25-10-13 19:37:53, Ted Tso wrote: >> Sure, although I wonder if it would be worth it calcuate some kind of >> rolling average of the write bandwidth while we are doing writeback, >> so if it turns out we got unlucky with the contents of the first 100MB >> of dirty data (it could be either highly random or highly sequential) >> the we'll eventually correct to the right level. > We already do average measured throughput over a longer time window and >have kind of rolling average algorithm doing some averaging. > >> This means that VM would have to keep dirty page counters for each BDI >> --- which I thought we weren't doing right now, which is why we have a >> global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I >> have cause and effect reversed? :-) > And we do currently keep the number of dirty & under writeback pages per >BDI. We have global limits because mm wants to limit the total number of dirty >pages (as those are harder to free). 
It doesn't care as much to which device >these pages belong (although it probably should care a bit more because >there are huge differences between how quickly can different devices get rid >of dirty pages). This might sound like an absolutely stupid question which makes no sense at all, so I want to apologize for it in advance, but since the Linux kernel lacks revoke(), does that mean that dirty buffers will always occupy the kernel memory if I for instance remove my USB stick before the kernel has had the time to flush these buffers? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753790Ab3J3MB6 (ORCPT ); Wed, 30 Oct 2013 08:01:58 -0400 Received: from cantor2.suse.de ([195.135.220.15]:46544 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752494Ab3J3MB5 (ORCPT ); Wed, 30 Oct 2013 08:01:57 -0400 Date: Wed, 30 Oct 2013 12:01:52 +0000 From: Mel Gorman To: Jan Kara Cc: Linus Torvalds , Andrew Morton , "Theodore Ts'o" , "Artem S. 
Tashkinov" , Wu Fengguang , Linux Kernel Mailing List Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131030120152.GM2400@suse.de> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org> <20131029205756.GH9568@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20131029205756.GH9568@quack.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 29, 2013 at 09:57:56PM +0100, Jan Kara wrote: > On Fri 25-10-13 10:32:16, Linus Torvalds wrote: > > On Fri, Oct 25, 2013 at 10:29 AM, Andrew Morton > > wrote: > > > > > > Apparently all this stuff isn't working as desired (and perhaps as designed) > > > in this case. Will take a look after a return to normalcy ;) > > > > It definitely doesn't work. I can trivially reproduce problems by just > > having a cheap (==slow) USB key with an ext3 filesystem, and going a > > git clone to it. The end result is not pretty, and that's actually not > > even a huge amount of data. > > I'll try to reproduce this tomorrow so that I can have a look where > exactly are we stuck. But in last few releases problems like this were > caused by problems in reclaim which got fed up by seeing lots of dirty > / under writeback pages and ended up stuck waiting for IO to finish. Mel > has been tweaking the logic here and there but maybe it haven't got fixed > completely. Mel, do you know about any outstanding issues? > Yeah, there are still a few. 
The work in that general area dealt with such problems as dirty pages reaching the end of the LRU (excessive CPU usage), calling wait_on_page_writeback from reclaim context (random processes stalling even though there was not much memory pressure), and desktop applications stalling randomly (second quick write stalling on stable writeback). The systemtap script caught those types of areas and I believe they are fixed up. There are still problems though. If all dirty pages were backed by a slow device then dirty limiting is still eventually going to cause stalls in dirty page balancing. If there is a global sync then the shit can really hit the fan if it all gets stuck waiting on something like journal space. Applications that are very fsync happy can still get stalled for long periods of time behind slower writers as they wait for the IO to flush. When all this happens there may still be spikes in CPU usage if it scans the dirty pages excessively without sleeping. Consciously or unconsciously my desktop applications generally do not fall foul of these problems. At least one of the desktop environments can stall because it calls fsync on history and preference files constantly but I cannot remember which one or if it has been fixed since. I did have a problem with gnome-terminal as it depended on a library that implemented scrollback buffering by writing single-line files to /tmp and then truncating them which would "freeze" the terminal under IO. I now use tmpfs for /tmp to get around this. When I'm writing to USB sticks I think it tends to stay between the point where background writing starts and dirty throttling occurs so I rarely notice any major problems. I'm probably unconsciously avoiding doing any write-heavy work while a USB stick is plugged in. Addressing this goes back to tuning dirty ratio or replacing it. Tuning it always falls foul of "works for one person and not another" and fails utterly when there is storage with different speeds.
We talked about this a few months ago but I still suspect that we will have to bite the bullet and tune based on "do not dirty more data than it takes N seconds to writeback" using per-bdi writeback estimations. It's just not that trivial to implement as the writeback speeds can change for a variety of reasons (multiple IO sources, random vs sequential etc). Hence at one point we think we are within our target window and then get it completely wrong. Dirty ratio is a hard guarantee, dirty writeback estimation is best-effort that will go wrong in some cases. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752351Ab3J3PMb (ORCPT ); Wed, 30 Oct 2013 11:12:31 -0400 Received: from cantor2.suse.de ([195.135.220.15]:54917 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751088Ab3J3PMa (ORCPT ); Wed, 30 Oct 2013 11:12:30 -0400 Date: Wed, 30 Oct 2013 16:12:25 +0100 From: Jan Kara To: "Artem S. 
Tashkinov" Cc: jack@suse.cz, tytso@mit.edu, fengguang.wu@intel.com, torvalds@linux-foundation.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, diegocg@gmail.com, david@lang.hm, neilb@suse.de Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131030151225.GA15202@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025230545.GB31280@localhost> <20131025233753.GD19823@thunk.org> <20131029204052.GF9568@quack.suse.cz> <1532891663.73423.1383127628582.JavaMail.mail@webmail14> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1532891663.73423.1383127628582.JavaMail.mail@webmail14> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed 30-10-13 10:07:08, Artem S. Tashkinov wrote: > Oct 30, 2013 02:41:01 AM, Jack wrote: > On Fri 25-10-13 19:37:53, Ted Tso wrote: > >> Sure, although I wonder if it would be worth it calcuate some kind of > >> rolling average of the write bandwidth while we are doing writeback, > >> so if it turns out we got unlucky with the contents of the first 100MB > >> of dirty data (it could be either highly random or highly sequential) > >> the we'll eventually correct to the right level. > > We already do average measured throughput over a longer time window and > >have kind of rolling average algorithm doing some averaging. > > > >> This means that VM would have to keep dirty page counters for each BDI > >> --- which I thought we weren't doing right now, which is why we have a > >> global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I > >> have cause and effect reversed? :-) > > And we do currently keep the number of dirty & under writeback pages per > >BDI. 
We have global limits because mm wants to limit the total number of dirty > >pages (as those are harder to free). It doesn't care as much to which device > >these pages belong (although it probably should care a bit more because > >there are huge differences between how quickly can different devices get rid > >of dirty pages). > > This might sound like an absolutely stupid question which makes no sense at > all, so I want to apologize for it in advance, but since the Linux kernel lacks > revoke(), does that mean that dirty buffers will always occupy the kernel memory > if I for instance remove my USB stick before the kernel has had the time to flush > these buffers? That's actually a good question. And the answer is that currently when we hit EIO while writing out dirty data, we just throw away that data. Not an ideal solution for some cases but it solves the problem with unwriteable data... Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754988Ab3JaOZv (ORCPT ); Thu, 31 Oct 2013 10:25:51 -0400 Received: from exprod5og118.obsmtp.com ([64.18.0.160]:58538 "EHLO exprod5og118.obsmtp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754961Ab3JaOZt (ORCPT ); Thu, 31 Oct 2013 10:25:49 -0400 Date: Thu, 31 Oct 2013 15:26:12 +0100 From: Karl Kiniger To: Jan Kara Cc: Linus Torvalds , "Artem S. 
Tashkinov" , Wu Fengguang , Andrew Morton , Linux Kernel Mailing List Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131031142612.GA28003@kipc2.localdomain> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091555.GA30895@kipc2.localdomain> <20131029203050.GE9568@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131029203050.GE9568@quack.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) X-GEHealthcare-MailScanner: Found to be clean X-GEHealthcare-MailScanner-From: karl.kiniger@med.ge.com Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 131029, Jan Kara wrote: > On Fri 25-10-13 11:15:55, Karl Kiniger wrote: > > On Fri 131025, Linus Torvalds wrote: .... > > Is it currently possible to somehow set above values per block device? > Yes, to some extent. You can set /sys/block//bdi/max_ratio to > the maximum proportion the device's dirty data can take from the total > amount. The caveat currently is that this setting only takes effect after > we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in > total because that is an amount of dirty data when we start to throttle > processes. So if the device you'd like to limit is the only one which is > currently written to, the limiting doesn't have a big effect. Thanks for the info - thats was I am looking for. You are right that the limiting doesn't have a big effect right now: on my 4x speed DVD+RW on /dev/sr0, x86_64, 4GB, Fedora19: max_ratio set to 100 - about 500MB buffered, sync time 2:10 min. max_ratio set to 1 - about 330MB buffered, sync time 1:23 min. ... way too much buffering. (measured with strace -tt -ewrite dd if=/dev/zero of=bigfile bs=1M count=1000 by looking at the timestamps). Karl .... 
Honza > -- > Jan Kara > SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753728Ab3KAO0H (ORCPT ); Fri, 1 Nov 2013 10:26:07 -0400 Received: from relay.parallels.com ([195.214.232.42]:45718 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751937Ab3KAO0F (ORCPT ); Fri, 1 Nov 2013 10:26:05 -0400 Subject: Re: Disabling in-memory write cache for x86-64 in Linux II To: karl.kiniger@med.ge.com From: Maxim Patlasov Cc: jack@suse.cz, linux-kernel@vger.kernel.org, t.artem@lycos.com, mgorman@suse.de, tytso@mit.edu, akpm@linux-foundation.org, fengguang.wu@intel.com, torvalds@linux-foundation.org, mpatlasov@parallels.com Date: Fri, 01 Nov 2013 18:25:56 +0400 Message-ID: <20131101142426.1065.25534.stgit@dhcp-10-30-17-2.sw.ru> In-Reply-To: <20131031142612.GA28003@kipc2.localdomain> References: <20131031142612.GA28003@kipc2.localdomain> User-Agent: StGit/0.16 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 31-10-13 14:26:12, Karl Kiniger wrote: > On Tue 131029, Jan Kara wrote: > > On Fri 25-10-13 11:15:55, Karl Kiniger wrote: > > > On Fri 131025, Linus Torvalds wrote: > .... > > > Is it currently possible to somehow set above values per block device? > > Yes, to some extent. You can set /sys/block//bdi/max_ratio to > > the maximum proportion the device's dirty data can take from the total > > amount. The caveat currently is that this setting only takes effect after > > we have more than (dirty_background_ratio + dirty_ratio)/2 dirty data in > > total because that is an amount of dirty data when we start to throttle > > processes. So if the device you'd like to limit is the only one which is > > currently written to, the limiting doesn't have a big effect. > > Thanks for the info - thats was I am looking for. 
> > You are right that the limiting doesn't have a big effect right now: > > on my 4x speed DVD+RW on /dev/sr0, x86_64, 4GB, > Fedora19: > > max_ratio set to 100 - about 500MB buffered, sync time 2:10 min. > max_ratio set to 1 - about 330MB buffered, sync time 1:23 min. > > ... way too much buffering. The "strictlimit" feature should fit your and Artem's needs quite well. The feature enforces per-BDI dirty limits even if the global dirty limit is not reached yet. I'll send a patch adding a knob to turn it on/off. Thanks, Maxim From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752061Ab3KAObq (ORCPT ); Fri, 1 Nov 2013 10:31:46 -0400 Received: from relay.parallels.com ([195.214.232.42]:46418 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751205Ab3KAObp (ORCPT ); Fri, 1 Nov 2013 10:31:45 -0400 Subject: [PATCH] mm: add strictlimit knob To: karl.kiniger@med.ge.com From: Maxim Patlasov Cc: jack@suse.cz, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, tytso@mit.edu, akpm@linux-foundation.org, fengguang.wu@intel.com, torvalds@linux-foundation.org, mpatlasov@parallels.com Date: Fri, 01 Nov 2013 18:31:40 +0400 Message-ID: <20131101142941.1161.40314.stgit@dhcp-10-30-17-2.sw.ru> In-Reply-To: <20131031142612.GA28003@kipc2.localdomain> References: <20131031142612.GA28003@kipc2.localdomain> User-Agent: StGit/0.16 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org "strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE which sets bdi max_ratio to 1% by default: http://article.gmane.org/gmane.linux.kernel.mm/105809 However the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW.
The patch adds a knob to enable the feature: echo 1 > /sys/class/bdi/X:Y/strictlimit Being enabled, the feature enforces bdi max_ratio limit even if global (10%) dirty limit is not reached. Of course, the effect is not visible until max_ratio is decreased to some reasonable value. Signed-off-by: Maxim Patlasov --- mm/backing-dev.c | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ce682f7..4ee1d64 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev, } static DEVICE_ATTR_RO(stable_pages_required); +static ssize_t strictlimit_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned int val; + ssize_t ret; + + ret = kstrtouint(buf, 10, &val); + if (ret < 0) + return ret; + + switch (val) { + case 0: + bdi->capabilities &= ~BDI_CAP_STRICTLIMIT; + break; + case 1: + bdi->capabilities |= BDI_CAP_STRICTLIMIT; + break; + default: + return -EINVAL; + } + + return count; +} +static ssize_t strictlimit_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + + return snprintf(page, PAGE_SIZE-1, "%d\n", + !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); +} +static DEVICE_ATTR_RW(strictlimit); + static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_max_ratio.attr, &dev_attr_stable_pages_required.attr, + &dev_attr_strictlimit.attr, NULL, }; ATTRIBUTE_GROUPS(bdi_dev); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754843Ab3KARZ0 (ORCPT ); Fri, 1 Nov 2013 13:25:26 -0400 Received: from mga02.intel.com ([134.134.136.20]:43567 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S1754198Ab3KARZZ (ORCPT ); Fri, 1 Nov 2013 13:25:25 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.93,618,1378882800"; d="scan'208";a="420961729" Date: Sat, 2 Nov 2013 01:22:22 +0800 From: Fengguang Wu To: Linus Torvalds Cc: Jan Kara , Andrew Morton , "Theodore Ts'o" , "Artem S. Tashkinov" , Linux Kernel Mailing List , Mel Gorman , Maxim Patlasov Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131101172222.GA19478@localhost> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1814253454.3449.1382689853825.JavaMail.mail@webmail07> <20131025091842.GA28681@thunk.org> <20131025022937.12623dcd.akpm@linux-foundation.org> <20131029205756.GH9568@quack.suse.cz> <20131029221324.GC12814@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org // Sorry for the late response! I'm in marriage leave these days. :) On Tue, Oct 29, 2013 at 03:42:08PM -0700, Linus Torvalds wrote: > On Tue, Oct 29, 2013 at 3:13 PM, Jan Kara wrote: > > > > So I think we both realize this is only about what the default should be. > > Yes. Most people will use the defaults, but there will always be > people who tune things for particular loads. > > In fact, I think we have gone much too far in saying "all policy in > user space", because the fact is, user space isn't very good at > policy. Especially not at reacting to complex situations with > different devices. From what I've seen, "policy in user space" has > resulted in exactly two modes: > > - user space does something stupid and wrong (example: "nice -19 X" > to work around some scheduler oddities) > > - user space does nothing at all, and the kernel people say "hey, > user space _could_ set this value Xyz, so it's not our problem, and > it's policy, so we shouldn't touch it". 
>
> I think we in the kernel should say "our defaults should be what
> everybody sane can use, and they should work fine on average". With
> "policy in user space" being for crazy people that do really odd
> things and can really spare the time to tune for their particular
> issue.
>
> So the "policy in user space" should be about *overriding* kernel
> policy choices, not about the kernel never having them.

Agreed totally. The kernel defaults had better be geared to the
typical use case of the majority of users, unless that leads to insane
behaviors in some less frequent but still relevant use cases.

> And this kind of "you can have many different devices and they act
> quite differently" is a good example of something complicated that
> user space really doesn't have a great model for. And we actually have
> much better possible information in the kernel than user space ever is
> likely to have.
>
> > Also I'm not sure capping dirty limits at 200MB is the best spot. It may be
> > but I think we should experiment with numbers a bit to check whether we
> > didn't miss something.
>
> Sure. That said, the patch I suggested basically makes the numbers be
> at least roughly comparable across different architectures. So it's
> been at least somewhat tested, even if 16GB x86-32 machines are
> hopefully pretty rare (but I hear about people installing 32-bit on
> modern machines much too often).

Yeah, it's interesting that the new policy rule actually makes x86_64
behave more consistently with i386, and hence has been reasonably tested.

> >> - temp-files may not be written out at all.
> >>
> >> Quite frankly, if you have multi-hundred-megabyte tempfiles, you've
> >> got issues
> > Actually people do stuff like this e.g. when generating ISO images before
> > burning them.
>
> Yes, but then the temp-file is long-lived enough that it *will* hit
> the disk anyway.
> So it's only the "create temporary file and pretty
> much immediately delete it" case that changes behavior (ie compiler
> assembly files etc).
>
> If the temp-file is for something like burning an ISO image, the
> burning part is slow enough that the temp-file will hit the disk
> regardless of when we start writing it.

The temp-file IO avoidance is an optimization, not a guarantee. If a
user wants to avoid IO seriously, he will probably use tmpfs and
disable swap.

So if we have to make some trade-offs in the optimization, I agree that
we should optimize more towards the "large copies to USB stick" use case.

The alternative solution, per-bdi dirty thresholds, could eliminate
the need to make such trade-offs. So it's worth looking at the two
solutions side by side.

> > There is one more aspect:
> > - transforming random writes into mostly sequential writes
>
> Sure. And I think that if you have a big database, that's when you do
> end up tweaking the dirty limits.

Sure. In general, whenever we have to make some tradeoffs, it's
probably better to "sacrifice" the embedded and super computing worlds
much more than the desktop. Because in the former areas, people tend
to have the skill and mind set to do customizations and optimizations.

I wonder if some hand-held devices will set dirty_background_bytes to
0 for better data safety.

> That said, I'd certainly like it even *more* if the limits really were
> per-BDI, and the global limit was in addition to the per-bdi ones.
> Because when you have a USB device that gets maybe 10MB/s on
> contiguous writes, and 100kB/s on random 4k writes, I think it would
> make more sense to make the "start writeout" limits be 1MB/2MB, not
> 100MB/200MB. So my patch doesn't even take it far enough, it's just a
> "let's not be ridiculous". The per-BDI limits don't seem quite ready
> for prime time yet, though. Even the new "strict" limits seems to be
> more about "trusted filesystems" than about really sane writeback
> limits.
>
> Fengguang, comments?

Basically A) lowering the global dirty limit is a reasonable tradeoff,
and B) the time based per-bdi dirty limits seem like the ultimate
solution that could offer the sane defaults to your heart's content.

Since both will be user interface (including semantic) changes, we
have to be careful. It's obvious that if (B) can ever be implemented
properly and made mature quickly, it would be the best choice and
would eliminate the need to do (A). But as Mel said in the other email,
(B) is not that easy to implement...

> (And I added Maxim to the cc, since he's the author of the strict
> mode, and while it is currently limited to FUSE, he did mention USB
> storage in the commit message..).

The *bytes* based per-bdi limits are relatively easy. It's only a
question of code maturity. Once the interface is exported to user
space, we can guarantee the exact limit to the user.

However for *time* based per-bdi limits, there will always be
estimation errors as summarized in Mel's email. It offers sane
semantics to the user, but may not always work as expected, since
writeback bandwidth may change over time depending on the workload.

It feels much better to have some hard guarantee. So even when the
time based limits are implemented, we'll probably still want to
disable the slippery time/bandwidth estimation when the user is able
to provide some bytes based per-bdi limits: hey, I don't care about
random writes and other subtle situations. I know this disk's max write
bandwidth is 100MB/s and it's a good rule of thumb. Let's simply set
its dirty limit to 100MB.

Or shall we do the simpler and less volatile "max write bandwidth"
estimation and use it for auto per-bdi dirty limits?
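To make the bytes-versus-time distinction concrete, here is a minimal C
sketch. It is not kernel code, and the helper names bdi_limit_bytes and
drain_seconds are hypothetical. It derives a bytes based limit from a
bandwidth stated by the user, as in the 100MB/s example above, and then
computes how long flushing that fixed amount takes at whatever bandwidth
the device really delivers. The 1MB/s figure used in the note below is an
assumed value for a random-write workload, not a measurement from this
thread.

```c
#include <stdint.h>

/*
 * Bytes based limit: allow at most `seconds` worth of writeback at the
 * bandwidth the user states for the device. Hypothetical helper, not a
 * kernel interface.
 */
uint64_t bdi_limit_bytes(uint64_t stated_bytes_per_sec, uint64_t seconds)
{
	return stated_bytes_per_sec * seconds;
}

/*
 * Seconds needed to flush `dirty_bytes` at the bandwidth actually
 * achieved, rounded up. Returns UINT64_MAX when the device makes no
 * progress at all.
 */
uint64_t drain_seconds(uint64_t dirty_bytes, uint64_t actual_bytes_per_sec)
{
	if (actual_bytes_per_sec == 0)
		return UINT64_MAX;
	return (dirty_bytes + actual_bytes_per_sec - 1) / actual_bytes_per_sec;
}
```

With a stated 100MB/s and a one-second budget, bdi_limit_bytes() gives
100MB. drain_seconds() then reports 1 second at the stated rate but 100
seconds at the assumed 1MB/s: the byte limit stays exact while the time it
represents drifts with the workload, which is the estimation error a time
based limit would have to track.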
Thanks,
Fengguang

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752249Ab3KDMTg (ORCPT ); Mon, 4 Nov 2013 07:19:36 -0500
Received: from atrey.karlin.mff.cuni.cz ([195.113.26.193]:48348
	"EHLO atrey.karlin.mff.cuni.cz" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1750852Ab3KDMTf (ORCPT );
	Mon, 4 Nov 2013 07:19:35 -0500
Date: Mon, 4 Nov 2013 13:19:33 +0100
From: Pavel Machek
To: Fengguang Wu
Cc: Linus Torvalds , Jan Kara , Andrew Morton , "Theodore Ts'o" ,
	"Artem S. Tashkinov" , Linux Kernel Mailing List , Mel Gorman ,
	Maxim Patlasov
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131104121933.GA24407@amd.pavel.ucw.cz>
References: <1814253454.3449.1382689853825.JavaMail.mail@webmail07>
	<20131025091842.GA28681@thunk.org>
	<20131025022937.12623dcd.akpm@linux-foundation.org>
	<20131029205756.GH9568@quack.suse.cz>
	<20131029221324.GC12814@quack.suse.cz>
	<20131101172222.GA19478@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20131101172222.GA19478@localhost>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi!

> > Yes, but then the temp-file is long-lived enough that it *will* hit
> > the disk anyway. So it's only the "create temporary file and pretty
> > much immediately delete it" case that changes behavior (ie compiler
> > assembly files etc).
> >
> > If the temp-file is for something like burning an ISO image, the
> > burning part is slow enough that the temp-file will hit the disk
> > regardless of when we start writing it.
>
> The temp-file IO avoidance is an optimization not a guarantee. If a
> user wants to avoid IO seriously, he will probably use tmpfs and
> disable swap.

No, sorry, they can't. Assuming an ISO image fits in tmpfs would be cruel.
> So if we have to do some trade-offs in the optimization, I agree that
> we should optimize more towards the "large copies to USB stick" use case.
>
> The alternative solution, per-bdi dirty thresholds, could eliminate
> the need to do such trade-offs. So it's worth looking at the two
> solutions side by side.

Yes, please.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752692Ab3KDM0Q (ORCPT ); Mon, 4 Nov 2013 07:26:16 -0500
Received: from atrey.karlin.mff.cuni.cz ([195.113.26.193]:48566
	"EHLO atrey.karlin.mff.cuni.cz" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1751949Ab3KDM0P (ORCPT );
	Mon, 4 Nov 2013 07:26:15 -0500
Date: Mon, 4 Nov 2013 13:26:13 +0100
From: Pavel Machek
To: Linus Torvalds
Cc: Jan Kara , Andrew Morton , "Theodore Ts'o" , "Artem S. Tashkinov" ,
	Wu Fengguang , Linux Kernel Mailing List , Mel Gorman ,
	Maxim Patlasov
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131104122613.GB24407@amd.pavel.ucw.cz>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
	<1814253454.3449.1382689853825.JavaMail.mail@webmail07>
	<20131025091842.GA28681@thunk.org>
	<20131025022937.12623dcd.akpm@linux-foundation.org>
	<20131029205756.GH9568@quack.suse.cz>
	<20131029221324.GC12814@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi!

> >> - temp-files may not be written out at all.
> >>
> >> Quite frankly, if you have multi-hundred-megabyte tempfiles, you've
> >> got issues
> > Actually people do stuff like this e.g. when generating ISO images before
> > burning them.
>
> Yes, but then the temp-file is long-lived enough that it *will* hit
> the disk anyway. So it's only the "create temporary file and pretty
> much immediately delete it" case that changes behavior (ie compiler
> assembly files etc).
>
> If the temp-file is for something like burning an ISO image, the
> burning part is slow enough that the temp-file will hit the disk
> regardless of when we start writing it.

It will hit the disk, but with the proposed change, burning will still
be slower.

Before:

create 700MB iso
burn the CD, at the same time writing the iso to disk

After:

create 700MB iso and write most of it to disk
burn the CD, writing the rest.

But yes, limiting dirty amounts is a good idea.

> That said, I'd certainly like it even *more* if the limits really were
> per-BDI, and the global limit was in addition to the per-bdi ones.
> Because when you have a USB device that gets maybe 10MB/s on
> contiguous writes, and 100kB/s on random 4k writes, I think it would
> make more sense to make the "start writeout" limits be 1MB/2MB, not

Actually I believe I have seen 10kB/sec on an SD card... I would expect
that from USB sticks, too.

And yes, there are actually real problems with this at least on
N900. You do apt-get install . apt internally does fsyncs. It results
in big enough latencies that watchdogs kick in and kill the machine.

http://pavelmachek.livejournal.com/117089.html

People are doing

echo 3 > /proc/sys/vm/dirty_ratio
echo 3 > /proc/sys/vm/dirty_background_ratio
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
echo 100 > /proc/sys/vm/dirty_expire_centisecs
echo 4096 > /proc/sys/vm/min_free_kbytes
echo 50 > /proc/sys/vm/swappiness
echo 200 > /proc/sys/vm/vfs_cache_pressure
echo 8 > /proc/sys/vm/page-cluster
echo 4 > /sys/block/mmcblk0/queue/nr_requests
echo 4 > /sys/block/mmcblk1/queue/nr_requests

.. to avoid it, but IIRC it only makes the watchdog reset less likely
:-(.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753014Ab3KDWBH (ORCPT ); Mon, 4 Nov 2013 17:01:07 -0500
Received: from mail.linuxfoundation.org ([140.211.169.12]:48411
	"EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1751056Ab3KDWBG (ORCPT );
	Mon, 4 Nov 2013 17:01:06 -0500
Date: Mon, 4 Nov 2013 14:01:04 -0800
From: Andrew Morton
To: Maxim Patlasov
Cc: karl.kiniger@med.ge.com, jack@suse.cz, linux-kernel@vger.kernel.org,
	t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, tytso@mit.edu,
	fengguang.wu@intel.com, torvalds@linux-foundation.org,
	mpatlasov@parallels.com
Subject: Re: [PATCH] mm: add strictlimit knob
Message-Id: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
In-Reply-To: <20131101142941.1161.40314.stgit@dhcp-10-30-17-2.sw.ru>
References: <20131031142612.GA28003@kipc2.localdomain>
	<20131101142941.1161.40314.stgit@dhcp-10-30-17-2.sw.ru>
X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov wrote:

> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
> FUSE which sets bdi max_ratio to 1% by default:
>
> http://article.gmane.org/gmane.linux.kernel.mm/105809
>
> However the feature can be useful for other relatively slow or untrusted
> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
> feature:
>
> echo 1 > /sys/class/bdi/X:Y/strictlimit
>
> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
> dirty limit is not reached.
> Of course, the effect is not visible until
> max_ratio is decreased to some reasonable value.

I suggest replacing "max_ratio" here with the much more informative
"/sys/class/bdi/X:Y/max_ratio".

Also, Documentation/ABI/testing/sysfs-class-bdi will need an update
please.

> mm/backing-dev.c | 35 +++++++++++++++++++++++++++++++++++
> 1 file changed, 35 insertions(+)
>

I'm not really sure what to make of the patch. I assume you tested it
and observed some effect. Could you please describe the test setup and
the effects in some detail?

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753837Ab3KEAuT (ORCPT ); Mon, 4 Nov 2013 19:50:19 -0500
Received: from mail-pb0-f52.google.com ([209.85.160.52]:60881
	"EHLO mail-pb0-f52.google.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1753309Ab3KEAuR
	convert rfc822-to-8bit (ORCPT ); Mon, 4 Nov 2013 19:50:17 -0500
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Mac OS X Mail 7.0 \(1816\))
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
From: Andreas Dilger
In-Reply-To:
Date: Mon, 4 Nov 2013 17:50:13 -0700
Cc: Wu Fengguang , Linus Torvalds , Andrew Morton ,
	Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm
Content-Transfer-Encoding: 8BIT
Message-Id: <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
To: "Artem S. Tashkinov"
X-Mailer: Apple Mail (2.1816)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Oct 25, 2013, at 2:18 AM, Linus Torvalds wrote:

> On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote:
>>
>> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11
>> kernel built for the i686 (with PAE) and x86-64 architectures.
>> What’s
>> really troubling me is that the x86-64 kernel has the following problem:
>>
>> When I copy large files to any storage device, be it my HDD with ext4
>> partitions or flash drive with FAT32 partitions, the kernel first
>> caches them in memory entirely then flushes them some time later
>> (quite unpredictably though) or immediately upon invoking "sync".
>
> Yeah, I think we default to a 10% "dirty background memory" (and
> allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> of dirty memory for writeout before we even start writing, and twice
> that before we start *waiting* for it.
>
> On 32-bit x86, we only count the memory in the low 1GB (really
> actually up to about 890MB), so "10% dirty" really means just about
> 90MB of buffering (and a "hard limit" of ~180MB of dirty).
>
> And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
> come from the old days of less memory (and perhaps servers that don't
> much care), and the fact that x86-32 ends up having much lower limits
> even if you end up having more memory.

I think the “delay writes for a long time” is a holdover from the
days when e.g. /tmp was on a disk and compilers had lousy IO
patterns, then they deleted the file. Today, /tmp is always in
RAM, and IMHO the “write and delete” workload tested by dbench
is not worthwhile optimizing for.

With Lustre, we’ve long taken the approach that if there is enough
dirty data on a file to make a decent write (which is around 8MB
today even for very fast storage) then there isn’t much point to
hold back for more data before starting the IO.

Any decent allocator will be able to grow allocated extents to
handle following data, or allocate a new extent. At 4-8MB extents,
even very seek-impaired media could do 400-800MB/s (likely much
faster than the underlying storage anyway).

This also avoids wasting (tens of?) seconds of idle disk bandwidth.
If the disk is already busy, then the IO will be delayed anyway.
If it is not busy, then why aggregate GB of dirty data in memory
before flushing it?

Something simple like “start writing at 16MB dirty on a single file”
would probably avoid a lot of complexity at little real-world cost.
That shouldn’t throttle dirtying memory above 16MB, but just start
writeout much earlier than it does today.

Cheers, Andreas

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754272Ab3KEEMw (ORCPT ); Mon, 4 Nov 2013 23:12:52 -0500
Received: from ipmail06.adl2.internode.on.net ([150.101.137.129]:19253
	"EHLO ipmail06.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1750732Ab3KEEMu (ORCPT );
	Mon, 4 Nov 2013 23:12:50 -0500
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Aq0GAClveFJ5LOn3/2dsb2JhbABZgweDPFe2boVFgScXdIIlAQEEASMPASMjBQsIAw4KAgIFIQICDwUlAyETGQKHYAWre5I9FoETjGMBC4EzB4JrgUMDlCuDXpIKgzoogSwBHw
Date: Tue, 5 Nov 2013 15:12:45 +1100
From: Dave Chinner
To: Andreas Dilger
Cc: "Artem S. Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton ,
	Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131105041245.GY6188@dastard>
References: <160824051.3072.1382685914055.JavaMail.mail@webmail07>
	<89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote:
>
> On Oct 25, 2013, at 2:18 AM, Linus Torvalds wrote:
> > On Fri, Oct 25, 2013 at 8:25 AM, Artem S.
> > Tashkinov wrote:
> >>
> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11
> >> kernel built for the i686 (with PAE) and x86-64 architectures. What’s
> >> really troubling me is that the x86-64 kernel has the following problem:
> >>
> >> When I copy large files to any storage device, be it my HDD with ext4
> >> partitions or flash drive with FAT32 partitions, the kernel first
> >> caches them in memory entirely then flushes them some time later
> >> (quite unpredictably though) or immediately upon invoking "sync".
> >
> > Yeah, I think we default to a 10% "dirty background memory" (and
> > allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> > of dirty memory for writeout before we even start writing, and twice
> > that before we start *waiting* for it.
> >
> > On 32-bit x86, we only count the memory in the low 1GB (really
> > actually up to about 890MB), so "10% dirty" really means just about
> > 90MB of buffering (and a "hard limit" of ~180MB of dirty).
> >
> > And that "up to 3.2GB of dirty memory" is just crazy. Our defaults
> > come from the old days of less memory (and perhaps servers that don't
> > much care), and the fact that x86-32 ends up having much lower limits
> > even if you end up having more memory.
>
> I think the “delay writes for a long time” is a holdover from the
> days when e.g. /tmp was on a disk and compilers had lousy IO
> patterns, then they deleted the file. Today, /tmp is always in
> RAM, and IMHO the “write and delete” workload tested by dbench
> is not worthwhile optimizing for.
>
> With Lustre, we’ve long taken the approach that if there is enough
> dirty data on a file to make a decent write (which is around 8MB
> today even for very fast storage) then there isn’t much point to
> hold back for more data before starting the IO.

Agreed - write-through caching is much better for high throughput
streaming data environments than write back caching that can leave
the devices unnecessarily idle.
However, most systems are not running in high-throughput streaming
data environments... :/

> Any decent allocator will be able to grow allocated extents to
> handle following data, or allocate a new extent. At 4-8MB extents,
> even very seek-impaired media could do 400-800MB/s (likely much
> faster than the underlying storage anyway).

True, but this makes the assumption that the filesystem you are
using is optimising purely for write throughput and your storage is
not seek limited on reads. That's simply not an assumption we can
allow the generic writeback code to make.

In more detail, if we simply implement "we have 8 MB of dirty pages
on a single file, write it" we can maximise write throughput by
allocating sequentially on disk for each subsequent write. The
problem with this comes when you are writing multiple files at a
time, and that leads to this pattern on disk:

ABC...ABC....ABC....ABC....

And the result is a) fragmented files b) a large number of seeks
during sequential read operations and c) filesystems that age and
degrade rapidly under workloads that concurrently write files with
different life times (i.e. due to free space fragmentation).

In some situations this is acceptable, but the performance
degradation as the filesystem ages that this sort of allocation
causes in most environments is not. I'd say that >90% of filesystems
out there would suffer accelerated aging as a result of doing
writeback in this manner by default.

> This also avoids wasting (tens of?) seconds of idle disk bandwidth.
> If the disk is already busy, then the IO will be delayed anyway.
> If it is not busy, then why aggregate GB of dirty data in memory
> before flushing it?

There are plenty of workloads out there where delaying IO for a few
seconds can result in writeback that is an order of magnitude
faster. Similarly, I've seen other workloads where the writeback
delay results in files that can be *read* orders of magnitude
faster....
> Something simple like “start writing at 16MB dirty on a single file”
> would probably avoid a lot of complexity at little real-world cost.
> That shouldn’t throttle dirtying memory above 16MB, but just start
> writeout much earlier than it does today.

That doesn't solve the "slow device, large file" problem. We can
write data into the page cache at rates of over a GB/s, so it's
irrelevant to a device that can write at 5MB/s whether we start
writeback immediately or a second later when there is 500MB of dirty
pages in memory. AFAIK, the only way to avoid that problem is to
use write-through caching for such devices - where they throttle to
the IO rate at very low levels of cached data.

Realistically, there is no "one right answer" for all combinations
of applications, filesystems and hardware, but writeback caching is
the best *general solution* we've got right now.

However, IMO users should not need to care about tuning BDI dirty
ratios or even have to understand what a BDI dirty ratio is to
select the right caching method for their devices and/or workload.
The difference between writeback and write through caching is easy
to explain and AFAICT those two modes suffice to solve the problems
being discussed here. Further, if two modes suffice to solve the
problems, then we should be able to easily define a trigger to
automatically switch modes.

/me notes that if we look at random vs sequential IO and the impact
that has on writeback duration, then it's very similar to suddenly
having a very slow device. IOWs, fadvise(RANDOM) could be used to
switch an *inode* to write through mode rather than writeback mode
to solve the problem of aggregating massive amounts of random write
IO in the page cache...

So rather than treating this as a "one size fits all" type of
problem, let's step back and:

	a) define 2-3 different caching behaviours we consider
	   optimal for the majority of workloads/hardware we care
	   about.
	b) determine optimal workloads for each caching
	   behaviour.
	c) develop reliable triggers to detect when we
	   should switch between caching behaviours.

e.g:

	a) write back caching
		- what we have now
	   write through caching
		- extremely low dirty threshold before writeback
		  starts, enough to optimise for, say, stripe width
		  of the underlying storage.

	b) write back caching:
		- general purpose workload
	   write through caching:
		- slow device, write large file, sync
		- extremely high bandwidth devices, multi-stream
		  sequential IO
		- random IO.

	c) write back caching:
		- default
		- fadvise(NORMAL, SEQUENTIAL, WILLNEED)
	   write through caching:
		- fadvise(NOREUSE, DONTNEED, RANDOM)
		- random IO
		- sequential IO, BDI write bandwidth <<< dirty threshold
		- sequential IO, BDI write bandwidth >>> dirty threshold

I think that covers most of the issues and use cases that have been
discussed in this thread. IMO, this is the level at which we need to
solve the problem (i.e. architectural), not at the level of "let's
add sysfs variables so we can tweak bdi ratios".

Indeed, the above implies that we need the caching behaviour to be a
property of the address space, not just a property of the backing
device.

IOWs, the implementation needs to trickle down from a coherent high
level design - that will define the knobs that we need to expose to
userspace. We should not be adding new writeback behaviours by
adding knobs to sysfs without first having some clue about whether
we are solving the right problem and solving it in a sane manner...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932199Ab3KFObN (ORCPT ); Wed, 6 Nov 2013 09:31:13 -0500
Received: from relay.parallels.com ([195.214.232.42]:48144
	"EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1756366Ab3KFObL (ORCPT );
	Wed, 6 Nov 2013 09:31:11 -0500
Message-ID: <527A5269.7040900@parallels.com>
Date: Wed, 6 Nov 2013 18:30:01 +0400
From: Maxim Patlasov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0
MIME-Version: 1.0
To: Andrew Morton
CC: , , , , , , , ,
Subject: Re: [PATCH] mm: add strictlimit knob
References: <20131031142612.GA28003@kipc2.localdomain>
	<20131101142941.1161.40314.stgit@dhcp-10-30-17-2.sw.ru>
	<20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
In-Reply-To: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.30.17.2]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Andrew,

On 11/05/2013 02:01 AM, Andrew Morton wrote:
> On Fri, 01 Nov 2013 18:31:40 +0400 Maxim Patlasov wrote:
>
>> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
>> FUSE which sets bdi max_ratio to 1% by default:
>>
>> http://article.gmane.org/gmane.linux.kernel.mm/105809
>>
>> However the feature can be useful for other relatively slow or untrusted
>> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
>> feature:
>>
>> echo 1 > /sys/class/bdi/X:Y/strictlimit
>>
>> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
>> dirty limit is not reached. Of course, the effect is not visible until
>> max_ratio is decreased to some reasonable value.
> I suggest replacing "max_ratio" here with the much more informative
> "/sys/class/bdi/X:Y/max_ratio".
>
> Also, Documentation/ABI/testing/sysfs-class-bdi will need an update
> please.

OK, I'll update it, fix the patch description and re-send the patch.

>
>> mm/backing-dev.c | 35 +++++++++++++++++++++++++++++++++++
>> 1 file changed, 35 insertions(+)
>>
> I'm not really sure what to make of the patch. I assume you tested it
> and observed some effect. Could you please describe the test setup and
> the effects in some detail?

I plugged a 16GB USB flash drive into a node with 8GB RAM running
3.12.0-rc7 and started writing a huge file with "dd" (from /dev/zero to
the USB flash mount-point). While writing I was observing the "Dirty"
counter as reported by /proc/meminfo. As expected, it stabilized at a
level of about 1.2GB (15% of total RAM). Immediately after dd completed,
the "umount" command took about 5 minutes. This corresponds to the 5MB/s
write throughput of the flash drive.

Then I repeated the experiment after setting tunables:

echo 1 > /sys/class/bdi/8\:16/max_ratio
echo 1 > /sys/class/bdi/8\:16/strictlimit

This time, the "Dirty" counter became 100 times smaller - about 12MB -
and "umount" took about a second.
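The "Dirty" counter watched in this test can also be sampled from a
program. The following is a minimal C sketch of that, assuming the usual
"Dirty: <n> kB" line format of /proc/meminfo; the function names
parse_dirty_kb and read_dirty_kb are hypothetical and not taken from any
existing tool.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan meminfo-style text line by line and return the value of the
 * "Dirty:" entry in kB, or -1 if no such line is present.
 */
long parse_dirty_kb(const char *meminfo)
{
	const char *line = meminfo;

	while (line && *line) {
		long kb;

		/* Whitespace in the format matches any run of blanks. */
		if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
			return kb;
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/* Read /proc/meminfo once and return Dirty in kB, or -1 on failure. */
long read_dirty_kb(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_dirty_kb(buf);
}
```

Calling read_dirty_kb() in a loop during the dd run would show the two
plateaus reported above. The figures are also roughly self-consistent:
about 1.2GB at 5MB/s is around 240 seconds, in line with the several
minutes taken by "umount", and 12MB is 1% of 1.2GB, which matches
max_ratio being set to 1.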
Thanks,
Maxim

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932485Ab3KFPGE (ORCPT ); Wed, 6 Nov 2013 10:06:04 -0500
Received: from relay.parallels.com ([195.214.232.42]:55719
	"EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S932252Ab3KFPGC (ORCPT );
	Wed, 6 Nov 2013 10:06:02 -0500
Subject: [PATCH] mm: add strictlimit knob -v2
To: akpm@linux-foundation.org
From: Maxim Patlasov
Cc: karl.kiniger@med.ge.com, tytso@mit.edu, linux-kernel@vger.kernel.org,
	t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, jack@suse.cz,
	fengguang.wu@intel.com, torvalds@linux-foundation.org,
	mpatlasov@parallels.com
Date: Wed, 06 Nov 2013 19:05:57 +0400
Message-ID: <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru>
In-Reply-To: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
References: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org>
User-Agent: StGit/0.16
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

"strictlimit" feature was introduced to enforce per-bdi dirty limits for
FUSE which sets bdi max_ratio to 1% by default:

http://article.gmane.org/gmane.linux.kernel.mm/105809

However the feature can be useful for other relatively slow or untrusted
BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
feature:

echo 1 > /sys/class/bdi/X:Y/strictlimit

Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
dirty limit is not reached. Of course, the effect is not visible until
/sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value.

Changed in v2:
 - updated patch description and documentation

Signed-off-by: Maxim Patlasov
---
 Documentation/ABI/testing/sysfs-class-bdi | 8 +++++++
 mm/backing-dev.c | 35 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d56..3187a18 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -53,3 +53,11 @@ stable_pages_required (read-only)
 
 	If set, the backing device requires that all pages comprising a write
 	request must not be changed until writeout is complete.
+
+strictlimit (read-write)
+
+	Forces per-BDI checks for the share of given device in the write-back
+	cache even before the global background dirty limit is reached. This
+	is useful in situations where the global limit is much higher than
+	affordable for given relatively slow (or untrusted) device. Turning
+	strictlimit on has no visible effect if max_ratio is equal to 100%.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c index ce682f7..4ee1d64 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev, } static DEVICE_ATTR_RO(stable_pages_required); +static ssize_t strictlimit_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned int val; + ssize_t ret; + + ret = kstrtouint(buf, 10, &val); + if (ret < 0) + return ret; + + switch (val) { + case 0: + bdi->capabilities &= ~BDI_CAP_STRICTLIMIT; + break; + case 1: + bdi->capabilities |= BDI_CAP_STRICTLIMIT; + break; + default: + return -EINVAL; + } + + return count; +} +static ssize_t strictlimit_show(struct device *dev, + struct device_attribute *attr, char *page) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + + return snprintf(page, PAGE_SIZE-1, "%d\n", + !!(bdi->capabilities & BDI_CAP_STRICTLIMIT)); +} +static DEVICE_ATTR_RW(strictlimit); + static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_max_ratio.attr, &dev_attr_stable_pages_required.attr, + &dev_attr_strictlimit.attr, NULL, }; ATTRIBUTE_GROUPS(bdi_dev); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751107Ab3KFW0K (ORCPT ); Wed, 6 Nov 2013 17:26:10 -0500 Received: from mail.lang.hm ([64.81.33.126]:47652 "EHLO bifrost.lang.hm" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750709Ab3KFW0I (ORCPT ); Wed, 6 Nov 2013 17:26:08 -0500 Date: Mon, 4 Nov 2013 17:47:34 -0800 (PST) From: David Lang X-X-Sender: dlang@asgard.lang.hm To: "Figo.zhang" cc: NeilBrown , "Artem S. 
Tashkinov" , lkml , Linus Torvalds , linux-fsdevel@vger.kernel.org, axboe@kernel.dk, Linux-MM Subject: Re: Disabling in-memory write cache for x86-64 in Linux II In-Reply-To: Message-ID: References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <20131025214952.3eb41201@notabene.brown> <154617470.12445.1382725583671.JavaMail.mail@webmail11> <20131026074349.0adc9646@notabene.brown> <476525596.14731.1382735024280.JavaMail.mail@webmail11> <20131026091112.241da260@notabene.brown> User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 5 Nov 2013, Figo.zhang wrote:

>>> Of course, if you don't use Linux on the desktop you don't really care -
>>> well, I do. Also not everyone in this world has a UPS - which means such
>>> a huge buffer can lead to a serious data loss in case of a power blackout.
>>
>> I don't have a desk (just a lap), but I use Linux on all my computers and
>> I've never really noticed the problem. Maybe I'm just very patient, or maybe
>> I don't work with large data sets and slow devices.
>>
>> However I don't think data-loss is really a related issue. Any process that
>> cares about data safety *must* use fsync at appropriate places. This has
>> always been true.
>
> May I ask a question: with a filesystem like ext4, if some app modifies
> files, it creates some dirty data. If some metadata is being written to the
> journal disk when a power blackout happens, will some serious data be lost
> and the file be damaged?

With any filesystem and any OS, if you create dirty data but do not f*sync() the data, there is a possibility that the system can go down between the time the application creates the dirty data and the time the OS actually gets it on disk.
If the system goes down in this timeframe, the data will be lost and it may corrupt the file if only some of the data got written. David Lang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752729Ab3KGM1J (ORCPT ); Thu, 7 Nov 2013 07:27:09 -0500 Received: from out2-smtp.messagingengine.com ([66.111.4.26]:49688 "EHLO out2-smtp.messagingengine.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750964Ab3KGM1E (ORCPT ); Thu, 7 Nov 2013 07:27:04 -0500 X-Sasl-enc: qnr7I/kKAmh3kZFpd7Czr1CfKg1XL7dTXoFiLQaoZ/sN 1383827221 Date: Thu, 7 Nov 2013 10:26:58 -0200 From: Henrique de Moraes Holschuh To: Maxim Patlasov Cc: akpm@linux-foundation.org, karl.kiniger@med.ge.com, tytso@mit.edu, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, jack@suse.cz, fengguang.wu@intel.com, torvalds@linux-foundation.org Subject: Re: [PATCH] mm: add strictlimit knob -v2 Message-ID: <20131107122658.GA3355@khazad-dum.debian.net> References: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org> <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru> X-GPG-Fingerprint1: 4096R/39CB4807 C467 A717 507B BAFE D3C1 6092 0BD9 E811 39CB 4807 X-GPG-Fingerprint2: 1024D/1CDB0FE3 5422 5C61 F6B7 06FB 7E04 3738 EE25 DE3F 1CDB 0FE3 User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Is there a reason to not enforce strictlimit by default? -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." 
-- The Silicon Valley Tarot Henrique Holschuh From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755097Ab3KGNsN (ORCPT ); Thu, 7 Nov 2013 08:48:13 -0500 Received: from cantor2.suse.de ([195.135.220.15]:53255 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753386Ab3KGNsK (ORCPT ); Thu, 7 Nov 2013 08:48:10 -0500 Date: Thu, 7 Nov 2013 14:48:06 +0100 From: Jan Kara To: Dave Chinner Cc: Andreas Dilger , "Artem S. Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131107134806.GB30832@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> <20131105041245.GY6188@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20131105041245.GY6188@dastard> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 05-11-13 15:12:45, Dave Chinner wrote: > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > Something simple like “start writing at 16MB dirty on a single file” > > would probably avoid a lot of complexity at little real-world cost. > > That shouldn’t throttle dirtying memory above 16MB, but just start > > writeout much earlier than it does today. > > That doesn't solve the "slow device, large file" problem. We can > write data into the page cache at rates of over a GB/s, so it's > irrelevant to a device that can write at 5MB/s whether we start > writeback immediately or a second later when there is 500MB of dirty > pages in memory. 
AFAIK, the only way to avoid that problem is to > use write-through caching for such devices - where they throttle to > the IO rate at very low levels of cached data. Agreed. > Realistically, there is no "one right answer" for all combinations > of applications, filesystems and hardware, but writeback caching is > the best *general solution* we've got right now. > > However, IMO users should not need to care about tuning BDI dirty > ratios or even have to understand what a BDI dirty ratio is to > select the right caching method for their devices and/or workload. > The difference between writeback and write through caching is easy > to explain and AFAICT those two modes suffice to solve the problems > being discussed here. Further, if two modes suffice to solve the > problems, then we should be able to easily define a trigger to > automatically switch modes. > > /me notes that if we look at random vs sequential IO and the impact > that has on writeback duration, then it's very similar to suddenly > having a very slow device. IOWs, fadvise(RANDOM) could be used to > switch an *inode* to write through mode rather than writeback mode > to solve the problem aggregating massive amounts of random write IO > in the page cache... I disagree here. Writeback cache is also useful for aggregating random writes and making semi-sequential writes out of them. There are quite some applications which rely on the fact that they can write a file in a rather random manner (Berkeley DB, linker, ...) but the files are written out in one large linear sweep. That is actually the reason why SLES (and I believe RHEL as well) tune dirty_limit even higher than what's the default value. So I think it's rather the other way around: If you can detect the file is being written in a streaming manner, there's not much point in caching too much data for it. And I agree with you that we also have to be careful not to cache too few because otherwise two streaming writes would be interleaved too much.
Currently, we have writeback_chunk_size() which determines how much we ask to write from a single inode. So streaming writers are going to be interleaved at this chunk size anyway (currently that number is "measured bandwidth / 2"). So it would make sense to also limit amount of dirty cache for each file with streaming pattern at this number. > So rather than treating this as a "one size fits all" type of > problem, let's step back and: > > a) define 2-3 different caching behaviours we consider > optimal for the majority of workloads/hardware we care > about. > b) determine optimal workloads for each caching > behaviour. > c) develop reliable triggers to detect when we > should switch between caching behaviours. > > e.g: > > a) write back caching > - what we have now > write through caching > - extremely low dirty threshold before writeback > starts, enough to optimise for, say, stripe width > of the underlying storage. > > b) write back caching: > - general purpose workload > write through caching: > - slow device, write large file, sync > - extremely high bandwidth devices, multi-stream > sequential IO > - random IO. > > c) write back caching: > - default > - fadvise(NORMAL, SEQUENTIAL, WILLNEED) > write through caching: > - fadvise(NOREUSE, DONTNEED, RANDOM) > - random IO > - sequential IO, BDI write bandwidth <<< dirty threshold > - sequential IO, BDI write bandwidth >>> dirty threshold > > I think that covers most of the issues and use cases that have been > discussed in this thread. IMO, this is the level at which we need to > solve the problem (i.e. architectural), not at the level of "let's > add sysfs variables so we can tweak bdi ratios". > > Indeed, the above implies that we need the caching behaviour to be a > property of the address space, not just a property of the backing > device. Yes, and that would be interesting to implement and not make a mess out of the whole writeback logic because the way we currently do writeback is inherently BDI based. 
When we introduce some special per-inode limits, flusher threads would have to pick more carefully what to write and what not. We might be forced to go that way eventually anyway because of memcg aware writeback but it's not a simple step. > IOWs, the implementation needs to trickle down from a coherent high > level design - that will define the knobs that we need to expose to > userspace. We should not be adding new writeback behaviours by > adding knobs to sysfs without first having some clue about whether > we are solving the right problem and solving it in a sane manner... Agreed. But the ability to limit amount of dirty pages outstanding against a particular BDI seems as a sane one to me. It's not as flexible and automatic as the approach you suggested but it's much simpler and solves most of problems we currently have. The biggest objection against the sysfs-tunable approach is that most people won't have a clue meaning that the tunable is useless for them. But I wonder if something like: 1) turn on strictlimit by default 2) don't allow dirty cache of BDI to grow over 5s of measured writeback speed won't go a long way into solving our current problems without too much complication... Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752551Ab3KKDWq (ORCPT ); Sun, 10 Nov 2013 22:22:46 -0500 Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:46137 "EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751016Ab3KKDWh (ORCPT ); Sun, 10 Nov 2013 22:22:37 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AroGALpMgFJ5LGc//2dsb2JhbABZgweDf7ZchUCBLBd0giUBAQQBIw8BIyMFCwgDDgoCAgUhAgIPBSUDIRMbh2AFq0eSFhaBE4x0DIE+B4JrgUUDmA6SC4M6KIEt Date: Mon, 11 Nov 2013 14:22:11 +1100 From: Dave Chinner To: Jan Kara Cc: Andreas Dilger , "Artem S. 
Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131111032211.GT6188@dastard> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> <20131105041245.GY6188@dastard> <20131107134806.GB30832@quack.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20131107134806.GB30832@quack.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote: > On Tue 05-11-13 15:12:45, Dave Chinner wrote: > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > > Something simple like “start writing at 16MB dirty on a single file” > > > would probably avoid a lot of complexity at little real-world cost. > > > That shouldn’t throttle dirtying memory above 16MB, but just start > > > writeout much earlier than it does today. > > > > That doesn't solve the "slow device, large file" problem. We can > > write data into the page cache at rates of over a GB/s, so it's > > irrelevant to a device that can write at 5MB/s whether we start > > writeback immediately or a second later when there is 500MB of dirty > > pages in memory. AFAIK, the only way to avoid that problem is to > > use write-through caching for such devices - where they throttle to > > the IO rate at very low levels of cached data. > Agreed. > > > Realistically, there is no "one right answer" for all combinations > > of applications, filesystems and hardware, but writeback caching is > > the best *general solution* we've got right now. 
> > > > However, IMO users should not need to care about tuning BDI dirty > > ratios or even have to understand what a BDI dirty ratio is to > > select the right caching method for their devices and/or workload. > > The difference between writeback and write through caching is easy > > to explain and AFAICT those two modes suffice to solve the problems > > being discussed here. Further, if two modes suffice to solve the > > problems, then we should be able to easily define a trigger to > > automatically switch modes. > > > > /me notes that if we look at random vs sequential IO and the impact > > that has on writeback duration, then it's very similar to suddenly > > having a very slow device. IOWs, fadvise(RANDOM) could be used to > > switch an *inode* to write through mode rather than writeback mode > > to solve the problem aggregating massive amounts of random write IO > > in the page cache... > I disagree here. Writeback cache is also useful for aggregating random > writes and making semi-sequential writes out of them. There are quite some > applications which rely on the fact that they can write a file in a rather > random manner (Berkeley DB, linker, ...) but the files are written out in > one large linear sweep. That is actually the reason why SLES (and I believe > RHEL as well) tune dirty_limit even higher than what's the default value. Right - but the correct behaviour really depends on the pattern of randomness. The common case we get into trouble with is when no clustering occurs and we end up with small, random IO for gigabytes of cached data. That's the case where write-through caching for random data is better. It's also questionable whether writeback caching for aggregation is faster for random IO on high-IOPS devices or not. Again, I think it would depend very much on how random the patterns are...
> So I think it's rather the other way around: If you can detect the file is > being written in a streaming manner, there's not much point in caching too > much data for it. But we're not talking about how much data we cache here - we are considering how much data we allow to get dirty before writing it back. It doesn't matter if we use writeback or write through caching, the page cache footprint for a given workload is likely to be similar, but without any data we can't draw any conclusions here. > And I agree with you that we also have to be careful not > to cache too few because otherwise two streaming writes would be > interleaved too much. Currently, we have writeback_chunk_size() which > determines how much we ask to write from a single inode. So streaming > writers are going to be interleaved at this chunk size anyway (currently > that number is "measured bandwidth / 2"). So it would make sense to also > limit amount of dirty cache for each file with streaming pattern at this > number. My experience says that for streaming IO we typically need at least 5s of cached *dirty* data to even out delays and latencies in the writeback IO pipeline. Hence limiting a file to what we can write in a second given we might only write a file once a second is likely going to result in pipeline stalls... Remember, writeback caching is about maximising throughput, not minimising latency. The "sync latency" problem with caching too much dirty data on slow block devices is really a corner case behaviour and should not compromise the common case for bulk writeback throughput. > > Indeed, the above implies that we need the caching behaviour to be a > > property of the address space, not just a property of the backing > > device. > Yes, and that would be interesting to implement and not make a mess out > of the whole writeback logic because the way we currently do writeback is > inherently BDI based. 
When we introduce some special per-inode limits, > flusher threads would have to pick more carefully what to write and what > not. We might be forced to go that way eventually anyway because of memcg > aware writeback but it's not a simple step. Agreed, it's not simple, and that's why we need to start working from the architectural level.... > > IOWs, the implementation needs to trickle down from a coherent high > > level design - that will define the knobs that we need to expose to > > userspace. We should not be adding new writeback behaviours by > > adding knobs to sysfs without first having some clue about whether > > we are solving the right problem and solving it in a sane manner... > Agreed. But the ability to limit amount of dirty pages outstanding > against a particular BDI seems as a sane one to me. It's not as flexible > and automatic as the approach you suggested but it's much simpler and > solves most of problems we currently have. That's true, but.... > The biggest objection against the sysfs-tunable approach is that most > people won't have a clue meaning that the tunable is useless for them. .... that's the big problem I see - nobody is going to know how to use it, when to use it, or be able to tell if it's the root cause of some weird performance problem they are seeing. > But I > wonder if something like: > 1) turn on strictlimit by default > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback > speed > > won't go a long way into solving our current problems without too much > complication... Turning on strict limit by default is going to change behaviour quite markedly. Again, it's not something I'd want to see done without a bunch of data showing that it doesn't cause regressions for common workloads... Cheers, Dave. 
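To put rough numbers on the constants being debated, the sketch below compares a dirty cap of N seconds of measured writeback bandwidth with the "measured bandwidth / 2" chunk size quoted above. It is illustrative arithmetic only: the function names are hypothetical and are not kernel interfaces, the 5 MB/s figure is the slow-device example from earlier in the thread, and the 100 MB/s disk is an assumed value.

```shell
# Illustrative arithmetic only; these names are hypothetical helpers.
#   $1: measured writeback bandwidth in MB/s
#   $2: seconds of dirty data allowed to accumulate
dirty_cap_mb() {
    echo $(( $1 * $2 ))
}

# "measured bandwidth / 2", i.e. roughly half a second of writeback.
# Integer division, so odd bandwidths round down.
chunk_mb() {
    echo $(( $1 / 2 ))
}

dirty_cap_mb 5 5      # slow 5 MB/s device with a 5 s cap: prints 25
chunk_mb 100          # assumed 100 MB/s disk: prints 50
```

On these assumptions a 5 s cap keeps a 5 MB/s device at about 25 MB of dirty data, far below the 500 MB mentioned earlier, while a per-file limit of bandwidth / 2 is ten times smaller than the 5 s of dirty data argued for above.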
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755367Ab3KKTvZ (ORCPT ); Mon, 11 Nov 2013 14:51:25 -0500 Received: from cantor2.suse.de ([195.135.220.15]:55326 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754872Ab3KKTvU (ORCPT ); Mon, 11 Nov 2013 14:51:20 -0500 Date: Mon, 11 Nov 2013 20:31:47 +0100 From: Jan Kara To: Dave Chinner Cc: Jan Kara , Andreas Dilger , "Artem S. Tashkinov" , Wu Fengguang , Linus Torvalds , Andrew Morton , Linux Kernel Mailing List , linux-fsdevel , Jens Axboe , linux-mm Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Message-ID: <20131111193147.GC24867@quack.suse.cz> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <89AE8FE8-5B15-41DB-B9CE-DFF73531D821@dilger.ca> <20131105041245.GY6188@dastard> <20131107134806.GB30832@quack.suse.cz> <20131111032211.GT6188@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20131111032211.GT6188@dastard> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 11-11-13 14:22:11, Dave Chinner wrote: > On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote: > > On Tue 05-11-13 15:12:45, Dave Chinner wrote: > > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > > Realistically, there is no "one right answer" for all combinations > > > of applications, filesystems and hardware, but writeback caching is > > > the best *general solution* we've got right now. > > > > > > However, IMO users should not need to care about tuning BDI dirty > > > ratios or even have to understand what a BDI dirty ratio is to > > > select the right caching method for their devices and/or workload.
> > > The difference between writeback and write through caching is easy > > > to explain and AFAICT those two modes suffice to solve the problems > > > being discussed here. Further, if two modes suffice to solve the > > > problems, then we should be able to easily define a trigger to > > > automatically switch modes. > > > > > > /me notes that if we look at random vs sequential IO and the impact > > > that has on writeback duration, then it's very similar to suddenly > > > having a very slow device. IOWs, fadvise(RANDOM) could be used to > > > switch an *inode* to write through mode rather than writeback mode > > > to solve the problem aggregating massive amounts of random write IO > > > in the page cache... > > I disagree here. Writeback cache is also useful for aggregating random > > writes and making semi-sequential writes out of them. There are quite some > > applications which rely on the fact that they can write a file in a rather > > random manner (Berkeley DB, linker, ...) but the files are written out in > > one large linear sweep. That is actually the reason why SLES (and I believe > > RHEL as well) tune dirty_limit even higher than what's the default value. > > Right - but the correct behaviour really depends on the pattern of > randomness. The common case we get into trouble with is when no > clustering occurs and we end up with small, random IO for gigabytes > of cached data. That's the case where write-through caching for > random data is better. > > It's also questionable whether writeback caching for aggregation is > faster for random IO on high-IOPS devices or not. Again, I think it > would depend very much on how random the patterns are... I agree the usefulness of writeback caching for random IO very much depends on the working set size vs cache size, how random the accesses really are, and HW characteristics.
I just wanted to point out there are fairly common workloads & setups where writeback caching for semi-random IO really helps (because you seemed to suggest that random IO implies we should disable writeback cache). > > So I think it's rather the other way around: If you can detect the file is > > being written in a streaming manner, there's not much point in caching too > > much data for it. > > But we're not talking about how much data we cache here - we are > considering how much data we allow to get dirty before writing it > back. Sorry, I was imprecise here. I really meant that IMO it doesn't make sense to allow too much dirty data for sequentially written files. > It doesn't matter if we use writeback or write through > caching, the page cache footprint for a given workload is likely to > be similar, but without any data we can't draw any conclusions here. > > > And I agree with you that we also have to be careful not > > to cache too few because otherwise two streaming writes would be > > interleaved too much. Currently, we have writeback_chunk_size() which > > determines how much we ask to write from a single inode. So streaming > > writers are going to be interleaved at this chunk size anyway (currently > > that number is "measured bandwidth / 2"). So it would make sense to also > > limit amount of dirty cache for each file with streaming pattern at this > > number. > > My experience says that for streaming IO we typically need at least > 5s of cached *dirty* data to even out delays and latencies in the > writeback IO pipeline. Hence limiting a file to what we can write in > a second given we might only write a file once a second is likely > going to result in pipeline stalls... I guess this begs for real data. We agree in principle but differ in constants :). > Remember, writeback caching is about maximising throughput, not > minimising latency. 
The "sync latency" problem with caching too much > dirty data on slow block devices is really a corner case behaviour > and should not compromise the common case for bulk writeback > throughput. Agreed. As a primary goal we want to maximise throughput. But we want to maintain sane latency as well (e.g. because we have a "promise" of "dirty_writeback_centisecs" we have to cycle through dirty inodes reasonably frequently). > > Agreed. But the ability to limit amount of dirty pages outstanding > > against a particular BDI seems as a sane one to me. It's not as flexible > > and automatic as the approach you suggested but it's much simpler and > > solves most of problems we currently have. > > That's true, but.... > > > The biggest objection against the sysfs-tunable approach is that most > > people won't have a clue meaning that the tunable is useless for them. > > .... that's the big problem I see - nobody is going to know how to > use it, when to use it, or be able to tell if it's the root cause of > some weird performance problem they are seeing. > > > But I > > wonder if something like: > > 1) turn on strictlimit by default > > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback > > speed > > > > won't go a long way into solving our current problems without too much > > complication... > > Turning on strict limit by default is going to change behaviour > quite markedly. Again, it's not something I'd want to see done > without a bunch of data showing that it doesn't cause regressions > for common workloads... Agreed. 
Honza -- Jan Kara SUSE Labs, CR From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752215Ab3KOPvt (ORCPT ); Fri, 15 Nov 2013 10:51:49 -0500 Received: from mail-wi0-f171.google.com ([209.85.212.171]:54633 "EHLO mail-wi0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750992Ab3KOPvl convert rfc822-to-8bit (ORCPT ); Fri, 15 Nov 2013 10:51:41 -0500 From: Diego Calleja To: Fengguang Wu Cc: "Artem S. Tashkinov" , david@lang.hm, neilb@suse.de, linux-kernel@vger.kernel.org, torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, linux-mm@kvack.org Subject: Re: Disabling in-memory write cache for x86-64 in Linux II Date: Fri, 15 Nov 2013 16:48:13 +0100 Message-ID: <3934111.dEm1hrGs4E@diego-arch> User-Agent: KMail/4.11.3 (Linux/3.12.0; KDE/4.11.3; x86_64; ; ) In-Reply-To: <20131025233225.GA32051@localhost> References: <160824051.3072.1382685914055.JavaMail.mail@webmail07> <1999200.Zdacx0scmY@diego-arch> <20131025233225.GA32051@localhost> MIME-Version: 1.0 Content-Transfer-Encoding: 8BIT Content-Type: text/plain; charset="iso-8859-1" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Saturday, 26 October 2013 00:32:25, Fengguang Wu wrote: > What's the kernel you are running? And it's writing to a hard disk? > The stalls are most likely caused by either one of > > 1) write IO starves read IO > 2) direct page reclaim blocked when > - trying to writeout PG_dirty pages > - trying to lock PG_writeback pages > > Which may be confirmed by running > > ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 > or > echo w > /proc/sysrq-trigger # and check dmesg > > during the stalls. The latter command works more reliably.
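The ps check suggested above can be narrowed to just the blocked tasks with a small filter. A minimal sketch: `only_dstate` and `blocked_tasks` are hypothetical names, and the filter simply keeps the header plus any row whose STAT column (the fourth field in that ps format) starts with D, i.e. uninterruptible sleep.

```shell
# Keep the header row and any row whose 4th column (STAT) starts with "D".
# Expects `ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32` output on stdin.
only_dstate() {
    awk 'NR == 1 || $4 ~ /^D/'
}

# Hypothetical wrapper around the ps invocation quoted above.
blocked_tasks() {
    ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | only_dstate
}
```

Calling `blocked_tasks` a few times during a stall gives a quick view of which processes are stuck and in which kernel function; the sysrq-w dump remains the more reliable source, as noted above.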
Sorry for the delay (background: rsync'ing large files from/to a hard disk in a desktop with 16GB of RAM makes the whole desktop unresponsive). I just triggered it today (running 3.12) and ran sysrq-w:

[ 5547.001505] SysRq : Show Blocked State
[ 5547.001509] task PC stack pid father
[ 5547.001516] btrfs-transacti D ffff880425d7a8a0 0 193 2 0x00000000
[ 5547.001519] ffff880425eede10 0000000000000002 ffff880425eedfd8 0000000000012e40
[ 5547.001521] ffff880425eedfd8 0000000000012e40 ffff880425d7a8a0 ffffea00104baa80
[ 5547.001523] ffff880425eedd90 ffff880425eedd68 ffff880425eedd70 ffffffff81080edd
[ 5547.001525] Call Trace:
[ 5547.001530] [] ? get_parent_ip+0xd/0x50
[ 5547.001533] [] ? sub_preempt_count+0x49/0x50
[ 5547.001535] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.001552] [] ? btrfs_run_ordered_operations+0x212/0x2c0 [btrfs]
[ 5547.001554] [] ? get_parent_ip+0xd/0x50
[ 5547.001556] [] ? sub_preempt_count+0x49/0x50
[ 5547.001557] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001559] [] schedule+0x29/0x70
[ 5547.001566] [] btrfs_commit_transaction+0x265/0x9d0 [btrfs]
[ 5547.001569] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.001575] [] transaction_kthread+0x19d/0x220 [btrfs]
[ 5547.001581] [] ? free_fs_root+0xc0/0xc0 [btrfs]
[ 5547.001583] [] kthread+0xc0/0xd0
[ 5547.001585] [] ? kthread_create_on_node+0x120/0x120
[ 5547.001587] [] ret_from_fork+0x7c/0xb0
[ 5547.001588] [] ? kthread_create_on_node+0x120/0x120
[ 5547.001590] systemd-journal D ffff880426e19860 0 234 1 0x00000000
[ 5547.001592] ffff880426d77d90 0000000000000002 ffff880426d77fd8 0000000000012e40
[ 5547.001593] ffff880426d77fd8 0000000000012e40 ffff880426e19860 ffffffff8155d7cd
[ 5547.001595] 0000000000000001 0000000000000001 0000000000000000 ffffffff81572560
[ 5547.001596] Call Trace:
[ 5547.001598] [] ? retint_restore_args+0xe/0xe
[ 5547.001601] [] ? queue_unplugged+0x3b/0xe0
[ 5547.001602] [] ? blk_flush_plug_list+0x1eb/0x230
[ 5547.001604] [] schedule+0x29/0x70
[ 5547.001606] [] schedule_preempt_disabled+0x18/0x30
[ 5547.001607] [] __mutex_lock_slowpath+0x124/0x1f0
[ 5547.001613] [] ? btrfs_write_marked_extents+0xbb/0xe0 [btrfs]
[ 5547.001615] [] mutex_lock+0x17/0x30
[ 5547.001623] [] btrfs_sync_log+0x22a/0x690 [btrfs]
[ 5547.001630] [] btrfs_sync_file+0x287/0x2e0 [btrfs]
[ 5547.001632] [] do_fsync+0x56/0x80
[ 5547.001634] [] SyS_fsync+0x10/0x20
[ 5547.001635] [] tracesys+0xdd/0xe2
[ 5547.001644] mysqld D ffff8803f0901860 0 643 579 0x00000000
[ 5547.001645] ffff8803f090de18 0000000000000002 ffff8803f090dfd8 0000000000012e40
[ 5547.001647] ffff8803f090dfd8 0000000000012e40 ffff8803f0901860 ffff88016d038000
[ 5547.001648] ffff880426908d00 0000000024119d80 0000000000000000 0000000000000000
[ 5547.001650] Call Trace:
[ 5547.001657] [] ? btrfs_submit_bio_hook+0x84/0x1f0 [btrfs]
[ 5547.001659] [] ? get_parent_ip+0xd/0x50
[ 5547.001660] [] ? sub_preempt_count+0x49/0x50
[ 5547.001662] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001663] [] schedule+0x29/0x70
[ 5547.001669] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001671] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.001677] [] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001680] [] ? do_writepages+0x1e/0x40
[ 5547.001686] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001693] [] btrfs_sync_file+0x17f/0x2e0 [btrfs]
[ 5547.001694] [] do_fsync+0x56/0x80
[ 5547.001696] [] SyS_fdatasync+0x13/0x20
[ 5547.001697] [] tracesys+0xdd/0xe2
[ 5547.001701] virtuoso-t D ffff88000310b0c0 0 617 609 0x00000000
[ 5547.001702] ffff8803f4867c20 0000000000000002 ffff8803f4867fd8 0000000000012e40
[ 5547.001704] ffff8803f4867fd8 0000000000012e40 ffff88000310b0c0 ffffffff813ce4af
[ 5547.001705] ffffffff81860520 ffff8802d8ad8a00 ffff8803f4867ba0 ffffffff81231a0e
[ 5547.001707] Call Trace:
[ 5547.001709] [] ? scsi_pool_alloc_command+0x3f/0x80
[ 5547.001712] [] ? __blk_segment_map_sg+0x4e/0x120
[ 5547.001713] [] ? blk_rq_map_sg+0x8b/0x1f0
[ 5547.001716] [] ? cfq_dispatch_requests+0xba/0xc40
[ 5547.001718] [] ? get_parent_ip+0xd/0x50
[ 5547.001721] [] ? filemap_fdatawait+0x30/0x30
[ 5547.001722] [] schedule+0x29/0x70
[ 5547.001723] [] io_schedule+0x8f/0xe0
[ 5547.001725] [] sleep_on_page+0xe/0x20
[ 5547.001727] [] __wait_on_bit+0x62/0x90
[ 5547.001728] [] wait_on_page_bit+0x7f/0x90
[ 5547.001730] [] ? wake_atomic_t_function+0x40/0x40
[ 5547.001732] [] filemap_fdatawait_range+0x11b/0x1a0
[ 5547.001734] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.001740] [] btrfs_wait_marked_extents+0x87/0xe0 [btrfs]
[ 5547.001747] [] btrfs_sync_log+0x4e8/0x690 [btrfs]
[ 5547.001754] [] btrfs_sync_file+0x287/0x2e0 [btrfs]
[ 5547.001756] [] do_fsync+0x56/0x80
[ 5547.001758] [] SyS_fsync+0x10/0x20
[ 5547.001759] [] tracesys+0xdd/0xe2
[ 5547.001761] pool D ffff88040db1c100 0 657 477 0x00000000
[ 5547.001763] ffff8803ee809ba0 0000000000000002 ffff8803ee809fd8 0000000000012e40
[ 5547.001764] ffff8803ee809fd8 0000000000012e40 ffff88040db1c100 0000000000000004
[ 5547.001766] ffff8803ee809ae8 ffffffff8155cc86 ffff8803ee809bd0 ffffffffa005ada4
[ 5547.001767] Call Trace:
[ 5547.001769] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.001775] [] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.001776] [] ? get_parent_ip+0xd/0x50
[ 5547.001778] [] ? sub_preempt_count+0x49/0x50
[ 5547.001779] [] ? get_parent_ip+0xd/0x50
[ 5547.001781] [] ? sub_preempt_count+0x49/0x50
[ 5547.001783] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001784] [] schedule+0x29/0x70
[ 5547.001790] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001792] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.001798] [] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001804] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001810] [] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.001813] [] ? security_inode_permission+0x1c/0x30
[ 5547.001815] [] vfs_create+0xb4/0x120
[ 5547.001817] [] do_last+0x904/0xea0
[ 5547.001818] [] ? link_path_walk+0x70/0x930
[ 5547.001820] [] ? get_parent_ip+0xd/0x50
[ 5547.001822] [] ? security_file_alloc+0x16/0x20
[ 5547.001824] [] path_openat+0xbb/0x6b0
[ 5547.001827] [] ? __acct_update_integrals+0x7f/0x100
[ 5547.001829] [] ? account_system_time+0xa2/0x180
[ 5547.001831] [] ? get_parent_ip+0xd/0x50
[ 5547.001833] [] do_filp_open+0x3a/0x90
[ 5547.001834] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.001836] [] ? __alloc_fd+0xa7/0x130
[ 5547.001839] [] do_sys_open+0x129/0x220
[ 5547.001842] [] ? syscall_trace_enter+0x135/0x230
[ 5547.001844] [] SyS_open+0x1e/0x20
[ 5547.001845] [] tracesys+0xdd/0xe2
[ 5547.001850] akregator D ffff8803ed1d4100 0 875 1 0x00000000
[ 5547.001851] ffff8803c7f1bba0 0000000000000002 ffff8803c7f1bfd8 0000000000012e40
[ 5547.001853] ffff8803c7f1bfd8 0000000000012e40 ffff8803ed1d4100 0000000000000004
[ 5547.001854] ffff8803c7f1bae8 ffffffff8155cc86 ffff8803c7f1bbd0 ffffffffa005ada4
[ 5547.001856] Call Trace:
[ 5547.001858] [] ? _raw_spin_unlock+0x16/0x40
[ 5547.001863] [] ? reserve_metadata_bytes+0x184/0x930 [btrfs]
[ 5547.001865] [] ? get_parent_ip+0xd/0x50
[ 5547.001866] [] ? sub_preempt_count+0x49/0x50
[ 5547.001868] [] ? get_parent_ip+0xd/0x50
[ 5547.001870] [] ? sub_preempt_count+0x49/0x50
[ 5547.001871] [] ? _raw_spin_unlock_irqrestore+0x26/0x60
[ 5547.001873] [] schedule+0x29/0x70
[ 5547.001879] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs]
[ 5547.001881] [] ? wake_up_atomic_t+0x30/0x30
[ 5547.001886] [] start_transaction+0x37f/0x570 [btrfs]
[ 5547.001888] [] ? get_parent_ip+0xd/0x50
[ 5547.001894] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 5547.001900] [] btrfs_create+0x3b/0x200 [btrfs]
[ 5547.001902] [] ? security_inode_permission+0x1c/0x30
[ 5547.001904] [] vfs_create+0xb4/0x120
[ 5547.001906] [] do_last+0x904/0xea0
[ 5547.001907] [] ? link_path_walk+0x70/0x930
[ 5547.001909] [] ? get_parent_ip+0xd/0x50
[ 5547.001911] [] ? security_file_alloc+0x16/0x20
[ 5547.001912] [] path_openat+0xbb/0x6b0
[ 5547.001914] [] ?
__acct_update_integrals+0x7f/0x100 [ 5547.001916] [] ? account_system_time+0xa2/0x180 [ 5547.001918] [] ? get_parent_ip+0xd/0x50 [ 5547.001920] [] do_filp_open+0x3a/0x90 [ 5547.001921] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.001923] [] ? __alloc_fd+0xa7/0x130 [ 5547.001925] [] do_sys_open+0x129/0x220 [ 5547.001927] [] ? syscall_trace_enter+0x135/0x230 [ 5547.001928] [] SyS_open+0x1e/0x20 [ 5547.001930] [] tracesys+0xdd/0xe2 [ 5547.001931] mpegaudioparse3 D ffff880341d10820 0 5917 1 0x00000000 [ 5547.001933] ffff88030f779ce0 0000000000000002 ffff88030f779fd8 0000000000012e40 [ 5547.001934] ffff88030f779fd8 0000000000012e40 ffff880341d10820 ffffffff81122a28 [ 5547.001936] ffff88043e5ddc00 ffff880400000002 ffff88043e2138d0 0000000000000000 [ 5547.001938] Call Trace: [ 5547.001939] [] ? __alloc_pages_nodemask+0x158/0xb00 [ 5547.001941] [] ? native_send_call_func_single_ipi+0x35/0x40 [ 5547.001943] [] ? generic_exec_single+0x98/0xa0 [ 5547.001945] [] ? __enqueue_entity+0x78/0x80 [ 5547.001947] [] ? enqueue_entity+0x197/0x780 [ 5547.001948] [] ? get_parent_ip+0xd/0x50 [ 5547.001950] [] ? sleep_on_page+0x20/0x20 [ 5547.001951] [] schedule+0x29/0x70 [ 5547.001953] [] io_schedule+0x8f/0xe0 [ 5547.001954] [] sleep_on_page_killable+0xe/0x40 [ 5547.001956] [] __wait_on_bit_lock+0x5d/0xc0 [ 5547.001958] [] __lock_page_killable+0x6a/0x70 [ 5547.001960] [] ? wake_atomic_t_function+0x40/0x40 [ 5547.001961] [] generic_file_aio_read+0x435/0x700 [ 5547.001963] [] do_sync_read+0x5a/0x90 [ 5547.001965] [] vfs_read+0x9a/0x170 [ 5547.001967] [] SyS_read+0x49/0xa0 [ 5547.001968] [] tracesys+0xdd/0xe2 [ 5547.001970] mozStorage #2 D ffff8803b7aa1860 0 920 477 0x00000000 [ 5547.001972] ffff8803b1473d80 0000000000000002 ffff8803b1473fd8 0000000000012e40 [ 5547.001974] ffff8803b1473fd8 0000000000012e40 ffff8803b7aa1860 0000000000000004 [ 5547.001975] ffff8803b1473cc8 ffffffff8155cc86 ffff8803b1473db0 ffffffffa005ada4 [ 5547.001977] Call Trace: [ 5547.001978] [] ? 
_raw_spin_unlock+0x16/0x40 [ 5547.001984] [] ? reserve_metadata_bytes+0x184/0x930 [btrfs] [ 5547.001990] [] ? __btrfs_buffered_write+0x3d9/0x490 [btrfs] [ 5547.001992] [] ? get_parent_ip+0xd/0x50 [ 5547.001994] [] ? sub_preempt_count+0x49/0x50 [ 5547.001995] [] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.001997] [] schedule+0x29/0x70 [ 5547.002003] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.002004] [] ? wake_up_atomic_t+0x30/0x30 [ 5547.002010] [] start_transaction+0x37f/0x570 [btrfs] [ 5547.002016] [] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.002023] [] btrfs_setattr+0x101/0x290 [btrfs] [ 5547.002025] [] ? rcu_eqs_enter+0x5c/0xa0 [ 5547.002027] [] notify_change+0x1dc/0x360 [ 5547.002029] [] ? sub_preempt_count+0x49/0x50 [ 5547.002030] [] do_truncate+0x6b/0xa0 [ 5547.002032] [] ? __sb_start_write+0x49/0x100 [ 5547.002033] [] SyS_ftruncate+0x10b/0x160 [ 5547.002035] [] tracesys+0xdd/0xe2 [ 5547.002036] Cache I/O D ffff8803b7aa28a0 0 922 477 0x00000000 [ 5547.002038] ffff8803b1495e18 0000000000000002 ffff8803b1495fd8 0000000000012e40 [ 5547.002039] ffff8803b1495fd8 0000000000012e40 ffff8803b7aa28a0 ffff8803b1495e08 [ 5547.002041] ffff8803b1495db0 ffffffff8111a25a ffff8803b1495e40 ffff8803b1495df0 [ 5547.002043] Call Trace: [ 5547.002045] [] ? find_get_pages_tag+0xea/0x180 [ 5547.002047] [] ? get_parent_ip+0xd/0x50 [ 5547.002048] [] ? sub_preempt_count+0x49/0x50 [ 5547.002050] [] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.002051] [] schedule+0x29/0x70 [ 5547.002057] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.002059] [] ? 
wake_up_atomic_t+0x30/0x30 [ 5547.002065] [] start_transaction+0x37f/0x570 [btrfs] [ 5547.002071] [] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.002077] [] btrfs_sync_file+0x17f/0x2e0 [btrfs] [ 5547.002079] [] do_fsync+0x56/0x80 [ 5547.002080] [] SyS_fsync+0x10/0x20 [ 5547.002081] [] tracesys+0xdd/0xe2 [ 5547.002083] mozStorage #6 D ffff8803c0cfa8a0 0 982 477 0x00000000 [ 5547.002085] ffff8803a10f5ba0 0000000000000002 ffff8803a10f5fd8 0000000000012e40 [ 5547.002086] ffff8803a10f5fd8 0000000000012e40 ffff8803c0cfa8a0 0000000000000004 [ 5547.002088] ffff8803a10f5ae8 ffffffff8155cc86 ffff8803a10f5bd0 ffffffffa005ada4 [ 5547.002089] Call Trace: [ 5547.002091] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.002096] [] ? reserve_metadata_bytes+0x184/0x930 [btrfs] [ 5547.002098] [] ? native_smp_send_reschedule+0x47/0x60 [ 5547.002100] [] ? resched_task+0x5c/0x60 [ 5547.002101] [] ? get_parent_ip+0xd/0x50 [ 5547.002103] [] ? sub_preempt_count+0x49/0x50 [ 5547.002104] [] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.002106] [] schedule+0x29/0x70 [ 5547.002112] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.002113] [] ? wake_up_atomic_t+0x30/0x30 [ 5547.002119] [] start_transaction+0x37f/0x570 [btrfs] [ 5547.002125] [] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.002131] [] btrfs_create+0x3b/0x200 [btrfs] [ 5547.002133] [] ? security_inode_permission+0x1c/0x30 [ 5547.002134] [] vfs_create+0xb4/0x120 [ 5547.002136] [] do_last+0x904/0xea0 [ 5547.002138] [] ? link_path_walk+0x70/0x930 [ 5547.002139] [] ? get_parent_ip+0xd/0x50 [ 5547.002141] [] ? security_file_alloc+0x16/0x20 [ 5547.002143] [] path_openat+0xbb/0x6b0 [ 5547.002145] [] ? __acct_update_integrals+0x7f/0x100 [ 5547.002147] [] ? account_system_time+0xa2/0x180 [ 5547.002148] [] ? get_parent_ip+0xd/0x50 [ 5547.002150] [] do_filp_open+0x3a/0x90 [ 5547.002152] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.002153] [] ? __alloc_fd+0xa7/0x130 [ 5547.002155] [] do_sys_open+0x129/0x220 [ 5547.002157] [] ? 
syscall_trace_enter+0x135/0x230 [ 5547.002159] [] SyS_open+0x1e/0x20 [ 5547.002160] [] tracesys+0xdd/0xe2 [ 5547.002164] rsync D ffff8802dcde0820 0 5803 5802 0x00000000 [ 5547.002165] ffff8802daeb1a90 0000000000000002 ffff8802daeb1fd8 0000000000012e40 [ 5547.002167] ffff8802daeb1fd8 0000000000012e40 ffff8802dcde0820 ffff880100000002 [ 5547.002169] ffff8802daeb19e0 ffffffff81080edd ffff880308b337e0 0000000000000000 [ 5547.002170] Call Trace: [ 5547.002172] [] ? get_parent_ip+0xd/0x50 [ 5547.002173] [] ? get_parent_ip+0xd/0x50 [ 5547.002175] [] ? sub_preempt_count+0x49/0x50 [ 5547.002177] [] ? get_parent_ip+0xd/0x50 [ 5547.002178] [] ? add_preempt_count+0x3d/0x40 [ 5547.002180] [] ? get_parent_ip+0xd/0x50 [ 5547.002181] [] schedule+0x29/0x70 [ 5547.002182] [] schedule_timeout+0x11a/0x230 [ 5547.002185] [] ? detach_if_pending+0x120/0x120 [ 5547.002187] [] ? ktime_get_ts+0x48/0xe0 [ 5547.002189] [] io_schedule_timeout+0x9b/0xf0 [ 5547.002191] [] balance_dirty_pages_ratelimited+0x3d9/0xa10 [ 5547.002198] [] ? ext4_dirty_inode+0x54/0x60 [ext4] [ 5547.002200] [] generic_file_buffered_write+0x1b8/0x290 [ 5547.002202] [] __generic_file_aio_write+0x1a9/0x3b0 [ 5547.002203] [] generic_file_aio_write+0x58/0xa0 [ 5547.002208] [] ext4_file_write+0x99/0x3e0 [ext4] [ 5547.002210] [] ? acct_account_cputime+0x1c/0x20 [ 5547.002212] [] ? account_system_time+0xa2/0x180 [ 5547.002213] [] ? get_parent_ip+0xd/0x50 [ 5547.002215] [] ? get_parent_ip+0xd/0x50 [ 5547.002216] [] do_sync_write+0x5a/0x90 [ 5547.002218] [] vfs_write+0xbd/0x1e0 [ 5547.002220] [] SyS_write+0x49/0xa0 [ 5547.002221] [] tracesys+0xdd/0xe2 [ 5547.002223] ktorrent D ffff8802e7680820 0 5806 1 0x00000000 [ 5547.002224] ffff8802daf7fba0 0000000000000002 ffff8802daf7ffd8 0000000000012e40 [ 5547.002226] ffff8802daf7ffd8 0000000000012e40 ffff8802e7680820 0000000000000004 [ 5547.002227] ffff8802daf7fae8 ffffffff8155cc86 ffff8802daf7fbd0 ffffffffa005ada4 [ 5547.002229] Call Trace: [ 5547.002230] [] ? 
_raw_spin_unlock+0x16/0x40 [ 5547.002236] [] ? reserve_metadata_bytes+0x184/0x930 [btrfs] [ 5547.002241] [] ? btrfs_set_path_blocking+0x39/0x80 [btrfs] [ 5547.002246] [] ? btrfs_search_slot+0x498/0x970 [btrfs] [ 5547.002247] [] ? get_parent_ip+0xd/0x50 [ 5547.002249] [] ? sub_preempt_count+0x49/0x50 [ 5547.002251] [] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.002252] [] schedule+0x29/0x70 [ 5547.002258] [] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.002260] [] ? wake_up_atomic_t+0x30/0x30 [ 5547.002266] [] start_transaction+0x37f/0x570 [btrfs] [ 5547.002268] [] ? sub_preempt_count+0x49/0x50 [ 5547.002273] [] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.002280] [] btrfs_create+0x3b/0x200 [btrfs] [ 5547.002281] [] ? security_inode_permission+0x1c/0x30 [ 5547.002283] [] vfs_create+0xb4/0x120 [ 5547.002285] [] do_last+0x904/0xea0 [ 5547.002287] [] ? link_path_walk+0x70/0x930 [ 5547.002288] [] ? get_parent_ip+0xd/0x50 [ 5547.002290] [] ? security_file_alloc+0x16/0x20 [ 5547.002292] [] path_openat+0xbb/0x6b0 [ 5547.002293] [] ? __acct_update_integrals+0x7f/0x100 [ 5547.002295] [] ? account_system_time+0xa2/0x180 [ 5547.002297] [] ? get_parent_ip+0xd/0x50 [ 5547.002299] [] do_filp_open+0x3a/0x90 [ 5547.002300] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.002302] [] ? __alloc_fd+0xa7/0x130 [ 5547.002304] [] do_sys_open+0x129/0x220 [ 5547.002306] [] ? syscall_trace_enter+0x135/0x230 [ 5547.002307] [] SyS_open+0x1e/0x20 [ 5547.002309] [] tracesys+0xdd/0xe2 [ 5547.002311] kworker/u16:0 D ffff88035c5ac920 0 6043 2 0x00000000 [ 5547.002313] Workqueue: writeback bdi_writeback_workfn (flush-8:32) [ 5547.002315] ffff88036c9cb898 0000000000000002 ffff88036c9cbfd8 0000000000012e40 [ 5547.002316] ffff88036c9cbfd8 0000000000012e40 ffff88035c5ac920 ffff8804281de048 [ 5547.002318] ffff88036c9cb7e8 ffffffff81080edd 0000000000000001 ffff88036c9cb800 [ 5547.002319] Call Trace: [ 5547.002321] [] ? get_parent_ip+0xd/0x50 [ 5547.002323] [] ? 
sub_preempt_count+0x49/0x50 [ 5547.002324] [] ? _raw_spin_unlock+0x16/0x40 [ 5547.002326] [] ? queue_unplugged+0x3b/0xe0 [ 5547.002328] [] schedule+0x29/0x70 [ 5547.002329] [] io_schedule+0x8f/0xe0 [ 5547.002331] [] get_request+0x1aa/0x780 [ 5547.002332] [] ? ioc_lookup_icq+0x4e/0x80 [ 5547.002334] [] ? wake_up_atomic_t+0x30/0x30 [ 5547.002336] [] blk_queue_bio+0x78/0x3e0 [ 5547.002337] [] generic_make_request+0xc2/0x110 [ 5547.002338] [] submit_bio+0x73/0x160 [ 5547.002344] [] ext4_io_submit+0x25/0x50 [ext4] [ 5547.002348] [] ext4_writepages+0x823/0xe00 [ext4] [ 5547.002350] [] do_writepages+0x1e/0x40 [ 5547.002352] [] __writeback_single_inode+0x40/0x330 [ 5547.002353] [] writeback_sb_inodes+0x262/0x450 [ 5547.002355] [] __writeback_inodes_wb+0x9f/0xd0 [ 5547.002357] [] wb_writeback+0x32b/0x360 [ 5547.002358] [] bdi_writeback_workfn+0x221/0x510 [ 5547.002361] [] process_one_work+0x167/0x450 [ 5547.002362] [] worker_thread+0x121/0x3a0 [ 5547.002364] [] ? sub_preempt_count+0x49/0x50 [ 5547.002366] [] ? manage_workers.isra.25+0x2a0/0x2a0 [ 5547.002367] [] kthread+0xc0/0xd0 [ 5547.002369] [] ? kthread_create_on_node+0x120/0x120 [ 5547.002371] [] ret_from_fork+0x7c/0xb0 [ 5547.002372] [] ? kthread_create_on_node+0x120/0x120 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753823Ab3KTDQh (ORCPT ); Tue, 19 Nov 2013 22:16:37 -0500 Received: from mail-ob0-f174.google.com ([209.85.214.174]:44223 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753346Ab3KTDQf convert rfc822-to-8bit (ORCPT ); Tue, 19 Nov 2013 22:16:35 -0500 Date: Tue, 19 Nov 2013 11:17:03 -0600 From: Rob Landley Subject: Re: Disabling in-memory write cache for x86-64 in Linux II To: Mel Gorman Cc: Jan Kara , Linus Torvalds , Andrew Morton , "Theodore Ts'o" , "Artem S. 
Tashkinov" , Wu Fengguang , Linux Kernel Mailing List In-Reply-To: <20131030120152.GM2400@suse.de> (from mgorman@suse.de on Wed Oct 30 07:01:52 2013) X-Mailer: Balsa 2.4.11 Message-Id: <1384881423.1974.277@driftwood> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; DelSp=Yes; Format=Flowed Content-Disposition: inline Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/30/2013 07:01:52 AM, Mel Gorman wrote: > We talked about this a > few months ago but I still suspect that we will have to bite the > bullet and > tune based on "do not dirty more data than it takes N seconds to > writeback" > using per-bdi writeback estimations. It's just not that trivial to > implement > as the writeback speeds can change for a variety of reasons (multiple > IO > sources, random vs sequential etc). Record "block writes finished this second" into an 8 entry ring buffer, with a flag saying "device was partly idle this period" so you can ignore those entries. Keep a high water mark, which should converge to the device's linear write capacity. This gives you recent thrashing speed and max capacity, and some weighted average of the two lets you avoid queuing up 10 minutes of writes all at once like 3.0 would to a terabyte USB2 disk. (And then vim calls sync() and hangs...) The first tricky bit is the high water mark, but it's not too bad. If the device reads and writes at the same rate you can populate it from that, but even starting it with just one block should converge really fast because A) the round trip time should be well under a second, B) if you're submitting more than one period's worth of data (you can dirty enough to keep disk busy for 2 seconds), then it'll queue up 2 blocks at a time, then 4, then 8, and increase exponentially until you hit the high water mark. (Which is measured so it won't overshoot.) 
The second tricky bit is weighting the average, but presumably counting
the high water mark as one, then adding in all the "device did not
actually go idle during this period" entries, and dividing by the number
of entries considered... Reasonable first guess?

Obvious optimizations: instead of recording the "disk went idle" flag in
the ring buffer, just don't advance the ring buffer at the end of that
second, but zero out the entry and re-accumulate it. That way the ring
buffer should always have 7 seconds of measured activity, even if it's
not necessarily recent. And of course you don't have to wake anything up
when there was no I/O, so it's nicely quiescent when the system is...

Lowering the high water mark in the case of a transient spurious reading
(maybe clock skew during suspend or virtualization glitch or some such)
is fun, and could give you a 4 billion block bad reading, but if you
always decrement the high water mark by 25% (x-=(x>>2)) each second the
disk didn't go idle (rounding up) and then queue up more than one
period's worth of data (but no more than say 8 seconds worth), such
glitches should fix themselves and it'll work its way back up or down to
a reasonably accurate value. (Keep in mind you're averaging the high
water mark back down with 7 seconds of measured data from the ring
buffer. Maybe you can cap the high water mark at the sum of all the
measured values in the ring buffer as an extra check? You're already
calculating it to do the average, so...)

This is assuming your hard drive _itself_ doesn't have bufferbloat, but
http://spritesmods.com/?art=hddhack&f=rss implies they don't, and tagged
command queueing lets you see through that anyway so your "actually
committed" numbers could presumably still be accurate if the
manufacturers aren't totally lying.

Given how far behind I am on my email, I assume somebody's already
suggested this by now.
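
A minimal sketch of the estimator described above, assuming a
once-per-second tick and a caller that knows whether the device went
idle during that second. All names here (wb_rate_est, wb_rate_complete,
wb_rate_tick, wb_rate_estimate) are made up for illustration and are not
kernel interfaces. It follows the "don't advance the ring buffer on an
idle second" variant, uses the literal x -= (x >> 2) decay, and leaves
out the optional cap against the ring sum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WB_RING_SLOTS 8 /* ring size taken from the proposal */

/* Hypothetical per-device state; not a kernel structure. */
struct wb_rate_est {
	uint64_t ring[WB_RING_SLOTS]; /* blocks completed per busy second */
	unsigned int cur;             /* slot currently accumulating */
	unsigned int filled;          /* completed busy seconds kept (max 7) */
	uint64_t high_water;          /* decaying maximum, blocks per second */
};

/* Account blocks whose writeback completed during the current second. */
void wb_rate_complete(struct wb_rate_est *e, uint64_t blocks)
{
	assert(e != NULL);
	e->ring[e->cur] += blocks;
}

/*
 * Call once per second. A second in which the device went idle is
 * discarded and its slot reused, so the ring only holds busy seconds.
 */
void wb_rate_tick(struct wb_rate_est *e, bool went_idle)
{
	uint64_t sample;

	assert(e != NULL);
	sample = e->ring[e->cur];

	if (went_idle) {
		e->ring[e->cur] = 0;
		return;
	}

	/* Decay by 25%, then let the fresh measurement raise it again. */
	e->high_water -= e->high_water >> 2;
	if (sample > e->high_water)
		e->high_water = sample;

	if (e->filled < WB_RING_SLOTS - 1)
		e->filled++;
	e->cur = (e->cur + 1) % WB_RING_SLOTS;
	e->ring[e->cur] = 0; /* the oldest sample is overwritten */
}

/* Blocks per second, with the high water mark counted as one entry. */
uint64_t wb_rate_estimate(const struct wb_rate_est *e)
{
	uint64_t sum;
	unsigned int i;

	assert(e != NULL);
	sum = e->high_water;
	for (i = 0; i < WB_RING_SLOTS; i++)
		if (i != e->cur)
			sum += e->ring[i];
	return sum / (e->filled + 1);
}
```

Starting from a zeroed struct, two busy seconds with 100 and 60 blocks
leave the high water mark at 75 (100 decayed by 25%), and the estimate
is (75 + 100 + 60) / 3 = 78 blocks per second. An idle second in between
changes nothing. Because the literal shift rounds the decrement down, a
high water mark below 4 never decays in this sketch.
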
:)

Rob

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755376Ab3KTUwp (ORCPT ); Wed, 20 Nov 2013 15:52:45 -0500
Received: from lxorguk.ukuu.org.uk ([81.2.110.251]:37486 "EHLO lxorguk.ukuu.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755027Ab3KTUwm (ORCPT ); Wed, 20 Nov 2013 15:52:42 -0500
Date: Wed, 20 Nov 2013 20:52:12 +0000
From: One Thousand Gnomes
To: Rob Landley
Cc: Mel Gorman, Jan Kara, Linus Torvalds, Andrew Morton, "Theodore Ts'o", "Artem S. Tashkinov", Wu Fengguang, Linux Kernel Mailing List
Subject: Re: Disabling in-memory write cache for x86-64 in Linux II
Message-ID: <20131120205212.2509cb8b@alan.etchedpixels.co.uk>
In-Reply-To: <1384881423.1974.277@driftwood>
References: <20131030120152.GM2400@suse.de> <1384881423.1974.277@driftwood>
Organization: Intel Corporation
X-Mailer: Claws Mail 3.8.1 (GTK+ 2.24.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

> This is assuming your hard drive _itself_ doesn't have bufferbloat, but
> http://spritesmods.com/?art=hddhack&f=rss implies they don't, and
> tagged command queueing lets you see through that anyway so your
> "actually committed" numbers could presumably still be accurate if the
> manufacturers aren't totally lying.

They don't but they do have wildly variable completion rates and times.
Nothing like a drive having a seven second hiccup to annoy people but
they can do that at times.

There are two problems though

1. Disk performance particularly in the rotating rust world is
   operations/second which is rarely related to volume

2. If the block layer is trying to decide whether the drive is busy
   you've got it the wrong way up IMHO. Busy-ness is a property of the
   device and often very device and subsystem specific, so the device
   end of the chain should figure out how loaded it feels

Beyond that the entire problem is well understood and there isn't any
real difference between an IPv4 network and a storage layer. In fact in
some cases like NFS, DRBD, AoE, and remote block device stuff it's even
more so.

(TCP based remote block devices btw are a prime example of why you need
device end of chain figuring out busy state.. you'll otherwise end up
doing double backoff)

Alan

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756281Ab3KVXpL (ORCPT ); Fri, 22 Nov 2013 18:45:11 -0500
Received: from mail.linuxfoundation.org ([140.211.169.12]:58536 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755484Ab3KVXpG (ORCPT ); Fri, 22 Nov 2013 18:45:06 -0500
Date: Fri, 22 Nov 2013 15:45:05 -0800
From: Andrew Morton
To: Maxim Patlasov
Cc: karl.kiniger@med.ge.com, tytso@mit.edu, linux-kernel@vger.kernel.org, t.artem@lycos.com, linux-mm@kvack.org, mgorman@suse.de, jack@suse.cz, fengguang.wu@intel.com, torvalds@linux-foundation.org, mpatlasov@parallels.com
Subject: Re: [PATCH] mm: add strictlimit knob -v2
Message-Id: <20131122154505.3e686fcfc584534d555399e5@linux-foundation.org>
In-Reply-To: <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru>
References: <20131104140104.7936d263258a7a6753eb325e@linux-foundation.org> <20131106150515.25906.55017.stgit@dhcp-10-30-17-2.sw.ru>
X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 06 Nov 2013 19:05:57 +0400 Maxim Patlasov wrote:

> "strictlimit" feature was introduced to enforce per-bdi dirty limits for
> FUSE which sets bdi max_ratio to 1% by default:
>
> http://article.gmane.org/gmane.linux.kernel.mm/105809
>
> However the feature can be useful for other relatively slow or untrusted
> BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the
> feature:
>
> echo 1 > /sys/class/bdi/X:Y/strictlimit
>
> Being enabled, the feature enforces bdi max_ratio limit even if global (10%)
> dirty limit is not reached. Of course, the effect is not visible until
> /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value.
>
> ...
>
> --- a/Documentation/ABI/testing/sysfs-class-bdi
> +++ b/Documentation/ABI/testing/sysfs-class-bdi
> @@ -53,3 +53,11 @@ stable_pages_required (read-only)
>
>  	If set, the backing device requires that all pages comprising a write
>  	request must not be changed until writeout is complete.
> +
> +strictlimit (read-write)
> +
> +	Forces per-BDI checks for the share of given device in the write-back
> +	cache even before the global background dirty limit is reached. This
> +	is useful in situations where the global limit is much higher than
> +	affordable for given relatively slow (or untrusted) device. Turning
> +	strictlimit on has no visible effect if max_ratio is equal to 100%.
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index ce682f7..4ee1d64 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -234,11 +234,46 @@ static ssize_t stable_pages_required_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(stable_pages_required);
>
> +static ssize_t strictlimit_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +	unsigned int val;
> +	ssize_t ret;
> +
> +	ret = kstrtouint(buf, 10, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	switch (val) {
> +	case 0:
> +		bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
> +		break;
> +	case 1:
> +		bdi->capabilities |= BDI_CAP_STRICTLIMIT;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return count;
> +}
> +static ssize_t strictlimit_show(struct device *dev,
> +		struct device_attribute *attr, char *page)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +
> +	return snprintf(page, PAGE_SIZE-1, "%d\n",
> +			!!(bdi->capabilities & BDI_CAP_STRICTLIMIT));
> +}
> +static DEVICE_ATTR_RW(strictlimit);
> +
>  static struct attribute *bdi_dev_attrs[] = {
>  	&dev_attr_read_ahead_kb.attr,
>  	&dev_attr_min_ratio.attr,
>  	&dev_attr_max_ratio.attr,
>  	&dev_attr_stable_pages_required.attr,
> +	&dev_attr_strictlimit.attr,
>  	NULL,

Well the patch is certainly simple and straightforward enough and
*seems* like it will be useful.

The main (and large!) downside is that it adds to the user interface so
we'll have to maintain this feature and its functionality for ever.

Given this, my concern is that while potentially useful, the feature
might not be *sufficiently* useful to justify its inclusion. So we'll
end up addressing these issues by other means, then we're left
maintaining this obsolete legacy feature.

So I'm thinking that unless someone can show that this is good and
complete and sufficient for a "large enough" set of issues, I'll take a
pass on the patch[1]. What do people think?

[1] Actually, I'll stick it in -mm and maintain it, so next time someone reports an issue I can say "hey, try this".
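
For a rough sense of scale, here is a sketch of the arithmetic behind
the knob, assuming the per-BDI cap is simply max_ratio percent of the
global dirty limit. bdi_dirty_cap is a hypothetical helper, not a kernel
function, and the kernel's own calculation also weighs each device's
recent share of writeout, so treat the result as an upper bound only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: upper bound on dirty bytes for
 * one backing device, taken as max_ratio percent of the global dirty
 * limit. Dividing first avoids overflow for large limits.
 */
uint64_t bdi_dirty_cap(uint64_t global_dirty_bytes, unsigned int max_ratio)
{
	assert(max_ratio <= 100);
	return global_dirty_bytes / 100 * max_ratio;
}
```

With about 3.2GB of dirty memory allowed globally (the 16GB machine
discussed earlier in the thread) and max_ratio lowered to 1, a device
with strictlimit enabled would be held to roughly 32MB of dirty data
instead of being allowed to fill most of the global limit.
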