From mboxrd@z Thu Jan 1 00:00:00 1970
From: Maxim Patlasov
Subject: Re: [PATCH] mm: strictlimit feature -v4
Date: Thu, 22 Aug 2013 14:15:19 +0400
Message-ID: <5215E4B7.3020003@parallels.com>
References: <20130821135427.20334.79477.stgit@maximpc.sw.ru> <20130821133804.87ca602dd864df712e67342a@linux-foundation.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"; format=flowed
To: Andrew Morton
In-Reply-To: <20130821133804.87ca602dd864df712e67342a@linux-foundation.org>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On 08/22/2013 12:38 AM, Andrew Morton wrote:
> On Wed, 21 Aug 2013 17:56:32 +0400 Maxim Patlasov wrote:
>
>> The feature prevents mistrusted filesystems from growing a large number
>> of dirty pages before throttling. For such filesystems
>> balance_dirty_pages always checks bdi counters against bdi limits. I.e.
>> even if global "nr_dirty" is under "freerun", it's not allowed to skip
>> bdi checks. The only use case for now is fuse: it sets bdi max_ratio to
>> 1% by default and system administrators are supposed to expect that
>> this limit won't be exceeded.
>>
>> The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.
>> A filesystem may set the flag when it initializes its BDI.
> Now I think about it, I don't really understand the need for this
> feature. Can you please go into some detail about the problematic
> scenarios and why they need fixing? Including an expanded description
> of the term "mistrusted filesystem"?

By "mistrusted filesystem" I mean a FUSE mount created by an
unprivileged user. The userspace fuse library provides a suid binary,
"fusermount". Here is an excerpt from its man page:

> Filesystem in Userspace (FUSE) is a simple interface for userspace
> programs to export a virtual filesystem to the Linux kernel. It also
> aims to provide a secure method for non privileged users to create and
> mount their own filesystem implementations.
>
> fusermount is a program to mount and unmount FUSE filesystems.

I'm citing it here to emphasize that running a buggy or malicious
filesystem implementation is not purely theoretical. Whenever the fuse
library is properly installed, any user can compile and mount their own
filesystem implementation.

The problematic scenario comes from the fact that nobody pays attention
to the NR_WRITEBACK_TEMP counter (i.e. the number of pages under fuse
writeback). The implementation of fuse writeback releases the original
page (by calling end_page_writeback) almost immediately. A fuse request
queued for real processing carries a copy of the original page. Hence,
if the userspace fuse daemon doesn't complete write requests in a timely
manner, an aggressive mmap writer can fill virtually all memory with
those temporary fuse page copies. They are carefully accounted in
NR_WRITEBACK_TEMP, but nothing acts on that counter.

To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as shorthand for "the possibility of uncontrolled growth of the
amount of RAM consumed by temporary pages allocated by kernel fuse to
process writeback".

> Is this some theoretical happens-in-the-lab thing, or are real world
> users actually hurting due to the lack of this feature?

The problem was very easy to reproduce. There is a trivial example
filesystem implementation in the fuse userspace distribution:
fusexmp_fh.c. I added "sleep(1);" to the write methods, then recompiled
and mounted it. Then I created a huge file on the mount point and ran a
simple program which mmap-ed the file to a memory region and then wrote
data to the region. An hour later I observed almost all RAM consumed by
fuse writeback. Since then some unrelated changes in kernel fuse have
made it more difficult to reproduce, but it is still possible now.
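
The test program itself is not included in this message. As a rough illustration only, a minimal sketch of such an mmap writer might look like the following, assuming it does nothing more than dirty one byte per page of an existing file through a shared mapping. The function name dirty_mapped_file and the 0xAB fill byte are hypothetical and not taken from the original program.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original message): map an existing
 * file MAP_SHARED and touch one byte in every page, so that each page
 * becomes dirty and must eventually be written back by the filesystem.
 * Returns the number of bytes mapped, or -1 on error.
 */
long long dirty_mapped_file(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        fprintf(stderr, "%s: empty file, nothing to map\n", path);
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    /* One store per page is enough to mark the whole page dirty. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < len; off += page)
        map[off] = 0xAB;

    if (munmap(map, len) < 0) {
        perror("munmap");
        return -1;
    }
    return (long long)len;
}
```

Calling this helper from a small main() on a large file that lives on the slowed-down fuse mount would produce dirty pages much faster than the daemon completes writes, which is the situation described above. How quickly memory fills up depends on the kernel version and on the fuse daemon, so no particular timing is implied here.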

Putting this theoretical happens-in-the-lab thing aside, there is
another thing that really hurts real world (FUSE) users: the
write-through page cache policy FUSE currently uses. I.e. when handling
write(2), kernel fuse populates the page cache and flushes user data to
the server synchronously. This is badly suboptimal. Pavel Emelyanov's
patches ("writeback cache policy") solve the problem, but they also make
resolving the NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise,
simply copying a huge file to a fuse mount would result in memory
starvation. Miklos, the maintainer of FUSE, believes the strictlimit
feature is the way to go.

And finally, putting FUSE topics aside, there is one more use case for
the strictlimit feature. Using a slow USB stick (mass storage) in a
machine with a huge amount of RAM installed is a well-known pain. Let's
do a simple computation. Assuming 64GB of RAM installed, the existing
implementation of balance_dirty_pages will start throttling only after
9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So, the
command "cp 9GB_file /media/my-usb-storage/" may return in a few
seconds, but the subsequent "umount /media/my-usb-storage/" will take
more than two hours if the effective throughput of the storage is, say,
1MB/sec (9GB at 1MB/sec is about 9000 seconds).

After inclusion of the strictlimit feature, it will be trivial to add a
knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
demand, manually or via a udev rule. Maybe I'm wrong, but it seems
quite natural to want to limit the amount of dirty memory for devices
we do not fully trust (in the sense of sustainable throughput).

> I think I'll apply it to -mm for now to get a bit of testing, but would
> very much like it if Fengguang could find time to review the
> implementation, please.

Great! Fengguang, please...

Thanks,
Maxim