* [RFC] memory cgroup: my thoughts on memsw
@ 2014-09-04 14:30 Vladimir Davydov
  2014-09-04 22:03 ` Kamezawa Hiroyuki
  2014-09-15 19:14 ` Johannes Weiner
  0 siblings, 2 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-04 14:30 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko
Cc: Greg Thelen, Hugh Dickins, Kamezawa Hiroyuki, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

Hi,

Over its long history the memory cgroup has been developed rapidly, but rather in a disordered manner. As a result, today we have a bunch of features that are practically unusable and want redesign (soft limits), or are even not working (kmem accounting), not to mention the messy user interface we have (the _in_bytes suffix is driving me mad :-).

Fortunately, thanks to Tejun's unified cgroup hierarchy, we have a great chance to drop or redesign some of the old features and their interfaces. We should use this opportunity to examine every aspect of the memory cgroup design, because we will probably not be granted such a present in the future.

That's why I'm starting a series of RFCs with *my thoughts* not only on kmem accounting, which I've been trying to fix for a while, but also on other parts of the memory cgroup. I'll be happy if anybody reads this to the end, but please don't kick me too hard if something looks stupid to you :-)

Today's topic is (surprisingly!) the memsw resource counter and where it fails to satisfy user requests.

Let's start from the very beginning. The memory cgroup has basically two resource counters (not counting kmem, which is unusable anyway): mem_cgroup->res (configured by memory.limit), which counts the total amount of user pages charged to the cgroup, and mem_cgroup->memsw (memory.memsw.limit), which is basically res plus the cgroup's swap usage. Obviously, memsw always has both its value and its limit greater than or equal to those of res.
That said, we have three options:

- memory.limit=inf, memory.memsw.limit=inf
  No limits, only accounting.

- memory.limit=L<inf, memory.memsw.limit=inf
  Not allowed to use more than L bytes of user pages, but use as much swap as you want.

- memory.limit=L<inf, memory.memsw.limit=S<inf, L<=S
  Not allowed to use more than L bytes of user memory. Swap *plus* memory usage is limited by S.

When it comes to *hard* limits everything looks fine, but hard limits are not efficient for partitioning a large system among lots of containers, because it's hard to predict the right value for the limit; besides, many workloads will do better when they are granted more file caches. There we need a kind of soft limit that is only used on global memory pressure, to shrink containers exceeding it.

Obviously the soft limit must be less than memory.limit and therefore less than memory.memsw.limit. And here comes a problem. Suppose the admin sets a relatively high memsw.limit (say half of RAM) and a low soft limit for a container, hoping it will use the headroom for file caches when there's free memory, but when hard times come it will be shrunk back to the soft limit quickly. Suppose the container, instead of using the granted memory for caches, creates a lot of anonymous data, filling up to its memsw limit (i.e. half of RAM). Then, when the admin starts other containers, he might find out that they can effectively use only half of RAM. Why can this happen? See below.

For example, this happens if there's no swap, or only a little. It's pretty common for customers not to bother about creating TBs of swap to back the TBs of RAM they have. One might propose to issue OOM if we can't reclaim anything from a container exceeding its soft limit. OK, let it be so, although it's still not agreed upon AFAIK.

Another case. There's plenty of swap space out there, so that we can swap out the guilty container completely. However, it will take us a considerable amount of time, especially if the container isn't standing still but keeps touching its data.
If other containers are mostly using file caches, they will experience heavy pressure for a long time, not to mention the slowdown caused by high disk usage. Unfair. One might object that we can set a limit on IO operations for the culprit (more limits and dependencies among them; I doubt admins will be happy!). This will slow it down and guarantee it won't be swapping back in pages that are being swapped out due to high memory pressure. However, disks have limited speed, which means it doesn't solve the problem of unfairly slowing down other containers. What is worse, by imposing an IO limit we will slow down swap out ourselves! We shouldn't ignore the IO limit for swap out, otherwise the system will be prone to DoS attacks on the disk from inside containers, which is exactly what the IO limit (like any other limit) is supposed to protect against.

Or perhaps I'm missing something, and malicious behaviour isn't considered when developing cgroups?!

To sum it up, the current mem + memsw configuration scheme doesn't allow us to limit swap usage if we want to partition the system dynamically using soft limits. Actually, it also looks rather confusing to me. We have a mem limit and a mem+swap limit. I bet that at first glance an average admin will think it's possible to limit swap usage by setting the limits so that the difference between memory.memsw.limit and memory.limit equals the maximal swap usage, but (surprise!) it isn't really so. It holds if there's no global memory pressure, but otherwise swap usage is only limited by memory.memsw.limit! IMHO, that isn't obvious at all.

Finally, my understanding (may be crazy!) of how things should be configured. Just like now, there should be mem_cgroup->res, accounting and limiting total user memory (cache+anon) usage for processes inside cgroups. Here there's nothing to do. However, mem_cgroup->memsw should be reworked to account *only* memory that may be swapped out plus memory that has been swapped out (i.e.
swap usage). This way, by setting memsw.limit (or whatever it should be called) less than the memory soft limit, we would solve the problem I described above. The container would then be allowed to use only file caches above its memsw.limit, which are usually easily shrinkable, and would get OOM-killed when trying to eat too much swappable memory.

The configuration will also be less confusing then, IMO:

- memory.limit - container can't use memory above this
- memory.memsw.limit - container can't use swappable memory above this

From this it clearly follows that maximal swap usage is limited by memory.memsw.limit.

One more thought. Anon memory and file caches are different and should be handled differently, so mixing them both under the same counter looks strange to me. Moreover, they are *already* handled differently throughout the kernel; just look at mm/vmscan.c. Here are the differences between them as I see them:

- Anon memory is managed by the user application, while file caches are entirely up to the kernel. That means the application will *definitely* die w/o anon memory. W/o file caches it can usually survive, but the more caches it has the better it feels.

- Anon memory is not that easy to reclaim. Swap out is a really slow process, because data are usually read/written in no specific order. Dropping file caches is much easier. Typically we have lots of clean pages there.

- Swap space is limited. And today it's OK to have TBs of RAM and only several GBs of swap. Customers simply don't want to waste their disk space on that.

IMO, these lead us to the need for limiting swap/swappable memory usage, but not swap+mem usage.

Now, a bad thing about such a change (if it were ever considered). There's no way to convert old settings to new ones, i.e. if we currently have

  mem <= L,
  mem + swap <= S,
  L <= S,

we can set

  mem <= L1,
  swappable_mem <= S1,

where either

  L1 = L, S1 = S

or

  L1 = L, S1 = S - L,

but neither configuration will be exactly equivalent.
In the first case memory+swap usage will be limited by L+S, not by S. In the second case, although memory+swap <= S, the container won't be able to use more than S-L anonymous memory. This is the price we would have to pay if we decided to go with this change...

Questions, comments, complaints, threats?

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 19+ messages in thread
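The mem/memsw accounting scheme described in the message above can be sketched as a toy model. This is illustrative Python, not kernel code; MemCgroup and its methods are invented for this example:

```python
INF = float("inf")

class MemCgroup:
    """Toy model of the current scheme: res counts user pages in RAM,
    memsw counts user pages in RAM plus pages swapped out."""

    def __init__(self, mem_limit=INF, memsw_limit=INF):
        self.res = 0
        self.memsw = 0
        self.mem_limit = mem_limit
        self.memsw_limit = memsw_limit

    def charge(self, n):
        """Charging a user page bumps both counters; in the kernel a
        failed charge triggers reclaim and possibly OOM."""
        if self.res + n > self.mem_limit or self.memsw + n > self.memsw_limit:
            return False
        self.res += n
        self.memsw += n
        return True

    def swap_out(self, n):
        """Swap-out uncharges res but not memsw: the pages moved to
        swap, and memsw counts memory *plus* swap."""
        self.res -= n

cg = MemCgroup(mem_limit=100, memsw_limit=150)
assert cg.charge(100)          # fill RAM up to memory.limit
cg.swap_out(40)                # now 60 pages in RAM, 40 in swap
assert cg.res == 60 and cg.memsw == 100
assert cg.charge(40)           # RAM full again; mem+swap is now 140
cg.swap_out(30)                # 70 in RAM, 70 in swap
assert not cg.charge(20)       # fits under memory.limit, but mem+swap > 150
assert cg.memsw >= cg.res      # memsw usage always >= res usage
```

The last two asserts are the two properties the message relies on: S caps memory plus swap together, and memsw's value never falls below res's.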
* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-04 14:30 [RFC] memory cgroup: my thoughts on memsw Vladimir Davydov
@ 2014-09-04 22:03 ` Kamezawa Hiroyuki
  [not found]   ` <5408E1CD.3090004-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2014-09-15 19:14 ` Johannes Weiner
  1 sibling, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-04 22:03 UTC (permalink / raw)
To: Vladimir Davydov, Johannes Weiner, Michal Hocko
Cc: Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

(2014/09/04 23:30), Vladimir Davydov wrote:
[snip]
> The configuration will also be less confusing then IMO:
>
> - memory.limit - container can't use memory above this
> - memory.memsw.limit - container can't use swappable memory above this
[snip]

If one hits the anon+swap limit, it just means OOM. Hitting the limit means a process's death. Is it useful?

Thanks,
-Kame

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] memory cgroup: my thoughts on memsw
  [not found] ` <5408E1CD.3090004-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2014-09-05 8:28 ` Vladimir Davydov
  2014-09-05 14:20 ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-05 8:28 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

Hi Kamezawa,

Thanks for reading this :-)

On Fri, Sep 05, 2014 at 07:03:57AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/04 23:30), Vladimir Davydov wrote:
> > - memory.limit - container can't use memory above this
> > - memory.memsw.limit - container can't use swappable memory above this
>
> If one hits anon+swap limit, it just means OOM. Hitting limit means
> process's death.

Basically yes. Hitting memory.limit results in swap out plus cache reclaim, no matter whether it's an anon charge or a page cache one. Hitting the swappable memory limit (anon+swap) can only occur on an anon charge, and if it happens we have no choice but to invoke OOM.

Frankly, I don't see anything wrong in such a behavior. Why is it worse than the current behavior, where we also kill processes if a cgroup reaches memsw.limit and we can't reclaim page caches?

I admit I may be missing something, so I'd appreciate it if you could provide me with a use case where we want *only* the current behavior and my proposal is a no-go.

> Is it useful ?

I think so, at least if we want to use soft limits. The point is we will have to kill a process if it eats too much anon memory *anyway* when it comes to global memory pressure, but before finishing it we'll be torturing the culprit as well as *innocent* processes by issuing massive reclaim, as I tried to point out in the example above. IMO, this is no good.
Besides, I believe such a distinction between swappable memory and caches would look more natural to users. Everyone is used to it, actually. For example, when an admin or user or any userspace utility looks at the output of free(1), they primarily pay attention to free memory "-/+ buffers/cache", because almost all memory is usually full of file caches. And they know that caches easy come, easy go. IMO, for them it'd be more useful to limit this to avoid nasty surprises in the future, and only set some hints for page cache reclaim.

The only exception is strict sand-boxing, but AFAIU we can sand-box apps perfectly well with this scheme too, because we would still have a strict memory limit and a limit on maximal swap usage.

I'm sorry if the idea looks totally stupid to you (maybe it is!), but let's just try to consider every possibility we have in mind.

Thanks,
Vladimir

^ permalink raw reply [flat|nested] 19+ messages in thread
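The behavior Vladimir defends here — reclaim on memory.limit, OOM only when an anon charge exceeds the swappable-memory limit — can be sketched as a toy model (illustrative Python; the class and return values are invented for this example):

```python
INF = float("inf")

class MemCgroupProposed:
    """Toy model of the proposed scheme: the second counter tracks only
    swappable memory (anon in RAM plus swap), not anon+cache+swap."""

    def __init__(self, mem_limit=INF, swappable_limit=INF):
        self.anon = 0    # anonymous pages in RAM
        self.cache = 0   # file cache pages
        self.swap = 0    # pages swapped out
        self.mem_limit = mem_limit
        self.swappable_limit = swappable_limit

    def charge_cache(self, n):
        # Caches are bounded only by memory.limit and are easy to reclaim.
        if self.anon + self.cache + n > self.mem_limit:
            return "reclaim"
        self.cache += n
        return "ok"

    def charge_anon(self, n):
        # Anon is additionally bounded by the swappable-memory limit;
        # exceeding it leaves nothing reclaimable but anon itself: OOM.
        if self.anon + self.swap + n > self.swappable_limit:
            return "oom"
        if self.anon + self.cache + n > self.mem_limit:
            return "reclaim"
        self.anon += n
        return "ok"

cg = MemCgroupProposed(mem_limit=100, swappable_limit=40)
assert cg.charge_anon(40) == "ok"        # swappable memory capped at 40
assert cg.charge_cache(60) == "ok"       # caches may fill the rest
assert cg.charge_anon(1) == "oom"        # anon over the cap: OOM right away
assert cg.charge_cache(10) == "reclaim"  # cache over memory.limit: reclaim
# Maximal swap usage follows directly: swap <= anon + swap <= 40.
```

The last two asserts show the asymmetry under discussion: only anon charges over the swappable limit lead to OOM, while cache charges are always answered with reclaim.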
* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-05 8:28 ` Vladimir Davydov
@ 2014-09-05 14:20 ` Kamezawa Hiroyuki
  2014-09-05 16:00 ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-05 14:20 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

(2014/09/05 17:28), Vladimir Davydov wrote:
> On Fri, Sep 05, 2014 at 07:03:57AM +0900, Kamezawa Hiroyuki wrote:
>> (2014/09/04 23:30), Vladimir Davydov wrote:
>>> - memory.limit - container can't use memory above this
>>> - memory.memsw.limit - container can't use swappable memory above this
>>
>> If one hits anon+swap limit, it just means OOM. Hitting limit means
>> process's death.
>
> Basically yes. Hitting the memory.limit will result in swap out + cache
> reclaim no matter if it's an anon charge or a page cache one. Hitting
> the swappable memory limit (anon+swap) can only occur on anon charge and
> if it happens we have no choice rather than invoking OOM.
>
> Frankly, I don't see anything wrong in such a behavior. Why is it worse
> than the current behavior where we also kill processes if a cgroup
> reaches memsw.limit and we can't reclaim page caches?

IIUC, it's the same behavior as on a system without cgroups.

> I admit I may be missing something. So I'd appreciate if you could
> provide me with a use case where we want *only* the current behavior and
> my proposal is a no-go.

Basically, I don't like OOM kill. Nobody does, I think.

In recent container use, applications may be built as "stateless", and kill-and-respawn may not be problematic, but I think killing "a" process by oom-kill is too naive.

If your proposal is about triggering a notification to user space at hitting the anon+swap limit, it may be useful.
...Some container-cluster management software can handle it. For example, the container may be restarted.

Memcg has a threshold notifier and a vmpressure notifier. I think you can enhance them.

>> Is it useful ?
>
> I think so, at least, if we want to use soft limits. The point is we
> will have to kill a process if it eats too much anon memory *anyway*
> when it comes to global memory pressure, but before finishing it we'll
> be torturing the culprit as well as *innocent* processes by issuing
> massive reclaim, as I tried to point out in the example above. IMO, this
> is no good.

My point is that "killing a process" tends not to be able to fix the situation. For example, a fork bomb by "make -j" cannot be handled by it.

So, I don't want to think about enhancing OOM kill. Please think of a better way to survive. With the help of container-management software, I think we can have several choices.

Restarting the container (killall) may be the best if the container app is stateless. Or container management can provide some failover.

> Besides, I believe such a distinction between swappable memory and
> caches would look more natural to users. Everyone got used to it
> actually. For example, when an admin or user or any userspace utility
> looks at the output of free(1), it primarily pays attention to free
> memory "-/+ buffers/caches", because almost all memory is usually full
> with file caches. And they know that caches easy come, easy go. IMO, for
> them it'd be more useful to limit this to avoid nasty surprises in the
> future, and only set some hints for page cache reclaim.
>
> The only exception is strict sand-boxing, but AFAIU we can sand-box apps
> perfectly well with this either, because we would still have a strict
> memory limit and a limit on maximal swap usage.
>
> Please sorry if the idea looks to you totally stupid (may be it is!),
> but let's just try to consider every possibility we have in mind.
>
The first reason we added memsw.limit was to prevent the whole swap space from being used up by a cgroup where a memory leak or a fork bomb is running, not to enable some intelligent control.

From your opinion, I feel what you want is to avoid charging page caches. But thinking of Docker et al., page cache is not shared between containers any more. I think "including cache" makes sense.

Thanks,
-Kame

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] memory cgroup: my thoughts on memsw 2014-09-05 14:20 ` Kamezawa Hiroyuki @ 2014-09-05 16:00 ` Vladimir Davydov 2014-09-05 23:15 ` Kamezawa Hiroyuki 0 siblings, 1 reply; 19+ messages in thread From: Vladimir Davydov @ 2014-09-05 16:00 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML On Fri, Sep 05, 2014 at 11:20:43PM +0900, Kamezawa Hiroyuki wrote: > Basically, I don't like OOM Kill. Anyone don't like it, I think. > > In recent container use, application may be build as "stateless" and > kill-and-respawn may not be problematic, but I think killing "a" process > by oom-kill is too naive. > > If your proposal is triggering notification to user space at hitting > anon+swap limit, it may be useful. > ...Some container-cluster management software can handle it. > For example, container may be restarted. > > Memcg has threshold notifier and vmpressure notifier. > I think you can enhance it. [...] > My point is that "killing a process" tend not to be able to fix the situation. > For example, fork-bomb by "make -j" cannot be handled by it. > > So, I don't want to think about enhancing OOM-Kill. Please think of better > way to survive. With the help of countainer-management-softwares, I think > we can have several choices. > > Restart contantainer (killall) may be the best if container app is stateless. > Or container-management can provide some failover. The problem I'm trying to set out is not about OOM actually (sorry if the way I explain is confusing). We could probably configure OOM to kill a whole cgroup (not just a process) and/or improve user-notification so that the userspace could react somehow. I'm sure it must and will be discussed one day. 
The problem is that *before* invoking OOM on *global* pressure we try to reclaim containers' memory, and as long as there's progress we won't invoke OOM. This can result in a huge slowdown of the whole system (due to swap out).

And if we want to fully make use of soft limits, we currently have no means to limit anon memory at all. It's just impossible, because memsw.limit must be > the soft limit, otherwise the latter makes no sense. So we will keep trying to swap out under global pressure until we finally realize there's no point in it and call OOM. If we don't, we'll be suffering until the load goes away by itself.

> The 1st reason we added memsw.limit was for avoiding that the whole swap
> is used up by a cgroup where memory-leak of forkbomb running and not for
> some intellegent controls.
>
> From your opinion, I feel what you want is avoiding charging against page-caches.
> But thiking docker at el, page-cache is not shared between containers any more.
> I think "including cache" makes sense.

Not exactly. It's not about sharing caches among containers. The point is that (1) it's difficult to estimate the size of file caches that will max out the performance of a container, and (2) a typical workload will perform better and put less pressure on the disk if it has more caches.

Now imagine a big host running a small number of containers and therefore having a lot of free memory most of the time, but still experiencing load spikes once an hour/day/whatever, when memory usage rises drastically. It'd be unwise to set hard limits for those containers, because they'd probably perform much better if they had more file caches. So the admin decides to use soft limits instead. He is forced to set memsw.limit > the soft limit, but this is unsafe, because the container may then eat anon memory up to memsw.limit, and anon memory isn't easy to get rid of when it comes to global pressure. If the admin had a means to limit swappable memory, he could avoid this.
This is what I was trying to illustrate with the example in the first e-mail of this thread.

Note that if there were no soft limits, the current setup would be just fine; with them, it fails. And soft limits have proved useful AFAIK.

Thanks,
Vladimir

^ permalink raw reply [flat|nested] 19+ messages in thread
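The numbers behind the scenario in this message can be made concrete. A sketch with hypothetical figures (host size, limits, and the assumed all-anon worst case are all invented for illustration):

```python
# Hypothetical host: the admin wants containers to shrink to their soft
# limit under global pressure, but must set memsw.limit above it.
RAM = 64     # GiB of RAM
soft = 8     # GiB: the soft limit the container should shrink to
memsw = 32   # GiB: must exceed the soft limit, or the latter is moot
assert memsw > soft

# Benign case: the container holds `soft` GiB of anon and fills the rest
# of its allowance with file caches, which drop quickly under pressure.
anon, cache = soft, memsw - soft
fast_reclaim = cache                # dropping clean caches is cheap
assert fast_reclaim == 24

# Bad case: the container holds `memsw` GiB of pure anon. Shrinking it
# back to the soft limit now means swapping out the difference.
anon, cache = memsw, 0
slow_swap_out = anon - soft         # GiB of swap I/O before relief
assert slow_swap_out == 24
# Same amount of memory to reclaim in both cases, but in the bad case it
# is all slow swap-out, and with no (or little) swap configured it
# cannot be reclaimed at all.
```

The point of the comparison: the current scheme cannot forbid the bad case, because nothing caps anon below memsw.limit.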
* Re: [RFC] memory cgroup: my thoughts on memsw 2014-09-05 16:00 ` Vladimir Davydov @ 2014-09-05 23:15 ` Kamezawa Hiroyuki 2014-09-08 11:01 ` Vladimir Davydov 0 siblings, 1 reply; 19+ messages in thread From: Kamezawa Hiroyuki @ 2014-09-05 23:15 UTC (permalink / raw) To: Vladimir Davydov Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML (2014/09/06 1:00), Vladimir Davydov wrote: > On Fri, Sep 05, 2014 at 11:20:43PM +0900, Kamezawa Hiroyuki wrote: >> Basically, I don't like OOM Kill. Anyone don't like it, I think. >> >> In recent container use, application may be build as "stateless" and >> kill-and-respawn may not be problematic, but I think killing "a" process >> by oom-kill is too naive. >> >> If your proposal is triggering notification to user space at hitting >> anon+swap limit, it may be useful. >> ...Some container-cluster management software can handle it. >> For example, container may be restarted. >> >> Memcg has threshold notifier and vmpressure notifier. >> I think you can enhance it. > [...] >> My point is that "killing a process" tend not to be able to fix the situation. >> For example, fork-bomb by "make -j" cannot be handled by it. >> >> So, I don't want to think about enhancing OOM-Kill. Please think of better >> way to survive. With the help of countainer-management-softwares, I think >> we can have several choices. >> >> Restart contantainer (killall) may be the best if container app is stateless. >> Or container-management can provide some failover. > > The problem I'm trying to set out is not about OOM actually (sorry if > the way I explain is confusing). We could probably configure OOM to kill > a whole cgroup (not just a process) and/or improve user-notification so > that the userspace could react somehow. I'm sure it must and will be > discussed one day. 
>
> The problem is that *before* invoking OOM on *global* pressure we're
> trying to reclaim containers' memory and if there's progress we won't
> invoke OOM. This can result in a huge slow down of the whole system (due
> to swap out).

Use an SSD or zram as the swap device.

>> The 1st reason we added memsw.limit was for avoiding that the whole swap
>> is used up by a cgroup where memory-leak of forkbomb running and not for
>> some intellegent controls.
>>
>> From your opinion, I feel what you want is avoiding charging against page-caches.
>> But thiking docker at el, page-cache is not shared between containers any more.
>> I think "including cache" makes sense.
>
> Not exactly. It's not about sharing caches among containers. The point
> is (1) it's difficult to estimate the size of file caches that will max
> out the performance of a container, and (2) a typical workload will
> perform better and put less pressure on disk if it has more caches.
>
> Now imagine a big host running a small number of containers and
> therefore having a lot of free memory most of time, but still
> experiencing load spikes once an hour/day/whatever when memory usage
> raises up drastically. It'd be unwise to set hard limits for those
> containers that are running regularly, because they'd probably perform
> much better if they had more file caches. So the admin decides to use
> soft limits instead. He is forced to use memsw.limit > the soft limit,
> but this is unsafe, because the container may eat anon memory up to
> memsw.limit then, and anon memory isn't easy to get rid of when it comes
> to the global pressure. If the admin had a mean to limit swappable
> memory, he could avoid it. This is what I was trying to illustrate by
> the example in the first e-mail of this thread.
>
> Note if there were no soft limits, the current setup would be just fine,
> otherwise it fails. And soft limits are proved to be useful AFAIK.

As you noticed, hitting the anon+swap limit just means oom-kill.
My point is that using the oom-killer for "server management" just seems crazy. Let me clarify things. Your proposal was: 1. soft-limit will be a main feature for server management. 2. Because of the soft limit, global memory reclaim runs. 3. Using swap at global memory reclaim can cause poor performance. 4. So, make use of the OOM-Killer to avoid swap. I can't agree with "4". I think: - don't configure swap. - use zram. - use SSD for swap. Or - provide a way to notify usage of "anon+swap" to container management software. Now we have "vmpressure". Container management software can kill or respawn a container using a user-defined policy for avoiding swap. If you don't want to run kswapd at all, a threshold notifier enhancement may be required. /proc/meminfo provides the total number of ANON/CACHE pages. Many things can be done in userland. And your idea can't help with swap-out caused by memory pressure that comes from "zones". I guess vmpressure will be a total win. The kernel may need some enhancement, but I don't like making the oom-killer part of a feature for avoiding swap. Thanks, -Kame ^ permalink raw reply [flat|nested] 19+ messages in thread
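Kame's suggestion above — watch anon+swap from userland via /proc/meminfo and let the container manager react — could be sketched roughly as follows. This is an illustrative sketch, not code from the thread: the field names are the standard /proc/meminfo ones, but the threshold value and the print-based "policy" are made up.

```python
import os

def anon_plus_swap_kb(meminfo_text):
    """Compute system-wide anon+swap usage in kB from /proc/meminfo text.

    Anonymous memory is AnonPages; swap in use is SwapTotal - SwapFree.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])  # /proc/meminfo values are in kB
    return fields["AnonPages"] + (fields["SwapTotal"] - fields["SwapFree"])

if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        usage_kb = anon_plus_swap_kb(f.read())
    THRESHOLD_KB = 4 * 1024 * 1024  # hypothetical 4 GB policy threshold
    if usage_kb > THRESHOLD_KB:
        print("anon+swap above threshold: %d kB" % usage_kb)
```

A real container manager would feed such a reading into its kill/respawn policy rather than printing, and would read a per-cgroup source instead of the global /proc/meminfo.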
* Re: [RFC] memory cgroup: my thoughts on memsw 2014-09-05 23:15 ` Kamezawa Hiroyuki @ 2014-09-08 11:01 ` Vladimir Davydov 2014-09-08 13:53 ` Kamezawa Hiroyuki 0 siblings, 1 reply; 19+ messages in thread From: Vladimir Davydov @ 2014-09-08 11:01 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote: > As you noticed, hitting anon+swap limit just means oom-kill. > My point is that using oom-killer for "server management" just seems crazy. > > Let my clarify things. your proposal was. > 1. soft-limit will be a main feature for server management. > 2. Because of soft-limit, global memory reclaim runs. > 3. Using swap at global memory reclaim can cause poor performance. > 4. So, making use of OOM-Killer for avoiding swap. > > I can't agree "4". I think > > - don't configure swap. Suppose there are two containers, each having soft limit set to 50% of total system RAM. One of the containers eats 90% of the system RAM by allocating anonymous pages. Another starts using file caches and wants more than 10% of RAM to work w/o issuing disk reads. So what should we do then? We won't be able to shrink the first container to its soft limit, because there's no swap. Leaving it as is would be unfair from the second container's point of view. Kill it? But the whole system is going OK, because the working set of the second container is easily shrinkable. Besides there may be some progress in shrinking file caches from the first container. > - use zram In fact this isn't different from the previous proposal (working w/o swap). ZRAM only compresses data while still storing them in RAM so we eventually may get into a situation where almost all RAM is full of compressed anon pages. 
> - use SSD for swap Such a requirement might be OK in enterprise, but forcing SMB to update their hardware to run a piece of software is a no go. And again, SSD isn't infinite, we may use it up. > Or > - provide a way to notify usage of "anon+swap" to container management software. > > Now we have "vmpressure". Container management software can kill or respawn container > with using user-defined policy for avoidng swap. > > If you don't want to run kswapd at all, threshold notifier enhancement may be required. > > /proc/meminfo provides total number of ANON/CACHE pages. > Many things can be done in userland. AFAIK OOM-in-userspace-handling has been discussed many times, but there's still no agreement upon it. Basically it isn't reliable, because it can lead to a deadlock if the userspace handler won't be able to allocate memory to proceed or will get stuck in some other way. IMO there must be in-kernel OOM-handling as a last resort anyway. And actually we already have one - we may kill processes when they hit the memsw limit. But OK, you don't like OOM on hitting anon+swap limit and propose to introduce a kind of userspace notification instead, but the problem actually isn't *WHAT* we should do on hitting anon+swap limit, but *HOW* we should implement it (or should we implement it at all). No matter which way we go, in-kernel OOM or userland notifications, we have to *INTRODUCE ANON+SWAP ACCOUNTING* to achieve that so that on breaching a predefined threshold we could invoke OOM or issue a userland notification or both. And here goes the problem: there's anon+file and anon+file+swap resource counters, but no anon+swap counter. To react on anon+swap limit breaching, we must introduce one. I propose to *REUSE* memsw instead by slightly modifying its meaning. What we would get then is the ability to react on potentially unreclaimable memory growth inside a container. 
What we would lose is the current implementation of the memory+swap limit, *BUT* we would still be able to limit memory+swap usage by imposing limits on total memory and anon+swap usage. > And your idea can't help swap-out caused by memory pressure comes from "zones". It would help limit swap-out to a sane value. I'm sorry if I'm not clear or don't understand something that looks trivial to you. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 19+ messages in thread
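Vladimir's point that the two proposed limits together still bound memory+swap can be made concrete with a few lines of arithmetic (an illustrative sketch; the function name and the numbers are made up):

```python
def max_mem_plus_swap(mem_limit, anonswap_limit):
    """Worst-case memory+swap under a memory limit plus an anon+swap limit.

    memory usage = anon + file <= mem_limit
    anon + swap                <= anonswap_limit
    so   memory + swap = (anon + swap) + file <= anonswap_limit + mem_limit
    """
    return mem_limit + anonswap_limit

GB = 1 << 30
# The bound is reached when file caches fill the whole memory limit while
# the anon+swap limit is consumed entirely by swapped-out anonymous pages.
assert max_mem_plus_swap(2 * GB, 1 * GB) == 3 * GB
```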
* Re: [RFC] memory cgroup: my thoughts on memsw 2014-09-08 11:01 ` Vladimir Davydov @ 2014-09-08 13:53 ` Kamezawa Hiroyuki 2014-09-09 10:39 ` Vladimir Davydov 2014-09-10 12:01 ` Vladimir Davydov 0 siblings, 2 replies; 19+ messages in thread From: Kamezawa Hiroyuki @ 2014-09-08 13:53 UTC (permalink / raw) To: Vladimir Davydov Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML (2014/09/08 20:01), Vladimir Davydov wrote: > On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote: >> As you noticed, hitting anon+swap limit just means oom-kill. >> My point is that using oom-killer for "server management" just seems crazy. >> >> Let my clarify things. your proposal was. >> 1. soft-limit will be a main feature for server management. >> 2. Because of soft-limit, global memory reclaim runs. >> 3. Using swap at global memory reclaim can cause poor performance. >> 4. So, making use of OOM-Killer for avoiding swap. >> >> I can't agree "4". I think >> >> - don't configure swap. > > Suppose there are two containers, each having soft limit set to 50% of > total system RAM. One of the containers eats 90% of the system RAM by > allocating anonymous pages. Another starts using file caches and wants > more than 10% of RAM to work w/o issuing disk reads. So what should we > do then? > We won't be able to shrink the first container to its soft > limit, because there's no swap. Leaving it as is would be unfair from > the second container's point of view. Kill it? But the whole system is > going OK, because the working set of the second container is easily > shrinkable. Besides there may be some progress in shrinking file caches > from the first container. > >> - use zram > > In fact this isn't different from the previous proposal (working w/o > swap). 
ZRAM only compresses data while still storing them in RAM so we > eventually may get into a situation where almost all RAM is full of > compressed anon pages. In the above 2 cases, "vmpressure" works fine. > - use SSD for swap > > Such a requirement might be OK in enterprise, but forcing SMB to update > their hardware to run a piece of software is a no-go. And again, SSD > isn't infinite, we may use it up. Ditto. >> Or >> - provide a way to notify usage of "anon+swap" to container management software. >> >> Now we have "vmpressure". Container management software can kill or respawn a container >> using a user-defined policy for avoiding swap. >> >> If you don't want to run kswapd at all, a threshold notifier enhancement may be required. >> >> /proc/meminfo provides the total number of ANON/CACHE pages. >> Many things can be done in userland. > > AFAIK OOM-in-userspace-handling has been discussed many times, but > there's still no agreement upon it. Basically it isn't reliable, because > it can lead to a deadlock if the userspace handler won't be able to > allocate memory to proceed or will get stuck in some other way. IMO > there must be in-kernel OOM-handling as a last resort anyway. And > actually we already have one - we may kill processes when they hit the > memsw limit. > > But OK, you don't like OOM on hitting anon+swap limit and propose to > introduce a kind of userspace notification instead, but the problem > actually isn't *WHAT* we should do on hitting anon+swap limit, but *HOW* > we should implement it (or should we implement it at all). I'm not sure whether you're aware of it or not, but the "hardlimit" counter is too expensive for your purpose. If I were you, I'd use some lightweight counter like percpu_counter() or memcg's event handling system. Did you see how the threshold notifier or vmpressure works? It's very lightweight. 
> No matter which way we go, in-kernel OOM or userland notifications, we have to > *INTRODUCE ANON+SWAP ACCOUNTING* to achieve that so that on breaching a > predefined threshold we could invoke OOM or issue a userland > notification or both. And here goes the problem: there's anon+file and > anon+file+swap resource counters, but no anon+swap counter. To react on > anon+swap limit breaching, we must introduce one. I propose to *REUSE* > memsw instead by slightly modifying its meaning. You can see "anon+swap" via memcg's accounting. > What we would get then is the ability to react on potentially > unreclaimable memory growth inside a container. What we would lose is > the current implementation of the memory+swap limit, *BUT* we would still be > able to limit memory+swap usage by imposing limits on total memory and > anon+swap usage. I repeatedly say that an anon+swap "hardlimit" just means OOM. I don't buy that. >> And your idea can't help with swap-out caused by memory pressure that comes from "zones". > > It would help limit swap-out to a sane value. > > > I'm sorry if I'm not clear or don't understand something that looks > trivial to you. It seems your purpose is to avoid a system-wide OOM situation, right? Implementing system-wide-OOM-kill-avoidance logic in memcg doesn't sound good to me. It should work under system-wide memory management logic. If memcg can be a help for it, that will be good. For your purpose, you need to implement your method in a system-wide way. It seems crazy to set a per-cgroup anon limit for avoiding system-wide OOM. You'll need the help of system-wide cgroup-configuration middleware even if you have a method in a cgroup. If you say the logic should be in the OS kernel, please implement it as system-wide logic rather than in a cgroup. I think it's okay to add helper functionality in memcg if there is system-wide OOM-avoidance logic. Thanks, -Kame 
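Kame's remark that the hard-limit counter is too expensive refers to res_counter taking a shared spinlock on every charge. The batching idea behind the kernel's percpu_counter — each CPU accumulates a small local delta and only folds it into the shared total occasionally — can be modeled in userspace like this (a sketch of the concept using threads in place of CPUs, not the kernel code; the class name and batch size are made up):

```python
import threading

class BatchedCounter:
    """Userspace model of percpu_counter-style batching: the common-case
    add() only touches thread-local state; the shared lock is taken once
    per `batch` updates, trading a bounded read error for a cheap fast path."""

    def __init__(self, batch=32):
        self.batch = batch
        self.total = 0                  # shared value, lock-protected
        self.lock = threading.Lock()
        self.local = threading.local()  # per-thread unflushed delta

    def add(self, n):
        delta = getattr(self.local, "delta", 0) + n
        if abs(delta) >= self.batch:
            with self.lock:             # slow path: fold the delta in
                self.total += delta
            delta = 0
        self.local.delta = delta

    def read(self):
        # Approximate read: ignores deltas not yet flushed by other threads,
        # analogous to percpu_counter_read() vs. the expensive exact sum.
        return self.total

c = BatchedCounter(batch=4)
for _ in range(10):
    c.add(1)
# Two full batches of 4 have been flushed; 2 increments are still local.
assert c.read() == 8
```

The read error is bounded by batch × number of threads, which is exactly the trade-off that lets a hard limit stay cheap only if some slack is acceptable — and why an exact hard-limit counter is costlier than a notification threshold.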
* Re: [RFC] memory cgroup: my thoughts on memsw 2014-09-08 13:53 ` Kamezawa Hiroyuki @ 2014-09-09 10:39 ` Vladimir Davydov 2014-09-11 2:04 ` Kamezawa Hiroyuki 2014-09-10 12:01 ` Vladimir Davydov 1 sibling, 1 reply; 19+ messages in thread From: Vladimir Davydov @ 2014-09-09 10:39 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote: > (2014/09/08 20:01), Vladimir Davydov wrote: > >But OK, you don't like OOM on hitting anon+swap limit and propose to > >introduce a kind of userspace notification instead, but the problem > >actually isn't *WHAT* we should do on hitting anon+swap limit, but *HOW* > >we should implement it (or should we implement it at all). > > > I'm not sure you're aware of or not, "hardlimit" counter is too expensive > for your purpose. > > If I was you, I'll use some lightweight counter like percpu_counter() or > memcg's event handling system. > Did you see how threshold notifier or vmpressure works ? It's very light weight. OK, after looking through the memory thresholds code and pondering the problem a bit I tend to agree with you. We can tweak the notifiers to trigger on anon+swap thresholds, handle them in userspace and do whatever we like. At least for now, I don't see anything why this could be worse than hard anon+swap limit except it requires more steps to configure. Thank you for your patience while explaining this to me :-) However, there's one thing, which made me start this discussion, and it still bothers me. It's about memsw.limit_in_bytes knob itself. First, its value must be greater or equal to memory.limit_in_bytes. IMO, such a dependency in the user interface isn't great, but it isn't the worst thing. 
What is worse, there's only point in setting it to infinity if one wants to fully make use of soft limits as I pointed out earlier. So, we have a userspace knob that suits only for strict sand-boxing when one wants to hard-limit the amount of memory and swap an app can use. When it comes to soft limits, you have to set it to infinity, and it'll still be accounted at the cost of performance, but without any purpose. It just seems meaningless to me. Not counting that the knob itself is a kind of confusing IMO. memsw means memory+swap, so one would mistakenly think memsw.limit-mem.limit is the limit on swap usage, but that's wrong. My point is that anon+swap accounting instead of the current anon+file+swap memsw implementation would be more flexible. We could still sandbox apps by setting hard anon+swap and memory limits, but it would also be possible to make use of it in "soft" environments. It wouldn't be mandatory though. If one doesn't like OOM, he can use threshold notifications to restart the container when it starts to behave badly. But if the user just doesn't want to bother about configuration or is OK with OOM-killer, he could set hard anon+swap limit. Besides, it would untie mem.limit knob from memsw.limit, which would make the user interface simpler and cleaner. So, I think anon+swap limit would be more flexible than file+anon+swap limit we have now. Is there any use case where anon+swap and anon+file accounting couldn't satisfy the user requirements while the anon+file+swap and anon+file pair could? > >No matter which way we go, in-kernel OOM or userland notifications, we have to > >*INTRODUCE ANON+SWAP ACCOUNTING* to achieve that so that on breaching a > >predefined threshold we could invoke OOM or issue a userland > >notification or both. And here goes the problem: there's anon+file and > >anon+file+swap resource counters, but no anon+swap counter. To react on > >anon+swap limit breaching, we must introduce one. 
I propose to *REUSE* > >memsw instead by slightly modifying its meaning. > > > you can see "anon+swap" via memcg's accounting. > > >What we would get then is the ability to react on potentially > >unreclaimable memory growth inside a container. What we would loose is > >the current implementation of memory+swap limit, *BUT* we would still be > >able to limit memory+swap usage by imposing limits on total memory and > >anon+swap usage. > > > > I repeatedly say anon+swap "hardlimit" just means OOM. That's not buy. anon+file+swap hardlimit eventually means OOM too :-/ > >>And your idea can't help swap-out caused by memory pressure comes from "zones". > > > >It would help limit swap-out to a sane value. > > > > > >I'm sorry if I'm not clear or don't understand something that looks > >trivial to you. > > > > It seems your purpose is to avoiding system-wide-oom-situation. Right ? This is the purpose of any hard memory limit, including the current implementation - avoiding global memory pressure in general and system-wide OOM in particular. > Implementing system-wide-oom-kill-avoidance logic in memcg doesn't > sound good to me. It should work under system-wide memory management logic. > If memcg can be a help for it, it will be good. > > > For your purpose, you need to implement your method in system-wide way. > It seems crazy to set per-cgroup-anon-limit for avoding system-wide-oom. > You'll need help of system-wide-cgroup-configuration-middleware even if > you have a method in a cgroup. If you say logic should be in OS kernel, > please implement it in a system wide logic rather than cgroup. What if on global pressure a memory cgroup exceeding its soft limit is being reclaimed, but not fast enough, because it has a lot of anon memory? The global OOM won't be triggered then, because there's still progress, but the system will experience hard pressure due to the reclaimer runs. How can we detect if we should kill the container or not? 
It smells like one more heuristic to vmscan, IMO. Thanks, Vladimir 
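The notifier tweak Vladimir agrees to above builds on the cgroup-v1 memory threshold notifier, which is driven from userspace with an eventfd: a registration string "<event_fd> <fd of memory.usage_in_bytes> <threshold>" is written to cgroup.event_control, and a read on the eventfd then blocks until the threshold is crossed. A rough Python sketch (the function names and cgroup path are illustrative; v1 only exposes total usage here, not the anon+swap counter under discussion; os.eventfd() needs Python >= 3.10):

```python
import os

def event_control_line(event_fd, usage_fd, threshold_bytes):
    """Registration string for cgroup-v1's cgroup.event_control:
    "<eventfd> <fd of memory.usage_in_bytes> <threshold>"."""
    return "%d %d %d" % (event_fd, usage_fd, threshold_bytes)

def wait_for_threshold(cgroup_dir, threshold_bytes):
    """Block until the cgroup's memory usage crosses threshold_bytes.

    Requires a mounted v1 memory cgroup at cgroup_dir and suitable
    permissions; intended as a sketch of the registration dance.
    """
    efd = os.eventfd(0)
    ufd = os.open(os.path.join(cgroup_dir, "memory.usage_in_bytes"),
                  os.O_RDONLY)
    cfd = os.open(os.path.join(cgroup_dir, "cgroup.event_control"),
                  os.O_WRONLY)
    try:
        os.write(cfd, event_control_line(efd, ufd, threshold_bytes).encode())
        os.eventfd_read(efd)  # blocks until the kernel signals the event
    finally:
        for fd in (cfd, ufd, efd):
            os.close(fd)
```

A manager thread could call wait_for_threshold() per container and apply its restart/respawn policy when the call returns, which is the userland reaction loop discussed in this thread.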
* Re: [RFC] memory cgroup: my thoughts on memsw 2014-09-09 10:39 ` Vladimir Davydov @ 2014-09-11 2:04 ` Kamezawa Hiroyuki 2014-09-11 8:23 ` Vladimir Davydov 0 siblings, 1 reply; 19+ messages in thread From: Kamezawa Hiroyuki @ 2014-09-11 2:04 UTC (permalink / raw) To: Vladimir Davydov Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML (2014/09/09 19:39), Vladimir Davydov wrote: >> For your purpose, you need to implement your method in a system-wide way. >> It seems crazy to set a per-cgroup anon limit for avoiding system-wide OOM. >> You'll need the help of system-wide cgroup-configuration middleware even if >> you have a method in a cgroup. If you say the logic should be in the OS kernel, >> please implement it as system-wide logic rather than in a cgroup. > > What if on global pressure a memory cgroup exceeding its soft limit is > being reclaimed, but not fast enough, because it has a lot of anon > memory? The global OOM won't be triggered then, because there's still > progress, but the system will experience hard pressure due to the > reclaimer runs. How can we detect if we should kill the container or > not? It smells like one more heuristic to vmscan, IMO. That's what you are trying to implement with the per-cgroup anon+swap limit; the difference is heuristics by the system designer at container creation versus heuristics by the kernel in a dynamic way. I said it should be done by a system/cloud container scheduler based on notification. But okay, let me think of kernel help in global reclaim. - Assume "priority" is a value calculated from "usage - soft limit". - weighted kswapd/direct reclaim => Based on the priority of each thread/cgroup, increase the "wait" in direct reclaim if it's contended. A low-prio container will sleep longer until the memory contention is fixed. 
- weighted anon allocation Similar to the above: if memory is contended, page fault speed should be weighted based on priority (soft limit). - off-cpu direct reclaim Run direct reclaim in a workqueue with a cpu mask. The cpu mask is a global setting per NUMA node, which determines the cpus available for being used to reclaim memory. "How to wait" may affect the performance of the system, but this can allow masked cpus to be used for more important jobs. All of them will give a container manager time to consider the next action. Anyway, if swap is slow but necessary, you can use faster swap now. It's a good age. Thanks, -Kame 
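The "weighted direct reclaim" idea above — make tasks of a cgroup that is far over its soft limit wait longer when memory is contended — could be sketched as a simple backoff rule. Everything here (function name, constants, scaling) is invented for illustration; it is only a model of the proposal, not an implementation from the thread:

```python
def reclaim_backoff_ms(usage, soft_limit, base_ms=10, max_ms=1000):
    """Illustrative throttle for weighted direct reclaim: the further a
    cgroup is over its soft limit, the longer its tasks wait in direct
    reclaim while memory is contended. Constants are made up."""
    excess = max(0, usage - soft_limit)
    if excess == 0:
        return 0  # at or under the soft limit: no throttling
    # Scale the wait with the relative excess, capped at max_ms.
    factor = excess / max(soft_limit, 1)
    return min(max_ms, int(base_ms * (1 + factor * 10)))

GB = 1 << 30
# A cgroup 100% over its soft limit waits much longer than one 10% over,
# giving the container manager time to react, as Kame suggests.
assert reclaim_backoff_ms(2 * GB, 1 * GB) > reclaim_backoff_ms(1.1 * GB, 1 * GB)
```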
* Re: [RFC] memory cgroup: my thoughts on memsw 2014-09-11 2:04 ` Kamezawa Hiroyuki @ 2014-09-11 8:23 ` Vladimir Davydov 2014-09-11 8:53 ` Kamezawa Hiroyuki 0 siblings, 1 reply; 19+ messages in thread From: Vladimir Davydov @ 2014-09-11 8:23 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML On Thu, Sep 11, 2014 at 11:04:41AM +0900, Kamezawa Hiroyuki wrote: > (2014/09/09 19:39), Vladimir Davydov wrote: > > >>For your purpose, you need to implement your method in system-wide way. > >>It seems crazy to set per-cgroup-anon-limit for avoding system-wide-oom. > >>You'll need help of system-wide-cgroup-configuration-middleware even if > >>you have a method in a cgroup. If you say logic should be in OS kernel, > >>please implement it in a system wide logic rather than cgroup. > > > >What if on global pressure a memory cgroup exceeding its soft limit is > >being reclaimed, but not fast enough, because it has a lot of anon > >memory? The global OOM won't be triggered then, because there's still > >progress, but the system will experience hard pressure due to the > >reclaimer runs. How can we detect if we should kill the container or > >not? It smells like one more heuristic to vmscan, IMO. > > > That's you are trying to implement by per-cgroup-anon+swap-limit, the difference > is heuristics by system designer at container creation or heuristics by kernel in > the dynamic way. anon+swap limit isn't a heuristic, it's a configuration! The difference is that the user usually knows *minimal* requirements of the app he's going to run in a container/VM. Basing on them, he buys a container/VM with some predefined amount of RAM. 
From the whole system POV it's suboptimal to set the hard limit for the container by the user configuration, because there might be free memory, which could be used for file caches and hence lower disk load. If we had anon+swap hard limit, we could use it in conjunction with the soft limit instead of the hard limit. That would be more efficient than VM-like sand-boxing though still safe. When I'm talking about in-kernel heuristics, I mean a pile of hard-to-read functions with a bunch of obscure constants. This is much worse than providing the user with a convenient and flexible interface. > I said it should be done by system/cloud-container-scheduler based on notification. Basically, it's unsafe to hand this out to userspace completely. The system would be prone to DOS attacks from inside containers then. > But okay, let me think of kernel help in global reclaim. > > - Assume "priority" is a value calculated by "usage - soft limit". > > - weighted kswapd/direct reclaim > => Based on priority of each threads/cgroup, increase "wait" in direct reclaim > if it's contended. > Low prio container will sleep longer until memory contention is fixed. > > - weighted anon allocation > similar to above, if memory is contended, page fault speed should be weighted > based on priority(softlimit). > > - off cpu direct-reclaim > run direct recalim in workqueue with cpu mask. the cpu mask is a global setting > per numa node, which determines cpus available for being used to reclaim memory. > "How to wait" may affect the performance of system but this can allow masked cpus > to be used for more important jobs. That's what I call a bunch of heuristics. And actually I don't see how it'd help us against latency spikes caused by reclaimer runs, seems the set is still incomplete :-/ For example, there are two cgroups, one having a huge soft limit excess and full of anon memory and another not exceeding its soft limit but using primarily clean file caches. 
This prioritizing/weighting stuff would result in shrinking the first group first on global pressure, though it's way slower than shrinking the second one. That means a latency spike in other containers. The heuristics you proposed above will only make it non-critical - the system will get over sooner or later. However, it's still a kind of DOS, which anon+swap hard limit would prevent. Sorry, but I simply don't understand what would go wrong if we substituted the current memsw (anon+file+swap) with anon+swap limit. As I stated before it would be more flexible and logical: On Tue, Sep 09, 2014 at 02:39:43PM +0400, Vladimir Davydov wrote: > However, there's one thing, which made me start this discussion, and it > still bothers me. It's about memsw.limit_in_bytes knob itself. > > First, its value must be greater or equal to memory.limit_in_bytes. > IMO, such a dependency in the user interface isn't great, but it isn't > the worst thing. What is worse, there's only point in setting it to > infinity if one wants to fully make use of soft limits as I pointed out > earlier. > > So, we have a userspace knob that suits only for strict sand-boxing when > one wants to hard-limit the amount of memory and swap an app can use. > When it comes to soft limits, you have to set it to infinity, and it'll > still be accounted at the cost of performance, but without any purpose. > It just seems meaningless to me. > > Not counting that the knob itself is a kind of confusing IMO. memsw > means memory+swap, so one would mistakenly think memsw.limit-mem.limit > is the limit on swap usage, but that's wrong. > > My point is that anon+swap accounting instead of the current > anon+file+swap memsw implementation would be more flexible. We could > still sandbox apps by setting hard anon+swap and memory limits, but it > would also be possible to make use of it in "soft" environments. It > wouldn't be mandatory though. 
If one doesn't like OOM, he can use > threshold notifications to restart the container when it starts to > behave badly. But if the user just doesn't want to bother about > configuration or is OK with OOM-killer, he could set hard anon+swap > limit. Besides, it would untie mem.limit knob from memsw.limit, which > would make the user interface simpler and cleaner. > > So, I think anon+swap limit would be more flexible than file+anon+swap > limit we have now. Is there any use case where anon+swap and anon+file > accounting couldn't satisfy the user requirements while the > anon+file+swap and anon+file pair could? I would appreciate it if anybody could answer this. Thanks, Vladimir 
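The misconception Vladimir quotes above — reading memsw.limit - mem.limit as a swap limit — is easy to demonstrate with numbers. Under the v1 semantics the memsw counter charges memory plus swap together, so the swap headroom depends on current memory usage, not on the gap between the two limits (illustrative sketch, made-up values):

```python
def max_swap_usage(mem_limit, memsw_limit, mem_usage):
    """How much swap a cgroup can still consume under v1 memsw semantics:
    memsw counts memory *plus* swap, so the headroom for swap is
    memsw_limit - current memory usage, not memsw_limit - mem_limit."""
    assert mem_usage <= mem_limit <= memsw_limit
    return memsw_limit - mem_usage

GB = 1 << 30
# Naive reading: "swap limit" = 1.5G - 1G = 0.5G.  But with only 0.5G of
# memory resident, the cgroup can in fact consume a full 1G of swap:
assert max_swap_usage(1 * GB, 3 * GB // 2, GB // 2) == 1 * GB
```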
* Re: [RFC] memory cgroup: my thoughts on memsw 2014-09-11 8:23 ` Vladimir Davydov @ 2014-09-11 8:53 ` Kamezawa Hiroyuki [not found] ` <54116324.7000200-+CUm20s59erQFUHtdCDX3A@public.gmane.org> 0 siblings, 1 reply; 19+ messages in thread From: Kamezawa Hiroyuki @ 2014-09-11 8:53 UTC (permalink / raw) To: Vladimir Davydov Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML (2014/09/11 17:23), Vladimir Davydov wrote: > On Thu, Sep 11, 2014 at 11:04:41AM +0900, Kamezawa Hiroyuki wrote: >> (2014/09/09 19:39), Vladimir Davydov wrote: >> >>>> For your purpose, you need to implement your method in system-wide way. >>>> It seems crazy to set per-cgroup-anon-limit for avoding system-wide-oom. >>>> You'll need help of system-wide-cgroup-configuration-middleware even if >>>> you have a method in a cgroup. If you say logic should be in OS kernel, >>>> please implement it in a system wide logic rather than cgroup. >>> >>> What if on global pressure a memory cgroup exceeding its soft limit is >>> being reclaimed, but not fast enough, because it has a lot of anon >>> memory? The global OOM won't be triggered then, because there's still >>> progress, but the system will experience hard pressure due to the >>> reclaimer runs. How can we detect if we should kill the container or >>> not? It smells like one more heuristic to vmscan, IMO. >> >> >> That's you are trying to implement by per-cgroup-anon+swap-limit, the difference >> is heuristics by system designer at container creation or heuristics by kernel in >> the dynamic way. > > anon+swap limit isn't a heuristic, it's a configuration! > > The difference is that the user usually knows *minimal* requirements of > the app he's going to run in a container/VM. Basing on them, he buys a > container/VM with some predefined amount of RAM. 
From the whole system > POV it's suboptimal to set the hard limit for the container by the user > configuration, because there might be free memory, which could be used > for file caches and hence lower disk load. If we had anon+swap hard > limit, we could use it in conjunction with the soft limit instead of the > hard limit. That would be more efficient than VM-like sand-boxing though > still safe. > > When I'm talking about in-kernel heuristics, I mean a pile of > hard-to-read functions with a bunch of obscure constants. This is much > worse than providing the user with a convenient and flexible interface. > >> I said it should be done by system/cloud-container-scheduler based on notification. > > Basically, it's unsafe to hand this out to userspace completely. The > system would be prone to DOS attacks from inside containers then. > >> But okay, let me think of kernel help in global reclaim. >> >> - Assume "priority" is a value calculated by "usage - soft limit". >> >> - weighted kswapd/direct reclaim >> => Based on priority of each threads/cgroup, increase "wait" in direct reclaim >> if it's contended. >> Low prio container will sleep longer until memory contention is fixed. >> >> - weighted anon allocation >> similar to above, if memory is contended, page fault speed should be weighted >> based on priority(softlimit). >> >> - off cpu direct-reclaim >> run direct recalim in workqueue with cpu mask. the cpu mask is a global setting >> per numa node, which determines cpus available for being used to reclaim memory. >> "How to wait" may affect the performance of system but this can allow masked cpus >> to be used for more important jobs. > > That's what I call a bunch of heuristics. 
And actually I don't see how > it'd help us against latency spikes caused by reclaimer runs, seems the > set is still incomplete :-/ > > For example, there are two cgroups, one having a huge soft limit excess > and full of anon memory and another not exceeding its soft limit but > using primarily clean file caches. This prioritizing/weighting stuff > would result in shrinking the first group first on global pressure, > though it's way slower than shrinking the second one. The current implementation just round-robins all memcgs under the tree. With the re-designed soft limit, things will be changed; you can change it. > That means a latency spike in other containers. Why? You said the other container just contains file caches. A latency spike just because file caches drop? If the service is that naive, please use a hard limit. Hmm. How about raising kswapd's scheduling threshold in some situations? A per-memcg kswapd for helping the soft limit may work. > The heuristics you proposed above > will only make it non-critical - the system will get over sooner or > later. My idea is always based on there being a container manager on the system, which can make clever enough decisions based on an admin-specified policy. IIUC, reducing the CPU hogging caused by memory pressure is always helpful. > However, it's still a kind of DOS, which anon+swap hard limit would prevent. By the oom-killer. > On Tue, Sep 09, 2014 at 02:39:43PM +0400, Vladimir Davydov wrote: >> However, there's one thing, which made me start this discussion, and it >> still bothers me. It's about memsw.limit_in_bytes knob itself. >> >> First, its value must be greater or equal to memory.limit_in_bytes. >> IMO, such a dependency in the user interface isn't great, but it isn't >> the worst thing. What is worse, there's only point in setting it to >> infinity if one wants to fully make use of soft limits as I pointed out >> earlier. 
>>
>> So, we have a userspace knob that suits only strict sand-boxing, when one wants to hard-limit the amount of memory and swap an app can use. When it comes to soft limits, you have to set it to infinity, and it'll still be accounted at the cost of performance, but without any purpose. It just seems meaningless to me.
>>
>> Not counting that the knob itself is kind of confusing IMO. memsw means memory+swap, so one would mistakenly think memsw.limit-mem.limit is the limit on swap usage, but that's wrong.
>>
>> My point is that anon+swap accounting instead of the current anon+file+swap memsw implementation would be more flexible. We could still sandbox apps by setting hard anon+swap and memory limits, but it would also be possible to make use of it in "soft" environments. It wouldn't be mandatory though. If one doesn't like OOM, he can use threshold notifications to restart the container when it starts to behave badly. But if the user just doesn't want to bother about configuration or is OK with the OOM-killer, he could set a hard anon+swap limit. Besides, it would untie the mem.limit knob from memsw.limit, which would make the user interface simpler and cleaner.
>>
>> So, I think an anon+swap limit would be more flexible than the file+anon+swap limit we have now. Is there any use case where anon+swap and anon+file accounting couldn't satisfy the user requirements while the anon+file+swap and anon+file pair could?
>
> I would appreciate it if anybody could answer this.

I can't understand why you want to use the OOM killer for resource controlling.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/. Don't email: dont@kvack.org

^ permalink raw reply [flat|nested] 19+ messages in thread
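For reference, the knobs being argued about are the cgroup v1 memory controller files. A minimal sketch of how they fit together, assuming the v1 controller is mounted at /sys/fs/cgroup/memory (group name and byte values are illustrative):

```shell
# Create a container group.
mkdir /sys/fs/cgroup/memory/container0

# Hard limit on user memory (anon + file cache).
echo 1G > /sys/fs/cgroup/memory/container0/memory.limit_in_bytes

# memsw = memory + swap. The kernel rejects values below
# memory.limit_in_bytes -- the interface dependency complained about above.
echo 2G > /sys/fs/cgroup/memory/container0/memory.memsw.limit_in_bytes

# Soft limit, only enforced under global memory pressure.
echo 512M > /sys/fs/cgroup/memory/container0/memory.soft_limit_in_bytes
```

Note that there is no swap-only knob in this set, which is exactly the gap the anon+swap proposal is trying to fill.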
* Re: [RFC] memory cgroup: my thoughts on memsw
  [not found] ` <54116324.7000200-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2014-09-11 9:50 ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-11 9:50 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

On Thu, Sep 11, 2014 at 05:53:56PM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/11 17:23), Vladimir Davydov wrote:
>> For example, there are two cgroups, one having a huge soft limit excess and full of anon memory, and another not exceeding its soft limit but using primarily clean file caches. This prioritizing/weighting stuff would result in shrinking the first group first on global pressure, though it's way slower than shrinking the second one.
>
> The current implementation just round-robins all memcgs under the tree. With a re-designed soft limit, things will be changed; you can change it.
>
>> That means a latency spike in other containers.
>
> Why? You said the other container just contains file caches.

A container wants some memory (anon, file, whatever) under pressure. If the pressure is high, it falls into direct reclaim and starts shrinking the container with a lot of anon memory, which is going to be slow - here goes a latency spike.

> A latency spike just because file caches drop? If the service is that naive, please use a hard limit.

File caches are evicted much more easily than anon memory, simply because the latter is (almost) always dirty. However, file caches can still be a vital part of the working set. It all depends on the load. What's wrong with a web server that most of the time sends the same set of web pages to clients? The data it needs are stored on disk and mostly clean, but they're still its working set. Evicting them will lower the server's responsiveness, which will result in clients getting upset and no longer visiting the web site. Or do you suppose the web server must cache disk data in anon memory on its own? Why do we keep clean caches at all then?

> Hmm. How about raising kswapd's scheduling threshold in some situations? A per-memcg kswapd for helping the soft limit may work.

Instead of preventing the worst case you propose to prepare the after-treatment...

>> The heuristics you proposed above will only make it non-critical - the system will get over sooner or later.
>
> My idea is always based on there being a container manager on the system, which can make clever enough decisions based on a policy the admin specified. IIUC, reducing the cpu hog caused by memory pressure is always helpful.
>
>> However, it's still a kind of DOS, which an anon+swap hard limit would prevent.
>
> by oom-killer.

A *local* oom-killer inside the container behaving badly. This is way better than waiting until it puts the whole system under heavy pressure.

>> On Tue, Sep 09, 2014 at 02:39:43PM +0400, Vladimir Davydov wrote:
>>> However, there's one thing which made me start this discussion, and it still bothers me. It's about the memsw.limit_in_bytes knob itself.
>>>
>>> First, its value must be greater than or equal to memory.limit_in_bytes. IMO, such a dependency in the user interface isn't great, but it isn't the worst thing. What is worse, there's only a point in setting it to infinity if one wants to fully make use of soft limits, as I pointed out earlier.
>>>
>>> So, we have a userspace knob that suits only strict sand-boxing, when one wants to hard-limit the amount of memory and swap an app can use. When it comes to soft limits, you have to set it to infinity, and it'll still be accounted at the cost of performance, but without any purpose. It just seems meaningless to me.
>>>
>>> Not counting that the knob itself is kind of confusing IMO. memsw means memory+swap, so one would mistakenly think memsw.limit-mem.limit is the limit on swap usage, but that's wrong.
>>>
>>> My point is that anon+swap accounting instead of the current anon+file+swap memsw implementation would be more flexible. We could still sandbox apps by setting hard anon+swap and memory limits, but it would also be possible to make use of it in "soft" environments. It wouldn't be mandatory though. If one doesn't like OOM, he can use threshold notifications to restart the container when it starts to behave badly. But if the user just doesn't want to bother about configuration or is OK with the OOM-killer, he could set a hard anon+swap limit. Besides, it would untie the mem.limit knob from memsw.limit, which would make the user interface simpler and cleaner.
>>>
>>> So, I think an anon+swap limit would be more flexible than the file+anon+swap limit we have now. Is there any use case where anon+swap and anon+file accounting couldn't satisfy the user requirements while the anon+file+swap and anon+file pair could?
>>
>> I would appreciate it if anybody could answer this.
>
> I can't understand why you want to use the OOM killer for resource controlling.

Because there are situations when an app inside a container goes mad. There must be a reliable way to stop it. It's all about the compromise between safety (sand-boxing) and efficiency (soft limits). Currently we can't mix them. Soft limits are intrinsically unsafe, though they must be efficient, while hard limits guarantee safety at the cost of performance. An anon+swap limit would allow us to combine them to yield an efficient yet safe setup. Besides, the memsw limit eventually means OOM too - why is it better?

What I propose is to give the admin a choice. If he thinks the app is 100% safe, let him rely on userspace handling and in-kernel after-care. But if there's a possibility of a malicious and/or badly designed app, let him configure in-kernel OOM per container to prevent a disaster for sure. The latter is usually the case when you sell containers to third-party users.

Thanks,
Vladimir

^ permalink raw reply [flat|nested] 19+ messages in thread
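The anon+swap proposal argued for above can be reduced to a toy counter model (illustrative Python, not kernel code; the `Counter` class and its methods are invented for the sketch). The key property is that swapping a page out merely moves it between the anon and swap halves of the same counter, so reclaim cannot be used to dodge the limit, and exceeding it means a *local* OOM only:

```python
class Counter:
    """Toy resource counter; a failed charge stands in for reclaim/OOM."""
    def __init__(self, limit):
        self.limit, self.usage = limit, 0
    def try_charge(self, n=1):
        if self.usage + n > self.limit:
            return False
        self.usage += n
        return True
    def uncharge(self, n=1):
        self.usage -= n

# Anon and swap share one counter; file caches would live only in the
# plain memory counter (not modelled here).
anon_swap = Counter(limit=100)

# The container allocates 100 anon pages -- it is now at its limit.
assert all(anon_swap.try_charge() for _ in range(100))

# Global reclaim swaps 80 of those pages out. Each page merely moves
# from the "anon" to the "swap" side of the same counter, so usage is
# unchanged -- swapout cannot be used to dodge the limit.
assert anon_swap.usage == 100

# Any further anon allocation fails -> local OOM; the DOS is capped.
assert not anon_swap.try_charge()
```

This is exactly the "reliable way to stop it" property: no matter how the kernel shuffles the container's pages between RAM and swap, the container's total anon+swap footprint stays bounded.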
* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-08 13:53 ` Kamezawa Hiroyuki
  2014-09-09 10:39 ` Vladimir Davydov
@ 2014-09-10 12:01 ` Vladimir Davydov
  2014-09-11 1:22 ` Kamezawa Hiroyuki
  1 sibling, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-10 12:01 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/08 20:01), Vladimir Davydov wrote:
>> On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
>>> As you noticed, hitting the anon+swap limit just means oom-kill. My point is that using the oom-killer for "server management" just seems crazy.
>>>
>>> Let me clarify things. Your proposal was:
>>> 1. soft-limit will be a main feature for server management.
>>> 2. Because of soft-limit, global memory reclaim runs.
>>> 3. Using swap at global memory reclaim can cause poor performance.
>>> 4. So, making use of OOM-Killer for avoiding swap.
>>>
>>> I can't agree with "4". I think
>>>
>>> - don't configure swap.
>>
>> Suppose there are two containers, each having its soft limit set to 50% of total system RAM. One of the containers eats 90% of the system RAM by allocating anonymous pages. Another starts using file caches and wants more than 10% of RAM to work w/o issuing disk reads. So what should we do then? We won't be able to shrink the first container to its soft limit, because there's no swap. Leaving it as is would be unfair from the second container's point of view. Kill it? But the whole system is going OK, because the working set of the second container is easily shrinkable. Besides, there may be some progress in shrinking file caches from the first container.
>>
>>> - use zram
>>
>> In fact this isn't different from the previous proposal (working w/o swap). ZRAM only compresses data while still storing them in RAM, so we eventually may get into a situation where almost all RAM is full of compressed anon pages.
>
> In the above 2 cases, "vmpressure" works fine.

What if a container allocates memory so fast that the userspace thread handling its threshold notifications won't have time to react before it eats all memory?

Thanks,
Vladimir

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-10 12:01 ` Vladimir Davydov
@ 2014-09-11 1:22 ` Kamezawa Hiroyuki
  2014-09-11 7:03 ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-11 1:22 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

(2014/09/10 21:01), Vladimir Davydov wrote:
> On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote:
>> (2014/09/08 20:01), Vladimir Davydov wrote:
>>> On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
>>>> As you noticed, hitting the anon+swap limit just means oom-kill. My point is that using the oom-killer for "server management" just seems crazy.
>>>>
>>>> Let me clarify things. Your proposal was:
>>>> 1. soft-limit will be a main feature for server management.
>>>> 2. Because of soft-limit, global memory reclaim runs.
>>>> 3. Using swap at global memory reclaim can cause poor performance.
>>>> 4. So, making use of OOM-Killer for avoiding swap.
>>>>
>>>> I can't agree with "4". I think
>>>>
>>>> - don't configure swap.
>>>
>>> Suppose there are two containers, each having its soft limit set to 50% of total system RAM. One of the containers eats 90% of the system RAM by allocating anonymous pages. Another starts using file caches and wants more than 10% of RAM to work w/o issuing disk reads. So what should we do then? We won't be able to shrink the first container to its soft limit, because there's no swap. Leaving it as is would be unfair from the second container's point of view. Kill it? But the whole system is going OK, because the working set of the second container is easily shrinkable. Besides, there may be some progress in shrinking file caches from the first container.
>>>
>>>> - use zram
>>>
>>> In fact this isn't different from the previous proposal (working w/o swap). ZRAM only compresses data while still storing them in RAM, so we eventually may get into a situation where almost all RAM is full of compressed anon pages.
>>
>> In the above 2 cases, "vmpressure" works fine.
>
> What if a container allocates memory so fast that the userspace thread handling its threshold notifications won't have time to react before it eats all memory?

Softlimit is for avoiding such unfair memory scheduling, isn't it?

Thanks,
-Kame

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-11 1:22 ` Kamezawa Hiroyuki
@ 2014-09-11 7:03 ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-11 7:03 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

On Thu, Sep 11, 2014 at 10:22:51AM +0900, Kamezawa Hiroyuki wrote:
> (2014/09/10 21:01), Vladimir Davydov wrote:
>> On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote:
>>> (2014/09/08 20:01), Vladimir Davydov wrote:
>>>> On Sat, Sep 06, 2014 at 08:15:44AM +0900, Kamezawa Hiroyuki wrote:
>>>>> As you noticed, hitting the anon+swap limit just means oom-kill. My point is that using the oom-killer for "server management" just seems crazy.
>>>>>
>>>>> Let me clarify things. Your proposal was:
>>>>> 1. soft-limit will be a main feature for server management.
>>>>> 2. Because of soft-limit, global memory reclaim runs.
>>>>> 3. Using swap at global memory reclaim can cause poor performance.
>>>>> 4. So, making use of OOM-Killer for avoiding swap.
>>>>>
>>>>> I can't agree with "4". I think
>>>>>
>>>>> - don't configure swap.
>>>>
>>>> Suppose there are two containers, each having its soft limit set to 50% of total system RAM. One of the containers eats 90% of the system RAM by allocating anonymous pages. Another starts using file caches and wants more than 10% of RAM to work w/o issuing disk reads. So what should we do then? We won't be able to shrink the first container to its soft limit, because there's no swap. Leaving it as is would be unfair from the second container's point of view. Kill it? But the whole system is going OK, because the working set of the second container is easily shrinkable. Besides, there may be some progress in shrinking file caches from the first container.
>>>>
>>>>> - use zram
>>>>
>>>> In fact this isn't different from the previous proposal (working w/o swap). ZRAM only compresses data while still storing them in RAM, so we eventually may get into a situation where almost all RAM is full of compressed anon pages.
>>>
>>> In the above 2 cases, "vmpressure" works fine.
>>
>> What if a container allocates memory so fast that the userspace thread handling its threshold notifications won't have time to react before it eats all memory?
>
> Softlimit is for avoiding such unfair memory scheduling, isn't it?

Yeah, and we're returning back to the very beginning. Anonymous memory reclaim triggered by the soft limit may be impossible due to lack of swap space, or really sluggish. The whole system will be dragging its feet until it finally realizes the container must be killed. It's a kind of DOS attack...

Thanks,
Vladimir

^ permalink raw reply [flat|nested] 19+ messages in thread
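The notification race in this exchange can be made concrete with a toy model (illustrative Python with made-up numbers, not kernel behaviour): whether vmpressure/threshold notifications help depends entirely on the allocation rate versus the userspace handler's latency.

```python
def run(total_ram=1000, threshold=500, alloc_per_tick=200, handler_delay=3):
    """Return (tick, usage) when either the handler acts or RAM runs out."""
    usage, notified_at = 0, None
    for tick in range(1000):
        usage += alloc_per_tick                 # the container keeps allocating
        if notified_at is None and usage >= threshold:
            notified_at = tick                  # threshold notification fires
        if usage >= total_ram:
            return tick, usage                  # RAM exhausted first
        if notified_at is not None and tick - notified_at >= handler_delay:
            return tick, usage                  # userspace handler finally acts
    raise RuntimeError("unreachable with these parameters")

print(run())                   # fast allocator: (4, 1000) -- RAM is gone first
print(run(alloc_per_tick=50))  # slow allocator: (12, 650) -- the handler wins
```

With a fast enough allocator the notification still fires, but the system hits the wall before any userspace policy can react, which is why an in-kernel backstop (some hard limit) is being argued for.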
* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-04 14:30 [RFC] memory cgroup: my thoughts on memsw Vladimir Davydov
  2014-09-04 22:03 ` Kamezawa Hiroyuki
@ 2014-09-15 19:14 ` Johannes Weiner
  2014-09-16 1:34 ` Kamezawa Hiroyuki
  [not found] ` <20140915191435.GA8950-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  1 sibling, 2 replies; 19+ messages in thread
From: Johannes Weiner @ 2014-09-15 19:14 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Michal Hocko, Greg Thelen, Hugh Dickins, Kamezawa Hiroyuki, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

Hi Vladimir,

On Thu, Sep 04, 2014 at 06:30:55PM +0400, Vladimir Davydov wrote:
> To sum it up, the current mem + memsw configuration scheme doesn't allow us to limit swap usage if we want to partition the system dynamically using soft limits. Actually, it also looks rather confusing to me. We have a mem limit and a mem+swap limit. I bet that at first glance, an average admin will think it's possible to limit swap usage by setting the limits so that the difference between memory.memsw.limit and memory.limit equals the maximal swap usage, but (surprise!) it isn't really so. It holds if there's no global memory pressure, but otherwise swap usage is only limited by memory.memsw.limit! IMHO, it isn't something obvious.

Agreed, memory+swap accounting & limiting is broken.

> - Anon memory is handled by the user application, while file caches are all on the kernel side. That means the application will *definitely* die w/o anon memory. W/o file caches it usually can survive, but the more caches it has the better it feels.
>
> - Anon memory is not that easy to reclaim. Swap-out is a really slow process, because data are usually read/written w/o any specific order. Dropping file caches is much easier. Typically we have lots of clean pages there.
>
> - Swap space is limited. And today, it's OK to have TBs of RAM and only several GBs of swap. Customers simply don't want to waste their disk space on that.
>
> Finally, my understanding (may be crazy!) of how things should be configured. Just like now, there should be mem_cgroup->res accounting and limiting total user memory (cache+anon) usage for processes inside cgroups. This is where there's nothing to do. However, mem_cgroup->memsw should be reworked to account *only* memory that may be swapped out plus memory that has been swapped out (i.e. swap usage).

But anon pages are not a resource, they are a swap space liability. Think of virtual memory vs. physical pages - the use of one does not necessarily result in the use of the other. Without memory pressure, anonymous pages do not consume swap space.

What we *should* be accounting and limiting here is the actual finite resource: swap space. Whenever we try to swap a page, its owner should be charged for the swap space - or the swapout be rejected.

For hard limit reclaim, the semantics of a swap space limit would be fairly obvious, because it's clear who the offender is.

However, in an overcommitted machine, the amount of swap space used by a particular group depends just as much on the behavior of the other groups in the system, so the per-group swap limit should be enforced even during global reclaim to feed back pressure on whoever is causing the swapout. If reclaim fails, the global OOM killer triggers, which should then kill the group with the biggest soft limit excess.

As far as implementation goes, it should be doable to try-charge from add_to_swap() and keep the uncharging in swap_entry_free(). We'll also have to extend the global OOM killer to be memcg-aware, but we've been meaning to do that anyway.

^ permalink raw reply [flat|nested] 19+ messages in thread
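The "memsw.limit - mem.limit is not a swap limit" surprise quoted above can be checked with a toy two-counter model (illustrative Python, not kernel code; counter and function names are invented). In the v1 scheme, res counts anon+file pages in RAM, memsw counts anon+file+swap, charging a page charges both, and swapout uncharges only res:

```python
class Counter:
    def __init__(self, limit):
        self.limit, self.usage = limit, 0
    def charge(self, n):
        assert self.usage + n <= self.limit, "would trigger reclaim/OOM"
        self.usage += n
    def uncharge(self, n):
        self.usage -= n

res   = Counter(limit=100)   # memory.limit: anon + file pages in RAM
memsw = Counter(limit=150)   # memory.memsw.limit: anon + file + swap
swap_used = 0

def alloc(n):
    """Charging a page charges both counters."""
    memsw.charge(n)
    res.charge(n)

def swapout(n):
    """Swapout frees RAM (res), but the memsw charge stays."""
    global swap_used
    res.uncharge(n)
    swap_used += n

alloc(100)    # the container fills its RAM limit
swapout(80)   # global pressure swaps most of it out...
alloc(50)     # ...so it can allocate again, up to memsw.limit
swapout(50)

# Naive expectation: swap <= memsw.limit - mem.limit = 50 pages.
print(swap_used)  # 130 -- swap is really bounded only by memsw.limit
```

Under global pressure the charge/swapout cycle can repeat until swap usage approaches memsw.limit itself, which is the behaviour both Vladimir and Johannes call confusing and broken.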
* Re: [RFC] memory cgroup: my thoughts on memsw
  2014-09-15 19:14 ` Johannes Weiner
@ 2014-09-16 1:34 ` Kamezawa Hiroyuki
  [not found] ` <20140915191435.GA8950-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
  1 sibling, 0 replies; 19+ messages in thread
From: Kamezawa Hiroyuki @ 2014-09-16 1:34 UTC (permalink / raw)
To: Johannes Weiner, Vladimir Davydov
Cc: Michal Hocko, Greg Thelen, Hugh Dickins, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

(2014/09/16 4:14), Johannes Weiner wrote:
> Hi Vladimir,
>
> On Thu, Sep 04, 2014 at 06:30:55PM +0400, Vladimir Davydov wrote:
>> To sum it up, the current mem + memsw configuration scheme doesn't allow us to limit swap usage if we want to partition the system dynamically using soft limits. Actually, it also looks rather confusing to me. We have a mem limit and a mem+swap limit. I bet that at first glance, an average admin will think it's possible to limit swap usage by setting the limits so that the difference between memory.memsw.limit and memory.limit equals the maximal swap usage, but (surprise!) it isn't really so. It holds if there's no global memory pressure, but otherwise swap usage is only limited by memory.memsw.limit! IMHO, it isn't something obvious.
>
> Agreed, memory+swap accounting & limiting is broken.
>
>> - Anon memory is handled by the user application, while file caches are all on the kernel side. That means the application will *definitely* die w/o anon memory. W/o file caches it usually can survive, but the more caches it has the better it feels.
>>
>> - Anon memory is not that easy to reclaim. Swap-out is a really slow process, because data are usually read/written w/o any specific order. Dropping file caches is much easier. Typically we have lots of clean pages there.
>>
>> - Swap space is limited. And today, it's OK to have TBs of RAM and only several GBs of swap. Customers simply don't want to waste their disk space on that.
>>
>> Finally, my understanding (may be crazy!) of how things should be configured. Just like now, there should be mem_cgroup->res accounting and limiting total user memory (cache+anon) usage for processes inside cgroups. This is where there's nothing to do. However, mem_cgroup->memsw should be reworked to account *only* memory that may be swapped out plus memory that has been swapped out (i.e. swap usage).
>
> But anon pages are not a resource, they are a swap space liability. Think of virtual memory vs. physical pages - the use of one does not necessarily result in the use of the other. Without memory pressure, anonymous pages do not consume swap space.
>
> What we *should* be accounting and limiting here is the actual finite resource: swap space. Whenever we try to swap a page, its owner should be charged for the swap space - or the swapout be rejected.
>
> For hard limit reclaim, the semantics of a swap space limit would be fairly obvious, because it's clear who the offender is.
>
> However, in an overcommitted machine, the amount of swap space used by a particular group depends just as much on the behavior of the other groups in the system, so the per-group swap limit should be enforced even during global reclaim to feed back pressure on whoever is causing the swapout. If reclaim fails, the global OOM killer triggers, which should then kill the group with the biggest soft limit excess.
>
> As far as implementation goes, it should be doable to try-charge from add_to_swap() and keep the uncharging in swap_entry_free(). We'll also have to extend the global OOM killer to be memcg-aware, but we've been meaning to do that anyway.

When we introduced the memsw limitation, we tried to avoid affecting global memory reclaim. Then, we did the memory+swap limitation. Now, global memory reclaim is memcg-aware. So, I think a swap limitation rather than anon+swap may be a choice. The change will reduce res_counter accesses.

Hmm, it will be desirable to move anon pages to Unevictable if a memcg's swap slot is 0.

Anyway, I think the soft limit should be re-implemented first. It will be the starting point.

Thanks,
-Kame

^ permalink raw reply [flat|nested] 19+ messages in thread
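Johannes's alternative - charge for swap space itself at swapout time - reduces to a single extra counter consulted on the swapout path. A toy sketch (illustrative Python; the function names echo the kernel's add_to_swap()/swap_entry_free() but the model itself is invented):

```python
class Counter:
    def __init__(self, limit):
        self.limit, self.usage = limit, 0
    def try_charge(self, n=1):
        if self.usage + n > self.limit:
            return False
        self.usage += n
        return True
    def uncharge(self, n=1):
        self.usage -= n

swap = Counter(limit=50)  # per-cgroup swap-space limit, in pages

def add_to_swap(owner):
    """Swapout path: charge the owner for the swap slot, or reject the swapout."""
    return owner.try_charge()

def swap_entry_free(owner):
    """Swap slot freed (swap-in or exit): give the charge back."""
    owner.uncharge()

# Under pressure, reclaim tries to swap out 80 of the cgroup's anon pages:
swapped = sum(add_to_swap(swap) for _ in range(80))
print(swapped)  # 50 -- the remaining 30 swapouts are rejected
```

Unlike the memsw scheme, the charge here tracks the actual finite resource: anon pages that never hit swap cost nothing against this counter, and the limit holds even when the swapout is driven by global rather than per-cgroup reclaim.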
* Re: [RFC] memory cgroup: my thoughts on memsw
  [not found] ` <20140915191435.GA8950-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-09-17 15:59 ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-17 15:59 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, Greg Thelen, Hugh Dickins, Kamezawa Hiroyuki, Motohiro Kosaki, Glauber Costa, Tejun Heo, Andrew Morton, Pavel Emelianov, Konstantin Khorenko, LKML-MM, LKML-cgroups, LKML

Hi Johannes,

On Mon, Sep 15, 2014 at 03:14:35PM -0400, Johannes Weiner wrote:
>> Finally, my understanding (may be crazy!) of how things should be configured. Just like now, there should be mem_cgroup->res accounting and limiting total user memory (cache+anon) usage for processes inside cgroups. This is where there's nothing to do. However, mem_cgroup->memsw should be reworked to account *only* memory that may be swapped out plus memory that has been swapped out (i.e. swap usage).
>
> But anon pages are not a resource, they are a swap space liability. Think of virtual memory vs. physical pages - the use of one does not necessarily result in the use of the other. Without memory pressure, anonymous pages do not consume swap space.
>
> What we *should* be accounting and limiting here is the actual finite resource: swap space. Whenever we try to swap a page, its owner should be charged for the swap space - or the swapout be rejected.

I've been thinking quite a bit about the problem, and finally I believe you're right: a separate swap limit would be better than anon+swap. Provided we make the OOM-killer kill cgroups that exceed their soft limit and can't be reclaimed, it will solve the problem with soft limits I described above. Besides, compared to anon+swap, a swap limit would be more efficient (we only need to charge one res counter, not two) and more understandable to users (it's simple to set up a limit for both kinds of resources then, because they never mix). Finally, we could transfer user configuration from cgroup v1 to v2 easily: just set swap.limit equal to memsw.limit-mem.limit; it won't be exactly the same, but I bet nobody will notice any difference.

So, at least for now, I vote for moving from mem+swap to swap accounting.

Thanks,
Vladimir

^ permalink raw reply [flat|nested] 19+ messages in thread
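The v1-to-v2 configuration transfer suggested in the last message is simple arithmetic; a tiny sketch (illustrative function and names, not an actual migration tool):

```python
def v1_to_v2(mem_limit, memsw_limit):
    """Translate v1 (memory.limit, memory.memsw.limit) into a v2-style
    (memory.max, memory.swap.max) pair, per the suggestion above.
    An unlimited memsw knob maps to unlimited swap (returned as None)."""
    INF = float("inf")
    swap_max = None if memsw_limit == INF else memsw_limit - mem_limit
    return mem_limit, swap_max

print(v1_to_v2(100 << 20, 150 << 20))  # (104857600, 52428800)
```

As the message notes, the semantics don't match exactly - the v1 memsw limit caps the sum, while a separate swap limit caps swap alone - but the translated values preserve the admin's intent in the common case.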
Thread overview: 19+ messages
2014-09-04 14:30 [RFC] memory cgroup: my thoughts on memsw Vladimir Davydov
2014-09-04 22:03 ` Kamezawa Hiroyuki
[not found] ` <5408E1CD.3090004-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2014-09-05 8:28 ` Vladimir Davydov
2014-09-05 14:20 ` Kamezawa Hiroyuki
2014-09-05 16:00 ` Vladimir Davydov
2014-09-05 23:15 ` Kamezawa Hiroyuki
2014-09-08 11:01 ` Vladimir Davydov
2014-09-08 13:53 ` Kamezawa Hiroyuki
2014-09-09 10:39 ` Vladimir Davydov
2014-09-11 2:04 ` Kamezawa Hiroyuki
2014-09-11 8:23 ` Vladimir Davydov
2014-09-11 8:53 ` Kamezawa Hiroyuki
[not found] ` <54116324.7000200-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2014-09-11 9:50 ` Vladimir Davydov
2014-09-10 12:01 ` Vladimir Davydov
2014-09-11 1:22 ` Kamezawa Hiroyuki
2014-09-11 7:03 ` Vladimir Davydov
2014-09-15 19:14 ` Johannes Weiner
2014-09-16 1:34 ` Kamezawa Hiroyuki
[not found] ` <20140915191435.GA8950-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2014-09-17 15:59 ` Vladimir Davydov