Netdev List


* Re: [RFC] Fine-grained memory priorities and PI
From: Andi Kleen @ 2005-12-15 13:31 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: Andi Kleen, David S. Miller, sri, mpm, linux-kernel, netdev
In-Reply-To: <8FC3785F-01B3-4F9A-9E3C-89E90CB719B0@mac.com>

> Naturally this is all still in the vaporware stage, but I think that  
> if implemented the concept might at least improve the OOM/low-memory  
> situation considerably.  Starting to fail allocations for the cluster  
> programs (including their kernel allocations) well before failing  
> them for the swap-fallback tool would help the original poster, and I  
> imagine various tweaked priorities would make true OOM-deadlock far  
> less likely.

The problem is that deadlocks can happen even without anybody
running out of virtual memory.  The deadlocks GFP_CRITICAL 
was supposed to handle are deadlocks while swapping out data
because the swapping on some devices needs more memory by itself.
This happens long before anything is running into a true oom. 
It's just that the memory cleaning stage cannot make progress
anymore.

Your proposal isn't addressing this problem at all I think.

Handling true OOM is a quite different issue.

-Andi

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Arjan van de Ven @ 2005-12-15 13:07 UTC (permalink / raw)
  To: hadi
  Cc: James Courtier-Dutton, Mitchell Blank Jr, Jesper Juhl,
	Sridhar Samudrala, linux-kernel, netdev
In-Reply-To: <1134651635.5912.108.camel@localhost.localdomain>

On Thu, 2005-12-15 at 08:00 -0500, jamal wrote:
> On Thu, 2005-15-12 at 12:47 +0100, Arjan van de Ven wrote:
> > > 
> > > You are using the wrong hammer to crack your nut.
> > > You should instead approach your problem of why the ARP entry gets lost.
> > > For example, you could give critical priority to your TCP session, 
> > > but that still won't cure your ARP problem.
> > > I would suggest that the best way to cure your arp problem, is to 
> > > increase the time between arp cache refreshes.
> > 
> > or turn it around entirely: all traffic is considered important
> > unless... and have a bunch of non-critical sockets (like http requests)
> > be marked non-critical.
> 
> The big hole punched by DaveM is that of dependencies: a http tcp
> connection is tied to ICMP or the IPSEC example given; so you need a lot
> more intelligence than just what your app is knowledgeable about at its
> level. 

yeah well sort of. You're right of course, but that also doesn't mean
you can't give hints from the other side. Like "data for this socket is
NOT critically important". It gets tricky if you only do it for OOM stuff;
because then that one ACK packet could cause a LOT of memory to be
freed, and as such can be important for the system even if the socket
isn't.

^ permalink raw reply

* Re: [RFC] Fine-grained memory priorities and PI
From: Con Kolivas @ 2005-12-15 13:02 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: David S. Miller, sri, mpm, ak, linux-kernel, netdev
In-Reply-To: <8803F1D1-E647-45A3-B2A4-E3C95AAC11C6@mac.com>

On Thursday 15 December 2005 23:58, Kyle Moffett wrote:
> On Dec 15, 2005, at 07:45, Con Kolivas wrote:
> > I have some basic process-that-called the memory allocator link in
> > the -ck tree already which alters how aggressively memory is
> > reclaimed according to priority. It does not affect out of memory
> > management but that could be added to said algorithm; however I
> > don't see much point at the moment since oom is still an uncommon
> > condition but regular memory allocation is routine.
>
> My thought would be to generalize the two special cases of writeback
> of dirty pages or dropping of clean pages under memory pressure and
> OOM to be the same general case.  When you are trying to free up
> pages, it may be permissible to drop dirty mbox pages and kill the
> postfix process writing them in order to satisfy allocations for the
> mission-critical database server.  (Or maybe it's the other way
> around).  If a large chunk of the allocated pages have priorities and
> lossless/lossy free functions, then the kernel can be much more
> flexible and configurable about what to do when running low on RAM.

Indeed, the implementation I currently have is lightweight to say the least. 
I really didn't think bloating struct page was worth it since the memory cost 
would be prohibitive, although it would allow all sorts of priority effects and 
vm scheduling. That is, struct page could have an extra entry 
keeping track of the highest priority of the process that used it and use 
that to determine further eviction etc.
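
A minimal userspace sketch of the extra per-page entry described above. It
is an illustration only: struct page_info, page_touch() and pick_victim()
are hypothetical names, not code from the -ck tree or the kernel, and it
assumes a larger number means a more important task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-page bookkeeping; not the kernel's struct page. */
struct page_info {
    bool in_use;
    int  max_prio;   /* highest priority of any task that touched the page */
};

/* Record that a task with priority `prio` used the page. */
void page_touch(struct page_info *pg, int prio)
{
    assert(pg != NULL);
    if (!pg->in_use || prio > pg->max_prio)
        pg->max_prio = prio;
    pg->in_use = true;
}

/* Return the index of the in-use page with the lowest recorded priority,
 * or -1 if no page is in use. */
long pick_victim(const struct page_info *pages, size_t n)
{
    long victim = -1;

    for (size_t i = 0; i < n; i++) {
        if (!pages[i].in_use)
            continue;
        if (victim < 0 || pages[i].max_prio < pages[victim].max_prio)
            victim = (long)i;
    }
    return victim;
}
```

The cost mentioned above shows up directly here: one extra int per page,
multiplied by every page in the system.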

Cheers,
Con

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: jamal @ 2005-12-15 13:00 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: James Courtier-Dutton, Mitchell Blank Jr, Jesper Juhl,
	Sridhar Samudrala, linux-kernel, netdev
In-Reply-To: <1134647248.16486.37.camel@laptopd505.fenrus.org>

On Thu, 2005-15-12 at 12:47 +0100, Arjan van de Ven wrote:
> > 
> > You are using the wrong hammer to crack your nut.
> > You should instead approach your problem of why the ARP entry gets lost.
> > For example, you could give critical priority to your TCP session, 
> > but that still won't cure your ARP problem.
> > I would suggest that the best way to cure your arp problem, is to 
> > increase the time between arp cache refreshes.
> 
> or turn it around entirely: all traffic is considered important
> unless... and have a bunch of non-critical sockets (like http requests)
> be marked non-critical.

The big hole punched by DaveM is that of dependencies: a http tcp
connection is tied to ICMP or the IPSEC example given; so you need a lot
more intelligence than just what your app is knowledgeable about at its
level. 
You can't really do this shit at the socket level. You need to do it much
earlier.
At runtime, when lower memory thresholds get crossed, you kick in
classification of what packets need to be dropped using something along
the lines of stateful/connection tracking. When things get better you
undo.
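
A rough sketch of that threshold-triggered behaviour, in userspace C. The
names (struct drop_policy, policy_update, should_drop) are made up for
illustration. The second, higher threshold used for the "undo" step is an
assumption added so the policy does not flap. The flow_is_critical flag
stands in for whatever the connection-tracking lookup would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct drop_policy {
    size_t low_mark;   /* start shedding when free pages fall below this */
    size_t high_mark;  /* stop shedding once free pages recover to this */
    bool   shedding;
};

/* Re-evaluate the policy against the current amount of free memory. */
void policy_update(struct drop_policy *p, size_t free_pages)
{
    assert(p->low_mark <= p->high_mark);
    if (free_pages < p->low_mark)
        p->shedding = true;
    else if (free_pages >= p->high_mark)
        p->shedding = false;
    /* Between the two marks the previous state is kept. */
}

/* Decide early, before any socket lookup, whether to drop a packet. */
bool should_drop(const struct drop_policy *p, bool flow_is_critical)
{
    return p->shedding && !flow_is_critical;
}
```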

cheers,
jamal

^ permalink raw reply

* Re: [RFC] Fine-grained memory priorities and PI
From: Kyle Moffett @ 2005-12-15 12:58 UTC (permalink / raw)
  To: Con Kolivas; +Cc: David S. Miller, sri, mpm, ak, linux-kernel, netdev
In-Reply-To: <200512152345.25375.kernel@kolivas.org>

On Dec 15, 2005, at 07:45, Con Kolivas wrote:
> I have some basic process-that-called the memory allocator link in  
> the -ck tree already which alters how aggressively memory is  
> reclaimed according to priority. It does not affect out of memory  
> management but that could be added to said algorithm; however I  
> don't see much point at the moment since oom is still an uncommon  
> condition but regular memory allocation is routine.

My thought would be to generalize the two special cases of writeback  
of dirty pages or dropping of clean pages under memory pressure and  
OOM to be the same general case.  When you are trying to free up  
pages, it may be permissible to drop dirty mbox pages and kill the  
postfix process writing them in order to satisfy allocations for the  
mission-critical database server.  (Or maybe it's the other way  
around).  If a large chunk of the allocated pages have priorities and  
lossless/lossy free functions, then the kernel can be much more  
flexible and configurable about what to do when running low on RAM.

Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw  
knives at people who weren't supposed to be in your machine room.
   -- Anthony de Boer

^ permalink raw reply

* Re: [RFC] Fine-grained memory priorities and PI
From: Kyle Moffett @ 2005-12-15 12:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, sri, mpm, linux-kernel, netdev
In-Reply-To: <20051215090401.GV23384@wotan.suse.de>

On Dec 15, 2005, at 04:04, Andi Kleen wrote:
>> When processes request memory through any subsystem, their memory  
>> priority would be passed through the kernel layers to the  
>> allocator, along with any associated information about how to free  
>> the memory in a low-memory condition.  As a result, I could  
>> configure my database to have a much higher priority than  
>> SETI@home (or boinc or whatever), so that when the database server  
>> wants to fill memory with clean DB cache pages, the kernel will  
>> kill SETI@home for its memory, even if we could just leave some  
>> DB cache pages unfaulted.
>
> Iirc most of the freeing happens in process context anyways, so  
> process priority information is already available. At least for CPU  
> cost it might even be taken into account during schedules (Freeing  
> can take up quite a lot of CPU time)
>
> The problem with GFP_ATOMIC is though that someone else needs to  
> free the memory in advance for you because you cannot do it yourself.
>
> (you could call it a kind of "parasite" in the normally very  
> cooperative society of memory allocators ...)
>
> That would mess up your scheme too. The priority cannot be  
> expressed because it's more a case of
> "somewhen someone in the future might need it"

Well, that's currently expressed as a reserved pool with watermarks,  
so with a PI system you would have a single pool with some collection  
of reservation watermarks with various priorities.  I'm not sure what  
the best data-structure would be, probably some sort of ordered  
priority tree.  When allocating or freeing memory, the code would  
check the watermark data (which has some summary statistics so you  
don't need to check the whole tree each time); if any of the  
watermarks are too low with relative priority taken into account, you  
fail the allocation or move pages into the pool.
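
A small userspace sketch of the single-pool idea, under stated assumptions:
the types and function names are hypothetical, a flat array replaces the
ordered priority tree, and the summary statistics are replaced by a plain
scan. A request at a given priority may not dig into pages reserved for
strictly higher priorities.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: one reservation watermark per priority level. */
struct watermark {
    int    priority;  /* higher value = more important */
    size_t reserve;   /* pages held back for this priority */
};

struct pool {
    size_t free_pages;
    const struct watermark *marks;
    size_t nmarks;
};

/* Pages a request at `priority` must leave untouched: the sum of all
 * reserves belonging to strictly higher priorities. */
size_t reserved_above(const struct pool *p, int priority)
{
    size_t sum = 0;

    for (size_t i = 0; i < p->nmarks; i++)
        if (p->marks[i].priority > priority)
            sum += p->marks[i].reserve;
    return sum;
}

/* Fail the allocation if it would cut into a higher-priority reserve. */
bool pool_alloc(struct pool *p, int priority, size_t pages)
{
    assert(p != NULL);
    size_t floor = reserved_above(p, priority);

    if (p->free_pages < floor || p->free_pages - floor < pages)
        return false;
    p->free_pages -= pages;
    return true;
}
```

With 100 free pages and reserves of 10 pages at priority 1 and 20 pages at
priority 5, a priority-0 request can take at most 70 pages.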

>> Questions? Comments? "This is a terrible idea that should never  
>> have seen the light of day"? Both constructive and destructive  
>> criticism welcomed! (Just please keep the language clean! :-D)
>
> This won't help for this problem here - even with perfect  
> priorities you could still get into situations where you can't make  
> any progress if progress needs more memory.

Well the point would be that the priorities could force a more-extreme  
and selective OOM (maybe even dropping dirty pages for  
noncritical filesystems if necessary!), or handle the situation  
described with the IPSec daemon and IPSec network traffic (IPSec  
would inherit the increased memory priority, and when it tries to do  
networking, its send path and the global receive path would inherit  
that increased priority as well).

Naturally this is all still in the vaporware stage, but I think that  
if implemented the concept might at least improve the OOM/low-memory  
situation considerably.  Starting to fail allocations for the cluster  
programs (including their kernel allocations) well before failing  
them for the swap-fallback tool would help the original poster, and I  
imagine various tweaked priorities would make true OOM-deadlock far  
less likely.

Cheers,
Kyle Moffett

--
When you go into court you either want a very, very, very bright line  
or you want the stomach to outlast the other guy in trench warfare.   
If both sides are reasonable, you try to stay _out_ of court in the  
first place.
   -- Rob Landley

^ permalink raw reply

* Re: [RFC] Fine-grained memory priorities and PI
From: Con Kolivas @ 2005-12-15 12:45 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: David S. Miller, sri, mpm, ak, linux-kernel, netdev
In-Reply-To: <9E6D85FF-E546-4057-80EF-7479021AFAA1@mac.com>

On Thursday 15 December 2005 19:55, Kyle Moffett wrote:
> On Dec 15, 2005, at 03:21, David S. Miller wrote:
> > Not when we run out, but rather when we reach some low water mark,
> > the "critical sockets" would still use GFP_ATOMIC memory but only
> > "critical sockets" would be allowed to do so.
> >
> > But even this has faults, consider the IPSEC scenario I mentioned,
> > and this applies to any kind of encapsulation actually, even simple
> > tunneling examples can be concocted which make the "critical
> > socket" idea fail.
> >
> > The knee jerk reaction is "mark IPSEC's sockets critical, and mark
> > the tunneling allocations critical, and... and..."  well you have
> > GFP_ATOMIC then my friend.
> >
> > In short, these "separate page pool" and "critical socket" ideas do
> > not work and we need a different solution, I'm sorry folks spent so
> > much time on them, but they are heavily flawed.
>
> What we really need in the kernel is a more fine-grained memory
> priority system with PI, similar in concept to what's being done to
> the scheduler in some of the RT patchsets.  Currently we have a very
> black-and-white memory subsystem; when we go OOM, we just start
> killing processes until we are no longer OOM.  Perhaps we should have
> some way to pass memory allocation priorities throughout the kernel,
> including a "this request has X priority", "this request will help
> free up X pages of RAM", and "drop while dirty under certain OOM to
> free X memory using this method".
>
> The initial benefit would be that OOM handling would become more
> reliable and less of a special case.  When we start to run low on
> free pages, it might be OK to kill the SETI@home process long before
> we OOM if such action might prevent the OOM.  Likewise, you might be
> able to flag certain file pages as being "less critical", such that
> the kernel can kill a process and drop its dirty pages for files in
> /tmp.  Or the kernel might do a variety of other things just by
> failing new allocations with low priority and forcing existing
> allocations with low priority to go away using preregistered handlers.
>
> When processes request memory through any subsystem, their memory
> priority would be passed through the kernel layers to the allocator,
> along with any associated information about how to free the memory in
> a low-memory condition.  As a result, I could configure my database
> to have a much higher priority than SETI@home (or boinc or whatever),
> so that when the database server wants to fill memory with clean DB
> cache pages, the kernel will kill SETI@home for its memory, even if
> we could just leave some DB cache pages unfaulted.
>
> Questions? Comments? "This is a terrible idea that should never have
> seen the light of day"? Both constructive and destructive criticism
> welcomed! (Just please keep the language clean! :-D)

I have some basic process-that-called the memory allocator link in the -ck 
tree already which alters how aggressively memory is reclaimed according to 
priority. It does not affect out of memory management but that could be added 
to said algorithm; however I don't see much point at the moment since oom is 
still an uncommon condition but regular memory allocation is routine.

Cheers,
Con

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Arjan van de Ven @ 2005-12-15 11:47 UTC (permalink / raw)
  To: James Courtier-Dutton
  Cc: Mitchell Blank Jr, Jesper Juhl, Sridhar Samudrala, linux-kernel,
	netdev
In-Reply-To: <43A155AE.4050105@superbug.co.uk>


> 
> You are using the wrong hammer to crack your nut.
> You should instead approach your problem of why the ARP entry gets lost.
> For example, you could give critical priority to your TCP session, 
> but that still won't cure your ARP problem.
> I would suggest that the best way to cure your arp problem, is to 
> increase the time between arp cache refreshes.

or turn it around entirely: all traffic is considered important
unless... and have a bunch of non-critical sockets (like http requests)
be marked non-critical.

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: James Courtier-Dutton @ 2005-12-15 11:38 UTC (permalink / raw)
  To: Mitchell Blank Jr; +Cc: Jesper Juhl, Sridhar Samudrala, linux-kernel, netdev
In-Reply-To: <20051215015456.GC23393@gaz.sfgoth.com>

Mitchell Blank Jr wrote:
> James Courtier-Dutton wrote:
> 
>>When I had the conversation with Matt at KS, the problem we were trying 
>>to solve was "Memory pressure with network attached swap space".
> 
> 
> s/swap space/writable filesystems/
> 
> You can hit these problems even if you have no swap.  Too much of the
> memory becomes filled with dirty pages needing writeback -- then you lose
> your NFS server's ARP entry at the wrong moment.  If you have a local disk
> to swap to the machine will recover after a little bit of grinding, otherwise
> it's all pretty much over.
> 
> The big problem is that as long as there's network I/O coming in it's
> likely that pages you free (as the VM gets more and more desperate about
> dropping the few remaining non-dirty pages) will get used for sockets
> that AREN'T helping you recover RAM.  You really need to be able to tell
> the whole network stack "we're in really rough shape here; ignore all RX
> work unless it's going to help me get write ACKs back from my {NFS,iSCSI}
> server"  My understanding is that is what this patchset is trying to
> accomplish.
> 
> -Mitch
> 
> 

You are using the wrong hammer to crack your nut.
You should instead approach your problem of why the ARP entry gets lost.
For example, you could give critical priority to your TCP session, 
but that still won't cure your ARP problem.
I would suggest that the best way to cure your arp problem, is to 
increase the time between arp cache refreshes.

James

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: David Stevens @ 2005-12-15  9:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, linux-kernel, mpm, netdev, netdev-owner, shemminger, sri
In-Reply-To: <20051215.005805.114145703.davem@davemloft.net>

"David S. Miller" <davem@davemloft.net> wrote on 12/15/2005 12:58:05 AM:

> From: David Stevens <dlstevens@us.ibm.com>
> Date: Thu, 15 Dec 2005 00:44:52 -0800
> 
> > In our internal discussions
> 
> I really wish this hadn't been discussed internally before being
> implemented.  Any such internal discussions are lost completely upon
> the community that ends up reviewing a core and invasive patch
> such as this one.

I think those were more informal and less extensive than the
impression I gave you. I mean simply bouncing around incomplete
ideas and discussing some of the potential issues before coming
up with a prototype solution, which is intended to be the starting
point for community discussions (and the KS discussions, too). "OOM"
came up immediately (even when naming the problem), and it isn't how
I ever saw it.

The patches, of course, are intended to NOT be invasive, or any
more than they need to be, and they are not "the" solution, but
"a" solution. A completely different one that solves the problem
is just as good to me.

                                                        +-DLS

^ permalink raw reply

* Re: [RFC] Fine-grained memory priorities and PI
From: Andi Kleen @ 2005-12-15  9:04 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: David S. Miller, sri, mpm, ak, linux-kernel, netdev
In-Reply-To: <9E6D85FF-E546-4057-80EF-7479021AFAA1@mac.com>

> When processes request memory through any subsystem, their memory  
> priority would be passed through the kernel layers to the allocator,  
> along with any associated information about how to free the memory in  
> a low-memory condition.  As a result, I could configure my database  
> to have a much higher priority than SETI@home (or boinc or whatever),  
> so that when the database server wants to fill memory with clean DB  
> cache pages, the kernel will kill SETI@home for its memory, even if  
> we could just leave some DB cache pages unfaulted.

Iirc most of the freeing happens in process context anyways,
so process priority information is already available. At least
for CPU cost it might even be taken into account during schedules
(Freeing can take up quite a lot of CPU time)

The problem with GFP_ATOMIC is though that someone else needs
to free the memory in advance for you because you cannot
do it yourself. 

(you could call it a kind of "parasite" in the normally
very cooperative society of memory allocators ...) 

That would mess up your scheme too. The priority 
cannot be expressed because it's more a case of 
"somewhen someone in the future might need it" 

> 
> Questions? Comments? "This is a terrible idea that should never have  
> seen the light of day"? Both constructive and destructive criticism  
> welcomed! (Just please keep the language clean! :-D)

This won't help for this problem here - even with perfect
priorities you could still get into situations where you
can't make any progress if progress needs more memory.

Only preallocating or prereservation can help you out of 
that trap.
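
As a hedged illustration of prereservation, here is a userspace sketch in
which a few buffers are set aside while memory is still plentiful and are
handed out only when the ordinary allocator fails. All names are invented
for the example. The kernel's mempool interface follows a similar pattern,
but this is not its code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_SLOTS 4

struct reserve {
    void  *slots[RESERVE_SLOTS];
    size_t count;      /* buffers currently held in reserve */
    size_t elem_size;
};

/* Fill the reserve up front. Returns 0 on success, -1 on failure. */
int reserve_init(struct reserve *r, size_t elem_size)
{
    r->count = 0;
    r->elem_size = elem_size;
    while (r->count < RESERVE_SLOTS) {
        void *p = malloc(elem_size);
        if (!p) {
            while (r->count)
                free(r->slots[--r->count]);
            return -1;
        }
        r->slots[r->count++] = p;
    }
    return 0;
}

/* Try the ordinary allocator first; dip into the reserve only if it fails. */
void *reserve_alloc(struct reserve *r, void *(*try_alloc)(size_t))
{
    void *p = try_alloc(r->elem_size);

    if (p)
        return p;
    return r->count ? r->slots[--r->count] : NULL;
}

/* Refill the reserve before giving memory back to the ordinary allocator. */
void reserve_free(struct reserve *r, void *p)
{
    assert(p != NULL);
    if (r->count < RESERVE_SLOTS)
        r->slots[r->count++] = p;
    else
        free(p);
}

/* Stand-in for an allocator under memory pressure: always fails. */
void *always_fail(size_t n)
{
    (void)n;
    return NULL;
}
```

The reserve guarantees forward progress only for as many requests as it
has slots, so it has to be sized for the worst case of the path it
protects.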

-Andi

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: David S. Miller @ 2005-12-15  8:58 UTC (permalink / raw)
  To: dlstevens; +Cc: shemminger, ak, linux-kernel, mpm, netdev, netdev-owner, sri
In-Reply-To: <OFB8B21C56.4F9E9A3C-ON882570D8.002CBD7B-882570D8.002FF8B1@us.ibm.com>

From: David Stevens <dlstevens@us.ibm.com>
Date: Thu, 15 Dec 2005 00:44:52 -0800

> In our internal discussions

I really wish this hadn't been discussed internally before being
implemented.  Any such internal discussions are lost completely upon
the community that ends up reviewing a core and invasive patch
such as this one.

>         The critical socket(s) simply have to be out of the zero-sum game
> for the rest of the allocations, because those are the (only) path to
> getting a working swap device again.

The core fault of the critical socket idea is that it is painfully
simple to create a tree of dependent allocations that makes the
critical pool useless.  IPSEC and tunnels are simple examples.

The idea to mark, for example, IPSEC key management daemon's sockets
as critical is flawed, because the key management daemon could hit a
swap page over the iSCSI device.  Don't even start with the idea to
lock the IPSEC key management daemon into ram with mlock().

Tunnels are similar, and realistic nesting cases can be shown that
make sizing via a special pool simply unfeasible, and what's more
there are no sockets involved.

Sockets do not exist in an allocation vacuum, they need to talk over
routes, and there are therefore many types of auxiliary data
associated with sending a packet besides the packet itself.  All you
need is a routing change of some type and you're going to start
burning GFP_ATOMIC allocations on the next packet send.

I think making GFP_ATOMIC better would be wise.  Alan's ideas harking
back to the old 2.0.x/2.2.x NFS days could use some consideration as well.

^ permalink raw reply

* [RFC] Fine-grained memory priorities and PI
From: Kyle Moffett @ 2005-12-15  8:55 UTC (permalink / raw)
  To: David S. Miller; +Cc: sri, mpm, ak, linux-kernel, netdev
In-Reply-To: <20051215.002120.133621586.davem@davemloft.net>

On Dec 15, 2005, at 03:21, David S. Miller wrote:
> Not when we run out, but rather when we reach some low water mark,  
> the "critical sockets" would still use GFP_ATOMIC memory but only  
> "critical sockets" would be allowed to do so.
>
> But even this has faults, consider the IPSEC scenario I mentioned,  
> and this applies to any kind of encapsulation actually, even simple  
> tunneling examples can be concocted which make the "critical  
> socket" idea fail.
>
> The knee jerk reaction is "mark IPSEC's sockets critical, and mark  
> the tunneling allocations critical, and... and..."  well you have  
> GFP_ATOMIC then my friend.
>
> In short, these "separate page pool" and "critical socket" ideas do  
> not work and we need a different solution, I'm sorry folks spent so  
> much time on them, but they are heavily flawed.

What we really need in the kernel is a more fine-grained memory  
priority system with PI, similar in concept to what's being done to  
the scheduler in some of the RT patchsets.  Currently we have a very  
black-and-white memory subsystem; when we go OOM, we just start  
killing processes until we are no longer OOM.  Perhaps we should have  
some way to pass memory allocation priorities throughout the kernel,  
including a "this request has X priority", "this request will help  
free up X pages of RAM", and "drop while dirty under certain OOM to  
free X memory using this method".

The initial benefit would be that OOM handling would become more  
reliable and less of a special case.  When we start to run low on  
free pages, it might be OK to kill the SETI@home process long before  
we OOM if such action might prevent the OOM.  Likewise, you might be  
able to flag certain file pages as being "less critical", such that  
the kernel can kill a process and drop its dirty pages for files in  
/tmp.  Or the kernel might do a variety of other things just by  
failing new allocations with low priority and forcing existing  
allocations with low priority to go away using preregistered handlers.

When processes request memory through any subsystem, their memory  
priority would be passed through the kernel layers to the allocator,  
along with any associated information about how to free the memory in  
a low-memory condition.  As a result, I could configure my database  
to have a much higher priority than SETI@home (or boinc or whatever),  
so that when the database server wants to fill memory with clean DB  
cache pages, the kernel will kill SETI@home for its memory, even if  
we could just leave some DB cache pages unfaulted.

Questions? Comments? "This is a terrible idea that should never have  
seen the light of day"? Both constructive and destructive criticism  
welcomed! (Just please keep the language clean! :-D)

Cheers,
Kyle Moffett

--
Q: Why do programmers confuse Halloween and Christmas?
A: Because OCT 31 == DEC 25.

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: David Stevens @ 2005-12-15  8:44 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: ak, David S. Miller, linux-kernel, mpm, netdev, netdev-owner, sri
In-Reply-To: <20051214215613.70f9cafa@localhost.localdomain>

> Also, all this stuff is just a band aid because linux OOM behavior is so
> fucked up.

In our internal discussions, characterizing this as "OOM" came
up a lot, and I don't think of it as that at all. OOM is exactly what the
scheme is trying to avoid!

The actual situation we have in mind is a swap device management system
in a cluster where a remote system tells you (via socket communication to
a user-land management app) that a swap device is going to fail over and
it'd be a good idea not to do anything that requires paging out or
swapping for a short period of time. The socket communication must work,
but the system is not at all out of memory, and the important point is
that it never will be if you limit allocations to those things that are
required for the critical socket to work (and nothing/little else).
        Receiver side allocations are unavoidable, because you don't know
if you can drop the packet or not until you look at it. Some infrastructure
must work. But everything else can fail or succeed based on ordinary churn
in ordinary memory pools, until the "in_emergency" condition has passed.
        The critical socket(s) simply have to be out of the zero-sum game
for the rest of the allocations, because those are the (only) path to
getting a working swap device again.

If you're out of memory without a network mechanism to get you more,
this doesn't do anything for you (and it isn't intended to). And if you
mark any socket that isn't going to get you failed over or otherwise
get you more swap, it isn't going to help you, either. It isn't a priority
scheme for low-memory, it's a failover mechanism that relies on networking.
There are exactly 2 priorities: critical (as in "you might as well crash if
these aren't satisfied") and everything else.
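
A minimal sketch of that two-priority scheme, assuming a separate reserve
for critical sockets. It is not taken from the patches under discussion:
struct mem_state and emergency_alloc() are hypothetical, and only the
in_emergency name comes from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mem_state {
    size_t normal_free;    /* ordinary pool, shared by everyone */
    size_t critical_free;  /* reserve set aside for critical sockets */
    bool   in_emergency;
};

/* Everyone competes for the ordinary pool. Only critical requests made
 * during an emergency may fall back to the reserve. */
bool emergency_alloc(struct mem_state *m, bool critical, size_t pages)
{
    assert(m != NULL);
    if (m->normal_free >= pages) {
        m->normal_free -= pages;
        return true;
    }
    if (m->in_emergency && critical && m->critical_free >= pages) {
        m->critical_free -= pages;
        return true;
    }
    return false;
}
```

Keeping the reserve outside the ordinary pool is what takes critical
sockets out of the zero-sum game: non-critical traffic can exhaust
normal_free without touching critical_free.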

Doing other, more general things that handle low memory, or OOM, or identified
priorities are great, but the problem we're interested in solving here is
really just about making socket communication work when the alternative is
a completely dead system. I think these patches do that in a reasonable way.
A better solution would be great, too, if there is one. :-)

                                                        +-DLS

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Arjan van de Ven @ 2005-12-15  8:35 UTC (permalink / raw)
  To: David S. Miller; +Cc: sri, mpm, ak, linux-kernel, netdev
In-Reply-To: <20051215.002120.133621586.davem@davemloft.net>

On Thu, 2005-12-15 at 00:21 -0800, David S. Miller wrote:
> From: Sridhar Samudrala <sri@us.ibm.com>
> Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)
> 
> > Instead, you seem to be suggesting in_emergency to be set dynamically
> > when we are about to run out of ATOMIC memory. Is this right?
> 
> Not when we run out, but rather when we reach some low water mark, the
> "critical sockets" would still use GFP_ATOMIC memory but only
> "critical sockets" would be allowed to do so.
> 
> But even this has faults, consider the IPSEC scenario I mentioned, and
> this applies to any kind of encapsulation actually, even simple
> tunneling examples can be concocted which make the "critical socket"
> idea fail.
> 
> The knee jerk reaction is "mark IPSEC's sockets critical, and mark the
> tunneling allocations critical, and... and..."  well you have
> GFP_ATOMIC then my friend.
> 
> In short, these "separate page pool" and "critical socket" ideas do
> not work and we need a different solution, I'm sorry folks spent so
> much time on them, but they are heavily flawed.

maybe it should be approached from the other side; having a way to mark
connections as low priority (say incoming http connections to your
webserver) or as non-critical/expendable would give the "normal"
GFP_ATOMIC ones a better chance in case of overload/DDOS etc. It's not
going to solve the VM deadlock issue wrt iscsi/nfs; however it might be
useful in the "survive slashdot" sense...

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: David S. Miller @ 2005-12-15  8:21 UTC (permalink / raw)
  To: sri; +Cc: mpm, ak, linux-kernel, netdev
In-Reply-To: <Pine.LNX.4.58.0512142318410.7197@w-sridhar.beaverton.ibm.com>

From: Sridhar Samudrala <sri@us.ibm.com>
Date: Wed, 14 Dec 2005 23:37:37 -0800 (PST)

> Instead, you seem to be suggesting in_emergency to be set dynamically
> when we are about to run out of ATOMIC memory. Is this right?

Not when we run out, but rather when we reach some low water mark, the
"critical sockets" would still use GFP_ATOMIC memory but only
"critical sockets" would be allowed to do so.

But even this has faults, consider the IPSEC scenario I mentioned, and
this applies to any kind of encapsulation actually, even simple
tunneling examples can be concocted which make the "critical socket"
idea fail.
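
For concreteness, the low-water-mark gating described above can be
modelled in userspace roughly as follows. This is only a sketch under
stated assumptions: struct atomic_pool, struct model_sock and
pool_alloc_page() are hypothetical names, not kernel interfaces, and
the model does nothing about the encapsulation problem just mentioned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names exist in the kernel. */
struct atomic_pool {
    size_t free_pages;     /* pages currently available to atomic callers */
    size_t low_watermark;  /* at or below this, only critical sockets allocate */
};

struct model_sock {
    bool critical;         /* hypothetical per-socket "critical" mark */
};

/* Returns true and consumes one page if the allocation is permitted. */
static bool pool_alloc_page(struct atomic_pool *pool, const struct model_sock *sk)
{
    assert(pool != NULL && sk != NULL);

    if (pool->free_pages == 0)
        return false;      /* truly out of memory, nobody gets a page */
    if (pool->free_pages <= pool->low_watermark && !sk->critical)
        return false;      /* emergency: non-critical traffic is refused */

    pool->free_pages--;
    return true;
}
```

With free_pages = 3 and low_watermark = 2, a non-critical socket gets
one page and is then refused, while a critical socket can still take
the remaining two.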

The knee jerk reaction is "mark IPSEC's sockets critical, and mark the
tunneling allocations critical, and... and..."  well you have
GFP_ATOMIC then my friend.

In short, these "separate page pool" and "critical socket" ideas do
not work and we need a different solution, I'm sorry folks spent so
much time on them, but they are heavily flawed.

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Sridhar Samudrala @ 2005-12-15  7:37 UTC (permalink / raw)
  To: David S. Miller; +Cc: mpm, sri, ak, linux-kernel, netdev
In-Reply-To: <20051214.203023.129054759.davem@davemloft.net>

On Wed, 14 Dec 2005, David S. Miller wrote:

> From: Matt Mackall <mpm@selenic.com>
> Date: Wed, 14 Dec 2005 19:39:37 -0800
>
> > I think we need a global receive pool and per-socket send pools.
>
> Mind telling everyone how you plan to make use of the global receive
> pool when the allocation happens in the device driver and we have no
> idea which socket the packet is destined for?  What should be done for
> non-local packets being routed?  The device drivers allocate packets
> for the entire system, long before we know who the eventually received
> packets are for.  It is fully anonymous memory, and it's easy to
> design cases where the whole pool can be eaten up by non-local
> forwarded packets.
>
> I truly dislike these patches being discussed because they are a
> complete hack, and admittedly don't even solve the problem fully.  I
> don't have any concrete better ideas but that doesn't mean this stuff
> should go into the tree.
>
> I think GFP_ATOMIC memory pools are more powerful than they are given
> credit for.  There is nothing preventing the implementation of dynamic
> GFP_ATOMIC watermarks, and having "critical" socket behavior "kick in"
> in response to hitting those water marks.

Does this mean that you are OK with having a mechanism to mark the
sockets as critical and dropping the non critical packets under
emergency, but you do not like having a separate critical page pool?

Instead, you seem to be suggesting in_emergency to be set dynamically
when we are about to run out of ATOMIC memory. Is this right?

Thanks
Sridhar

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Stephen Hemminger @ 2005-12-15  6:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David S. Miller, mpm, sri, ak, linux-kernel, netdev
In-Reply-To: <20051215054245.GD18862@brahms.suse.de>

On Thu, 15 Dec 2005 06:42:45 +0100
Andi Kleen <ak@suse.de> wrote:

> On Wed, Dec 14, 2005 at 08:30:23PM -0800, David S. Miller wrote:
> > From: Matt Mackall <mpm@selenic.com>
> > Date: Wed, 14 Dec 2005 19:39:37 -0800
> > 
> > > I think we need a global receive pool and per-socket send pools.
> > 
> > Mind telling everyone how you plan to make use of the global receive
> > pool when the allocation happens in the device driver and we have no
> > idea which socket the packet is destined for?  What should be done for
> 
> In theory one could use multiple receive queues on an intelligent enough
> NIC, with the NIC distinguishing the sockets.
> 
> But that would be still a nasty "you need advanced hardware FOO to avoid
> subtle problem Y" case. Also it would require lots of driver hacking.
> 
> And most NICs seem to have limits on the size of the socket tables for this, which
> means you would end up in an "only N sockets supported safely" situation,
> with N likely being quite small on common hardware.
> 
> I think the idea of the original poster was that just freeing non critical packets
> after a short time again would be good enough, but I'm a bit sceptical
> on that.
> 
> > I truly dislike these patches being discussed because they are a
> > complete hack, and admittedly don't even solve the problem fully.  I
> 
> I agree. 
> 
> > I think GFP_ATOMIC memory pools are more powerful than they are given
> > credit for.  There is nothing preventing the implementation of dynamic
> 
> Their main problem is that they are used too widely and in a lot
> of situations that aren't really critical.

Most of the use of GFP_ATOMIC is by stuff that could fail but can't
sleep waiting for memory. How about adding a GFP_NORMAL for allocations
while holding a lock?

#define GFP_NORMAL (__GFP_NOMEMALLOC)

Then get people to change the unneeded GFP_ATOMIC's to GFP_NORMAL in
places where the error paths are reasonable.
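
A rough userspace model of the intended difference, as a sketch only:
the MODEL_* flag values, struct model_zone and model_alloc_page() are
made up for illustration and are not the kernel's definitions. The
assumption is that an ATOMIC-style mask may dip into a reserve, while
the proposed GFP_NORMAL mask never does; neither one sleeps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model flag bits; the values are illustrative, not the kernel's. */
#define MODEL_GFP_HIGH        0x1u  /* caller may dip into the reserve */
#define MODEL_GFP_NOMEMALLOC  0x2u  /* caller must never touch reserves */

#define MODEL_GFP_ATOMIC  (MODEL_GFP_HIGH)
#define MODEL_GFP_NORMAL  (MODEL_GFP_NOMEMALLOC)

struct model_zone {
    size_t free_pages;
    size_t reserve_pages;  /* pages held back for high-priority callers */
};

/* Never sleeps: either hands out a page right away or fails. */
static bool model_alloc_page(struct model_zone *z, unsigned int flags)
{
    size_t floor;

    assert(z != NULL);

    /* Only a HIGH request without NOMEMALLOC may go below the reserve. */
    floor = z->reserve_pages;
    if ((flags & MODEL_GFP_HIGH) && !(flags & MODEL_GFP_NOMEMALLOC))
        floor = 0;

    if (z->free_pages <= floor)
        return false;      /* caller takes its error path */

    z->free_pages--;
    return true;
}
```

In this model a MODEL_GFP_NORMAL caller starts failing as soon as only
the reserve is left, so callers with reasonable error paths stop
competing with the ones that really need MODEL_GFP_ATOMIC.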

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Stephen Hemminger @ 2005-12-15  5:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: mpm, sri, ak, linux-kernel, netdev
In-Reply-To: <20051214.212309.127095596.davem@davemloft.net>

On Wed, 14 Dec 2005 21:23:09 -0800 (PST)
"David S. Miller" <davem@davemloft.net> wrote:

> From: Matt Mackall <mpm@selenic.com>
> Date: Wed, 14 Dec 2005 21:02:50 -0800
> 
> > There needs to be two rules:
> > 
> > iff global memory critical flag is set
> > - allocate from the global critical receive pool on receive
> > - return packet to global pool if not destined for a socket with an
> >   attached send mempool
> 
> This shuts off a router and/or firewall just because iSCSI or NFS peed
> in its pants.  Not really acceptable.
> 
> > I think this will provide the desired behavior
> 
> It's not desirable.
> 
> What if iSCSI is protected by IPSEC, and the key management daemon has
> to process a security association expiration and negotiate a new one
> in order for iSCSI to further communicate with its peer when this
> memory shortage occurs?  It needs to send packets back and forth with
> the remote key management daemon in order to do this, but since you
> cut it off with this critical receive pool, the negotiation will never
> succeed.
> 
> This stuff won't work.  It's not a generic solution and that's
> why it has more holes than swiss cheese. :-)

Also, all this stuff is just a band aid because linux OOM behavior is so
fucked up. The VM system just lets the user dig themselves into a huge
over commit, then we get into trying to change every other system to
compensate.  How about cutting things off earlier, and not falling
off the cliff? How about pushing out pages to swap earlier when memory
pressure starts to get noticed? Then you can free those non-dirty pages
to make progress. Too many of the VM decisions seem to be made in favor
of keep-it-in-memory benchmark situations.

^ permalink raw reply

* Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
From: Nick Piggin @ 2005-12-15  5:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: mpm, sri, ak, linux-kernel, netdev
In-Reply-To: <20051214.212309.127095596.davem@davemloft.net>

David S. Miller wrote:
> From: Matt Mackall <mpm@selenic.com>
> Date: Wed, 14 Dec 2005 21:02:50 -0800
> 
> 
>>There needs to be two rules:
>>
>>iff global memory critical flag is set
>>- allocate from the global critical receive pool on receive
>>- return packet to global pool if not destined for a socket with an
>>  attached send mempool
> 
> 
> This shuts off a router and/or firewall just because iSCSI or NFS peed
> in its pants.  Not really acceptable.
> 

But that should only happen (shut off a router and/or firewall) in cases
where we now completely deadlock and never recover, including shutting off
the router and firewall, because they don't have enough memory to recv
packets either.
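
To make the trade-off concrete, the two quoted rules can be modelled
in userspace roughly like this. It is a sketch under assumptions, not
code from the patches: struct rx_state, enum rx_verdict and
rx_packet() are hypothetical names, and "has_send_mempool" stands in
for "destined for a socket with an attached send mempool".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rx_verdict { RX_DELIVER, RX_DROP_TO_POOL, RX_NO_BUFFER };

struct rx_state {
    bool memory_critical;  /* hypothetical global "memory critical" flag */
    size_t normal_free;    /* buffers left in the ordinary allocator */
    size_t critical_free;  /* buffers left in the global critical receive pool */
};

/* Decide what happens to one received packet under the quoted rules. */
static enum rx_verdict rx_packet(struct rx_state *st, bool has_send_mempool)
{
    assert(st != NULL);

    if (!st->memory_critical) {
        if (st->normal_free == 0)
            return RX_NO_BUFFER;
        st->normal_free--;
        return RX_DELIVER;
    }

    /* Rule 1: in critical mode, receive buffers come from the critical pool. */
    if (st->critical_free == 0)
        return RX_NO_BUFFER;

    /* Rule 2: anything not for a mempool-backed socket goes straight back. */
    if (!has_send_mempool)
        return RX_DROP_TO_POOL;  /* buffer borrowed and returned: net zero */

    st->critical_free--;         /* held until the socket consumes the packet */
    return RX_DELIVER;
}
```

Calling rx_packet() with has_send_mempool = false while memory_critical
is set models a routed or firewalled packet: it is always dropped,
which is the behaviour being objected to in the quote.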

> 
>>I think this will provide the desired behavior
> 
> 
> It's not desirable.
> 
> What if iSCSI is protected by IPSEC, and the key management daemon has
> to process a security association expiration and negotiate a new one
> in order for iSCSI to further communicate with its peer when this
> memory shortage occurs?  It needs to send packets back and forth with
> the remote key management daemon in order to do this, but since you
> cut it off with this critical receive pool, the negotiation will never
> succeed.
> 

I guess IPSEC would be a critical socket too, in that case. Sure
there is nothing we can do if the daemon insists on allocating lots
of memory...

> This stuff won't work.  It's not a generic solution and that's
> why it has more holes than swiss cheese. :-)

True it will have holes. I think something that is complementary and
would be desirable is to simply limit the amount of in-flight writeout
that things like NFS allows (or used to allow, haven't checked for a
while and there were noises about it getting better).
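
As a sketch of what such a limit could look like (a userspace model
only; struct writeout_limiter and both functions are hypothetical
names, and this is not how NFS actually does its accounting):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct writeout_limiter {
    size_t in_flight;  /* pages submitted but not yet acknowledged */
    size_t limit;      /* hypothetical cap on outstanding writeout */
};

/* Try to start writing one more page; refuse once the cap is reached. */
static bool writeout_try_start(struct writeout_limiter *w)
{
    assert(w != NULL);
    if (w->in_flight >= w->limit)
        return false;  /* caller has to wait for a completion first */
    w->in_flight++;
    return true;
}

/* Called when the server acknowledges one page. */
static void writeout_complete(struct writeout_limiter *w)
{
    assert(w != NULL && w->in_flight > 0);
    w->in_flight--;
}
```

The point of the cap is that the memory pinned by outstanding writes
stays bounded by limit pages, whatever the rest of the system does.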

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply
