From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alexander Graf <agraf@suse.de>
Date: Tue, 06 May 2014 07:21:45 +0000
Subject: Re: [PATCH] KVM: PPC: BOOK3S: HV: Don't try to allocate from kernel page allocator for hash page table.
Message-Id: <53688D89.1070201@suse.de>
List-Id: <kvm-ppc.vger.kernel.org>
References: <1399224322-22028-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>	
 <53677558.50900@suse.de> <87r4489ttk.fsf@linux.vnet.ibm.com>	
 <20FFDF8F-1A3D-4719-B492-1E4B70F9D1B4@suse.de>	
 <1399334797.20388.71.camel@pasglop> <536889C6.1050603@suse.de>
 <1399360775.20388.112.camel@pasglop>
In-Reply-To: <1399360775.20388.112.camel@pasglop>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>, "paulus@samba.org" <paulus@samba.org>, "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>, "kvm-ppc@vger.kernel.org" <kvm-ppc@vger.kernel.org>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>


On 06.05.14 09:19, Benjamin Herrenschmidt wrote:
> On Tue, 2014-05-06 at 09:05 +0200, Alexander Graf wrote:
>> On 06.05.14 02:06, Benjamin Herrenschmidt wrote:
>>> On Mon, 2014-05-05 at 17:16 +0200, Alexander Graf wrote:
>>>> Isn't this a greater problem? We should start swapping before we hit
>>>> the point where non-movable kernel allocation fails, no?
>>> Possibly but the fact remains, this can be avoided by making sure that
>>> if we create a CMA reserve for KVM, then it uses it rather than using
>>> the rest of main memory for hash tables.
>> So why were we preferring non-CMA memory before? Considering that Aneesh
>> introduced that logic in fa61a4e3 I suppose this was just a mistake?
> I assume so.
>
>>>> The fact that KVM uses a good number of normal kernel pages is maybe
>>>> suboptimal, but shouldn't be a critical problem.
>>> The point is that we explicitly reserve those pages in CMA for use
>>> by KVM for that specific purpose, but the current code tries first
>>> to get them out of the normal pool.
>>>
>>> This is not an optimal behaviour and is what Aneesh's patches are
>>> trying to fix.
>> I agree, and I agree that it's worth it to make better use of our
>> resources. But we still shouldn't crash.
> Well, Linux hitting out of memory conditions has never been a happy
> story :-)
>
>> However, reading through this thread I think I've slowly grasped what
>> the problem is. The hugetlbfs size calculation.
> Not really.
>
>> I guess something in your stack overreserves huge pages because it
>> doesn't account for the fact that some part of system memory is already
>> reserved for CMA.
> Either that or simply Linux runs out because we dirty too fast...
> really, Linux has never been good at dealing with OOM situations,
> especially when things like network drivers and filesystems try to do
> ATOMIC or NOIO allocs...
>
>> So the underlying problem is something completely orthogonal. The patch
>> body as is is fine, but the patch description should simply say that we
>> should prefer the CMA region because it's already reserved for us for
>> this purpose and we make better use of our available resources that way.
> No.
>
> We give a chunk of memory to hugetlbfs, it's all good and fine.
>
> Whatever remains is split between CMA and the normal page allocator.
>
> Without Aneesh's latest patch, when creating guests, KVM starts
> allocating its hash tables from the latter instead of CMA (we never
> allocate from the hugetlb pool afaik, only guest pages do that, not
> hash tables).
>
> So we exhaust the page allocator and get Linux into OOM conditions
> while there's plenty of space in CMA. But the kernel cannot use CMA for
> its own allocations, only to back user pages, which we don't care about
> because our guest pages are covered by our hugetlb reserve :-)

Yes. Write that in the patch description and I'm happy ;).
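
To make the allocation-order argument concrete, here is a toy model of
the two policies. It is only a sketch under loud assumptions: Pool,
alloc_hpt and run are made-up names, the page counts are arbitrary, and
none of this is the actual KVM code. The hugetlb reserve is left out
because, as said above, hash tables never come from it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pool:
    """A pool of free pages (hypothetical model, not a kernel structure)."""
    name: str
    free: int

    def take(self, pages: int) -> bool:
        # Succeed only if the whole request fits in this pool.
        if pages > self.free:
            return False
        self.free -= pages
        return True


def alloc_hpt(pages: int, order: List[Pool]) -> Optional[str]:
    """Take one hash table from the first pool in `order` that has room."""
    for pool in order:
        if pool.take(pages):
            return pool.name
    return None


def run(prefer_cma: bool, guests: int = 4, hpt_pages: int = 256):
    """Create `guests` hash tables and return (normal_free, cma_free)."""
    # Illustrative sizes only (assumption, not measured values).
    normal = Pool("page allocator", 1024)
    cma = Pool("CMA", 2048)
    order = [cma, normal] if prefer_cma else [normal, cma]
    for _ in range(guests):
        alloc_hpt(hpt_pages, order)
    return normal.free, cma.free


if __name__ == "__main__":
    print("page allocator first:", run(prefer_cma=False))
    print("CMA first:           ", run(prefer_cma=True))
```

With the page allocator tried first, the model drains the normal pool
while the CMA reserve sits untouched. With CMA tried first, the normal
pool is left alone for the kernel's own allocations.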


Alex

