* QEMU bumping memory bug analysis @ 2015-06-05 16:43 Wei Liu 2015-06-05 16:58 ` Ian Campbell 2015-06-05 17:10 ` Stefano Stabellini 0 siblings, 2 replies; 38+ messages in thread From: Wei Liu @ 2015-06-05 16:43 UTC (permalink / raw) To: xen-devel Cc: wei.liu2, Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz Hi all This bug is now considered a blocker for 4.6 release. The premises of the problem remain the same (George's translated version): 1. QEMU may need extra pages from Xen to implement option ROMS, and so at the moment it calls set_max_mem() to increase max_pages so that it can allocate more pages to the guest. libxl doesn't know what max_pages a domain needs prior to qemu start-up. 2. Libxl doesn't know max_pages even after qemu start-up, because there is no mechanism to communicate between qemu and libxl. 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages. Those pages are only accounted for in the hypervisor. Libxl (currently) does not extract that value from the hypervisor. Several solutions were proposed: 1. Add a new record type in libxc migration stream and call setmaxmem in the middle of xc migration stream. Main objections are calling xc_domain_setmaxmem in the middle of xc migration stream is layer violation. Also this prevents us from disaggregating domain construction to a less privileged domain. 2. Use libxl toolstack save restore blob to tranmit max pages information to remote end. This is considered a bodge and has been proven not to work because toolstack blob restore happens after xc_domain_restore. 3. Add a libxl layer that wraps necessary information, take over Andrew's work on libxl migration v2. Having a libxl layer that's not part of migration v2 is a waste of effort. There are several obstacles for libxl migration v2 at the moment. Libxl layer in migration v2 still has unresolved issues. It has inter-dependency with Remus / COLO. Most importantly it doesn't inherently solve the problem. It still requires the current libxl JSON blob to contain information about max pages (or information used to derive max pages). Andrew, correct me if I'm wrong. 4. Add a none user configurable field in current libxl JSON structure to record max pages information. This is not desirable. All fields in libxl JSON should be user configurable. 5. Add a user configurable field in current libxl JSON structure to record how much more memory this domain needs. Admin is required to fill in that value manually. In the mean time we revert the change in QEMU and declare QEMU with that change buggy. No response to this so far. But in fact I consider this the most viable solution. It's a simple enough solution that is achievable within 4.6 time frame. It doesn't prevent us from doing useful work in the future (disaggregated architecture with stricter security policy). It provides a way to work around buggy QEMU (admin sets that value to prevent QEMU from bumping memory limit). It's orgthogonal to migration v2 which means it won't be blocked by migration v2 or block migration v2. I tend to go with solution 5. Speak up if you don't agree with my analysis or you think I miss some aspects. For long term we need to: 1. Establish libxl as the arbitrator how much pages a domain can have. Anything else doing stuff behind arbitrator's back is considered buggy. This principle probably apply to other aspects of a domain as well. 2. Work out a solution communicate between QEMU and libxl. This can be expanded to cover other components in a Xen setup, but currently we only have QEMU. Wei. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 16:43 QEMU bumping memory bug analysis Wei Liu @ 2015-06-05 16:58 ` Ian Campbell 2015-06-05 17:13 ` Stefano Stabellini ` (2 more replies) 2015-06-05 17:10 ` Stefano Stabellini 1 sibling, 3 replies; 38+ messages in thread From: Ian Campbell @ 2015-06-05 16:58 UTC (permalink / raw) To: Wei Liu Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote: > 3. Add a libxl layer that wraps necessary information, take over > Andrew's work on libxl migration v2. Having a libxl layer that's not > part of migration v2 is a waste of effort. > > There are several obstacles for libxl migration v2 at the moment. Libxl > layer in migration v2 still has unresolved issues. It has > inter-dependency with Remus / COLO. > > Most importantly it doesn't inherently solve the problem. It still > requires the current libxl JSON blob to contain information about max > pages It doesn't require that, the whole point of the libxl layer is to provide a suitable home for that information which is not the current libxl json blob (which is user facing cfg data) or the libxc stream (which is opaque to libxl). Once you have the general concept of the libxl layer, adding a new field to it will be trivial (because it will have been designed to be trivially extendable). > (or information used to derive max pages). > > Andrew, correct me if I'm wrong. > > 4. Add a none user configurable field in current libxl JSON structure to > record max pages information. > > This is not desirable. All fields in libxl JSON should be user > configurable. > > 5. Add a user configurable field in current libxl JSON structure to > record how much more memory this domain needs. Admin is required to > fill in that value manually. In the mean time we revert the change in > QEMU and declare QEMU with that change buggy. > > No response to this so far. But in fact I consider this the most viable > solution. I initially thought that this was just #4 in a silly hat and was therefore no more acceptable than that. But actually I think you are suggesting that users should have to manually request additional RAM for option roms via some new interface and that the old thing in qemu should be deprecated and removed? How would a user know what value to use here? Just "a bigger one till it works"? That's, well, not super... Ian. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 16:58 ` Ian Campbell @ 2015-06-05 17:13 ` Stefano Stabellini 2015-06-05 19:06 ` Wei Liu 2015-06-05 17:17 ` Andrew Cooper 2015-06-05 17:39 ` Wei Liu 2 siblings, 1 reply; 38+ messages in thread From: Stefano Stabellini @ 2015-06-05 17:13 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, 5 Jun 2015, Ian Campbell wrote: > On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote: > > > 3. Add a libxl layer that wraps necessary information, take over > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > part of migration v2 is a waste of effort. > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > layer in migration v2 still has unresolved issues. It has > > inter-dependency with Remus / COLO. > > > > Most importantly it doesn't inherently solve the problem. It still > > requires the current libxl JSON blob to contain information about max > > pages > > It doesn't require that, the whole point of the libxl layer is to > provide a suitable home for that information which is not the current > libxl json blob (which is user facing cfg data) or the libxc stream > (which is opaque to libxl). > > Once you have the general concept of the libxl layer, adding a new field > to it will be trivial (because it will have been designed to be > trivially extendable). > > > (or information used to derive max pages). > > > > Andrew, correct me if I'm wrong. > > > > 4. Add a none user configurable field in current libxl JSON structure to > > record max pages information. > > > > This is not desirable. All fields in libxl JSON should be user > > configurable. > > > > 5. Add a user configurable field in current libxl JSON structure to > > record how much more memory this domain needs. Admin is required to > > fill in that value manually. In the mean time we revert the change in > > QEMU and declare QEMU with that change buggy. > > > > No response to this so far. But in fact I consider this the most viable > > solution. > > I initially thought that this was just #4 in a silly hat and was > therefore no more acceptable than that. > > But actually I think you are suggesting that users should have to > manually request additional RAM for option roms via some new interface > and that the old thing in qemu should be deprecated and removed? > > How would a user know what value to use here? Just "a bigger one till it > works"? That's, well, not super... This should not be a user configurable field. In fact it only depends on the QEMU version in use. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 17:13 ` Stefano Stabellini @ 2015-06-05 19:06 ` Wei Liu 0 siblings, 0 replies; 38+ messages in thread From: Wei Liu @ 2015-06-05 19:06 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, Jun 05, 2015 at 06:13:01PM +0100, Stefano Stabellini wrote: > On Fri, 5 Jun 2015, Ian Campbell wrote: > > On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote: > > > > > 3. Add a libxl layer that wraps necessary information, take over > > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > > part of migration v2 is a waste of effort. > > > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > > layer in migration v2 still has unresolved issues. It has > > > inter-dependency with Remus / COLO. > > > > > > Most importantly it doesn't inherently solve the problem. It still > > > requires the current libxl JSON blob to contain information about max > > > pages > > > > It doesn't require that, the whole point of the libxl layer is to > > provide a suitable home for that information which is not the current > > libxl json blob (which is user facing cfg data) or the libxc stream > > (which is opaque to libxl). > > > > Once you have the general concept of the libxl layer, adding a new field > > to it will be trivial (because it will have been designed to be > > trivially extendable). > > > > > (or information used to derive max pages). > > > > > > Andrew, correct me if I'm wrong. > > > > > > 4. Add a none user configurable field in current libxl JSON structure to > > > record max pages information. > > > > > > This is not desirable. All fields in libxl JSON should be user > > > configurable. > > > > > > 5. Add a user configurable field in current libxl JSON structure to > > > record how much more memory this domain needs. Admin is required to > > > fill in that value manually. In the mean time we revert the change in > > > QEMU and declare QEMU with that change buggy. > > > > > > No response to this so far. But in fact I consider this the most viable > > > solution. > > > > I initially thought that this was just #4 in a silly hat and was > > therefore no more acceptable than that. > > > > But actually I think you are suggesting that users should have to > > manually request additional RAM for option roms via some new interface > > and that the old thing in qemu should be deprecated and removed? > > > > How would a user know what value to use here? Just "a bigger one till it > > works"? That's, well, not super... > > This should not be a user configurable field. In fact it only depends on > the QEMU version in use. That field is generic so that we can use it to add some extra pages to domain. Using it to cover QEMU option roms would be one use case. It's not very nice, but it's straight-forward. Wei. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 16:58 ` Ian Campbell 2015-06-05 17:13 ` Stefano Stabellini @ 2015-06-05 17:17 ` Andrew Cooper 2015-06-05 17:39 ` Wei Liu 2 siblings, 0 replies; 38+ messages in thread From: Andrew Cooper @ 2015-06-05 17:17 UTC (permalink / raw) To: Ian Campbell, Wei Liu Cc: George Dunlap, Stefano Stabellini, Ian Jackson, dslutz, xen-devel On 05/06/15 17:58, Ian Campbell wrote: > On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote: > >> 3. Add a libxl layer that wraps necessary information, take over >> Andrew's work on libxl migration v2. Having a libxl layer that's not >> part of migration v2 is a waste of effort. >> >> There are several obstacles for libxl migration v2 at the moment. Libxl >> layer in migration v2 still has unresolved issues. It has >> inter-dependency with Remus / COLO. >> >> Most importantly it doesn't inherently solve the problem. It still >> requires the current libxl JSON blob to contain information about max >> pages > It doesn't require that, the whole point of the libxl layer is to > provide a suitable home for that information which is not the current > libxl json blob (which is user facing cfg data) or the libxc stream > (which is opaque to libxl). > > Once you have the general concept of the libxl layer, adding a new field > to it will be trivial (because it will have been designed to be > trivially extendable). Oh - I had not realised your intention of having some data blob in the libxl stream which is not the JSON configuration. Conceptually that would work, although this new information would still have to be at the head of the stream. i.e. head of the libxc blob. ~Andrew ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 16:58 ` Ian Campbell 2015-06-05 17:13 ` Stefano Stabellini 2015-06-05 17:17 ` Andrew Cooper @ 2015-06-05 17:39 ` Wei Liu 2 siblings, 0 replies; 38+ messages in thread From: Wei Liu @ 2015-06-05 17:39 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, Jun 05, 2015 at 05:58:11PM +0100, Ian Campbell wrote: > On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote: > > > 3. Add a libxl layer that wraps necessary information, take over > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > part of migration v2 is a waste of effort. > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > layer in migration v2 still has unresolved issues. It has > > inter-dependency with Remus / COLO. > > > > Most importantly it doesn't inherently solve the problem. It still > > requires the current libxl JSON blob to contain information about max > > pages > > It doesn't require that, the whole point of the libxl layer is to > provide a suitable home for that information which is not the current > libxl json blob (which is user facing cfg data) or the libxc stream > (which is opaque to libxl). > Right, it doesn't have to be in the user facing blob. I was wrong on that one. > Once you have the general concept of the libxl layer, adding a new field > to it will be trivial (because it will have been designed to be > trivially extendable). > In light of Andrew's reply when we talked about JSON we were referring to subtly different things. This libxl layer might be a viable solution. I need to check it again. The concern of not able to make it in time for 4.6 remains. > > (or information used to derive max pages). > > > > Andrew, correct me if I'm wrong. > > > > 4. Add a none user configurable field in current libxl JSON structure to > > record max pages information. > > > > This is not desirable. All fields in libxl JSON should be user > > configurable. > > > > 5. Add a user configurable field in current libxl JSON structure to > > record how much more memory this domain needs. Admin is required to > > fill in that value manually. In the mean time we revert the change in > > QEMU and declare QEMU with that change buggy. > > > > No response to this so far. But in fact I consider this the most viable > > solution. > > I initially thought that this was just #4 in a silly hat and was > therefore no more acceptable than that. > > But actually I think you are suggesting that users should have to > manually request additional RAM for option roms via some new interface > and that the old thing in qemu should be deprecated and removed? > Yes. I'm considering removing xc_domain_setmaxmem needs regardless of this bug because that's going to cause problem in QEMU upstream stubdom with strict XSM policy and deprivileged QEMU (may not have privilege to call setmaxmem). The security implication as it is now is big enough. One malicious guest that controls QEMU has a vector to DoS hypervisor by setting its own max_pages to -1; > How would a user know what value to use here? Just "a bigger one till it > works"? That's, well, not super... > Not very good, but should work. A few trial and error will give you the acceptable value. And this is easily superseded by any other more advanced solutions. This is going to be our last resort if Andrew's work is not a viable solution within 4.6 time frame. Wei. > Ian. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 16:43 QEMU bumping memory bug analysis Wei Liu 2015-06-05 16:58 ` Ian Campbell @ 2015-06-05 17:10 ` Stefano Stabellini 2015-06-05 18:10 ` Wei Liu ` (2 more replies) 1 sibling, 3 replies; 38+ messages in thread From: Stefano Stabellini @ 2015-06-05 17:10 UTC (permalink / raw) To: Wei Liu Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, 5 Jun 2015, Wei Liu wrote: > Hi all > > This bug is now considered a blocker for 4.6 release. > > The premises of the problem remain the same (George's translated > version): > > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at > the moment it calls set_max_mem() to increase max_pages so that it can > allocate more pages to the guest. libxl doesn't know what max_pages a > domain needs prior to qemu start-up. > > 2. Libxl doesn't know max_pages even after qemu start-up, because there > is no mechanism to communicate between qemu and libxl. I might not know what is the right design for the overall solution, but I do know that libxl shouldn't have its own state tracking for max_pages, because max_pages is kept, maintained and enforced by Xen. Ian might still remember, but at the beginning of the xl/libxl project, we had few simple design principles. One of which was that we should not have two places where we keep track of the same thing. If Xen keeps track of something, libxl should avoid it. In this specific case, libxl can ask Xen at any time what max_pages is for the domain, so I don't think that libxl should store it or have its own tracking for it. Even if QEMU called into libxl to change max_pages, I don't think that libxl should store max_pages anywhere. It is already stored in Xen and can be retrieved at any time. > 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages. > Those pages are only accounted for in the hypervisor. Libxl > (currently) does not extract that value from the hypervisor. > > Several solutions were proposed: > > 1. Add a new record type in libxc migration stream and call setmaxmem > in the middle of xc migration stream. > > Main objections are calling xc_domain_setmaxmem in the middle of xc > migration stream is layer violation. Also this prevents us from > disaggregating domain construction to a less privileged domain. It seems to me that max_pages is one of the memory properties of the domain, so it should be saved and restored together with the rest of memory. > 2. Use libxl toolstack save restore blob to tranmit max pages > information to remote end. > > This is considered a bodge and has been proven not to work because > toolstack blob restore happens after xc_domain_restore. Saving and restoring max_pages in libxl seems to me like a layering violation. I would avoid 2. 3. 4. and 5. > 3. Add a libxl layer that wraps necessary information, take over > Andrew's work on libxl migration v2. Having a libxl layer that's not > part of migration v2 is a waste of effort. > > There are several obstacles for libxl migration v2 at the moment. Libxl > layer in migration v2 still has unresolved issues. It has > inter-dependency with Remus / COLO. > > Most importantly it doesn't inherently solve the problem. It still > requires the current libxl JSON blob to contain information about max > pages (or information used to derive max pages). > > Andrew, correct me if I'm wrong. > > 4. Add a none user configurable field in current libxl JSON structure to > record max pages information. > > This is not desirable. All fields in libxl JSON should be user > configurable. > > 5. Add a user configurable field in current libxl JSON structure to > record how much more memory this domain needs. Admin is required to > fill in that value manually. In the mean time we revert the change in > QEMU and declare QEMU with that change buggy. QEMU 2.3.0 was released with that change in it, so it is not quite possible to revert it. Also I think it is the right change for QEMU. > No response to this so far. But in fact I consider this the most viable > solution. > > It's a simple enough solution that is achievable within 4.6 time frame. > It doesn't prevent us from doing useful work in the future > (disaggregated architecture with stricter security policy). It provides > a way to work around buggy QEMU (admin sets that value to prevent QEMU > from bumping memory limit). It's orgthogonal to migration v2 which means > it won't be blocked by migration v2 or block migration v2. > > I tend to go with solution 5. Speak up if you don't agree with my > analysis or you think I miss some aspects. I disagree > For long term we need to: > > 1. Establish libxl as the arbitrator how much pages a domain can have. > Anything else doing stuff behind arbitrator's back is considered > buggy. This principle probably apply to other aspects of a domain as > well. I disagree that libxl should be the arbitrator of a property that is stored, maintained and enforced by Xen. Xen should be the arbitrator. > 2. Work out a solution communicate between QEMU and libxl. This can be > expanded to cover other components in a Xen setup, but currently we > only have QEMU. Even if QEMU called into libxl to change maxmem, I don't think that libxl should store maxmem anywhere. It is already stored in Xen. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 17:10 ` Stefano Stabellini @ 2015-06-05 18:10 ` Wei Liu 2015-06-08 11:39 ` Stefano Stabellini 2015-06-05 18:49 ` Ian Campbell 2015-06-08 14:14 ` George Dunlap 2 siblings, 1 reply; 38+ messages in thread From: Wei Liu @ 2015-06-05 18:10 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote: > On Fri, 5 Jun 2015, Wei Liu wrote: > > Hi all > > > > This bug is now considered a blocker for 4.6 release. > > > > The premises of the problem remain the same (George's translated > > version): > > > > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at > > the moment it calls set_max_mem() to increase max_pages so that it can > > allocate more pages to the guest. libxl doesn't know what max_pages a > > domain needs prior to qemu start-up. > > > > 2. Libxl doesn't know max_pages even after qemu start-up, because there > > is no mechanism to communicate between qemu and libxl. > > I might not know what is the right design for the overall solution, but > I do know that libxl shouldn't have its own state tracking for > max_pages, because max_pages is kept, maintained and enforced by Xen. > > Ian might still remember, but at the beginning of the xl/libxl project, > we had few simple design principles. One of which was that we should not > have two places where we keep track of the same thing. If Xen keeps > track of something, libxl should avoid it. > > In this specific case, libxl can ask Xen at any time what max_pages is > for the domain, so I don't think that libxl should store it or have its > own tracking for it. > > Even if QEMU called into libxl to change max_pages, I don't think that > libxl should store max_pages anywhere. It is already stored in Xen and > can be retrieved at any time. > I think you're talking about keep track of that record permanently. I only care about getting the right value at the right time and transfer it to the other end. Getting the value whenever needed is OK. > > > 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages. > > Those pages are only accounted for in the hypervisor. Libxl > > (currently) does not extract that value from the hypervisor. > > > > Several solutions were proposed: > > > > 1. Add a new record type in libxc migration stream and call setmaxmem > > in the middle of xc migration stream. > > > > Main objections are calling xc_domain_setmaxmem in the middle of xc > > migration stream is layer violation. Also this prevents us from > > disaggregating domain construction to a less privileged domain. > > It seems to me that max_pages is one of the memory properties of the > domain, so it should be saved and restored together with the rest of > memory. > #1 was actually referring to Don's patch. When processing libxc record in the middle of the stream, we should not alter the size of memory. It's not safe because we don't know whether that record comes early enough before we exceed the limit. The only safe way of doing it is to mandate that specific record at the beginning of libxc stream, which might have other implications. > > > 2. Use libxl toolstack save restore blob to tranmit max pages > > information to remote end. > > > > This is considered a bodge and has been proven not to work because > > toolstack blob restore happens after xc_domain_restore. > > Saving and restoring max_pages in libxl seems to me like a layering > violation. I would avoid 2. 3. 4. and 5. > No, not really. If we follow the principle of "libxl is the arbitrator". It needs to have thorough information on every aspect of the domain and set up limit. > > > 3. Add a libxl layer that wraps necessary information, take over > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > part of migration v2 is a waste of effort. > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > layer in migration v2 still has unresolved issues. It has > > inter-dependency with Remus / COLO. > > > > Most importantly it doesn't inherently solve the problem. It still > > requires the current libxl JSON blob to contain information about max > > pages (or information used to derive max pages). > > > > Andrew, correct me if I'm wrong. > > > > 4. Add a none user configurable field in current libxl JSON structure to > > record max pages information. > > > > This is not desirable. All fields in libxl JSON should be user > > configurable. > > > > 5. Add a user configurable field in current libxl JSON structure to > > record how much more memory this domain needs. Admin is required to > > fill in that value manually. In the mean time we revert the change in > > QEMU and declare QEMU with that change buggy. > > QEMU 2.3.0 was released with that change in it, so it is not quite > possible to revert it. Also I think it is the right change for QEMU. > It has security implications. Here is my reply copied from my mail to Ian: I'm considering removing xc_domain_setmaxmem needs regardless of this bug because that's going to cause problem in QEMU upstream stubdom with strict XSM policy and deprivileged QEMU (may not have privilege to call setmaxmem). The security implication as it is now is big enough. One malicious guest that controls QEMU has a vector to DoS hypervisor by setting its own max_pages to -1; > > > No response to this so far. But in fact I consider this the most viable > > solution. > > > > It's a simple enough solution that is achievable within 4.6 time frame. > > It doesn't prevent us from doing useful work in the future > > (disaggregated architecture with stricter security policy). It provides > > a way to work around buggy QEMU (admin sets that value to prevent QEMU > > from bumping memory limit). It's orgthogonal to migration v2 which means > > it won't be blocked by migration v2 or block migration v2. > > > > I tend to go with solution 5. Speak up if you don't agree with my > > analysis or you think I miss some aspects. > > I disagree > > > > For long term we need to: > > > > 1. Establish libxl as the arbitrator how much pages a domain can have. > > Anything else doing stuff behind arbitrator's back is considered > > buggy. This principle probably apply to other aspects of a domain as > > well. > > I disagree that libxl should be the arbitrator of a property that is > stored, maintained and enforced by Xen. Xen should be the arbitrator. > But it would be a burden for Xen once the arbitration goes beyond the issue we discuss here. It certainly doesn't have as much as information libxl has. In the long run we would still end up having libxl doing most of the work so we might as well establish the principle now. Wei. > > > 2. Work out a solution communicate between QEMU and libxl. This can be > > expanded to cover other components in a Xen setup, but currently we > > only have QEMU. > > Even if QEMU called into libxl to change maxmem, I don't think that > libxl should store maxmem anywhere. It is already stored in Xen. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 18:10 ` Wei Liu @ 2015-06-08 11:39 ` Stefano Stabellini 2015-06-08 12:14 ` Andrew Cooper 2015-06-08 13:10 ` Wei Liu 0 siblings, 2 replies; 38+ messages in thread From: Stefano Stabellini @ 2015-06-08 11:39 UTC (permalink / raw) To: Wei Liu Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, 5 Jun 2015, Wei Liu wrote: > On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote: > > On Fri, 5 Jun 2015, Wei Liu wrote: > > > Hi all > > > > > > This bug is now considered a blocker for 4.6 release. > > > > > > The premises of the problem remain the same (George's translated > > > version): > > > > > > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at > > > the moment it calls set_max_mem() to increase max_pages so that it can > > > allocate more pages to the guest. libxl doesn't know what max_pages a > > > domain needs prior to qemu start-up. > > > > > > 2. Libxl doesn't know max_pages even after qemu start-up, because there > > > is no mechanism to communicate between qemu and libxl. > > > > I might not know what is the right design for the overall solution, but > > I do know that libxl shouldn't have its own state tracking for > > max_pages, because max_pages is kept, maintained and enforced by Xen. > > > > Ian might still remember, but at the beginning of the xl/libxl project, > > we had few simple design principles. One of which was that we should not > > have two places where we keep track of the same thing. If Xen keeps > > track of something, libxl should avoid it. > > > > In this specific case, libxl can ask Xen at any time what max_pages is > > for the domain, so I don't think that libxl should store it or have its > > own tracking for it. > > > > Even if QEMU called into libxl to change max_pages, I don't think that > > libxl should store max_pages anywhere. It is already stored in Xen and > > can be retrieved at any time. > > > > I think you're talking about keep track of that record permanently. I > only care about getting the right value at the right time and transfer > it to the other end. Getting the value whenever needed is OK. > > > > > > 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages. > > > Those pages are only accounted for in the hypervisor. Libxl > > > (currently) does not extract that value from the hypervisor. > > > > > > Several solutions were proposed: > > > > > > 1. Add a new record type in libxc migration stream and call setmaxmem > > > in the middle of xc migration stream. > > > > > > Main objections are calling xc_domain_setmaxmem in the middle of xc > > > migration stream is layer violation. Also this prevents us from > > > disaggregating domain construction to a less privileged domain. > > > > It seems to me that max_pages is one of the memory properties of the > > domain, so it should be saved and restored together with the rest of > > memory. > > > > #1 was actually referring to Don's patch. When processing libxc record > in the middle of the stream, we should not alter the size of memory. > It's not safe because we don't know whether that record comes early > enough before we exceed the limit. > > The only safe way of doing it is to mandate that specific record at > the beginning of libxc stream, which might have other implications. I see, that makes sense > > > > > 2. Use libxl toolstack save restore blob to tranmit max pages > > > information to remote end. > > > > > > This is considered a bodge and has been proven not to work because > > > toolstack blob restore happens after xc_domain_restore. > > > > Saving and restoring max_pages in libxl seems to me like a layering > > violation. I would avoid 2. 3. 4. and 5. > > > > No, not really. If we follow the principle of "libxl is the > arbitrator". It needs to have thorough information on every aspect of > the domain and set up limit. I am not sure I buy this "libxl is the arbitrator" concept. I am not seeing libxl as adding much value in this context. > > > 3. Add a libxl layer that wraps necessary information, take over > > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > > part of migration v2 is a waste of effort. > > > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > > layer in migration v2 still has unresolved issues. It has > > > inter-dependency with Remus / COLO. > > > > > > Most importantly it doesn't inherently solve the problem. It still > > > requires the current libxl JSON blob to contain information about max > > > pages (or information used to derive max pages). > > > > > > Andrew, correct me if I'm wrong. > > > > > > 4. Add a none user configurable field in current libxl JSON structure to > > > record max pages information. > > > > > > This is not desirable. All fields in libxl JSON should be user > > > configurable. > > > > > > 5. Add a user configurable field in current libxl JSON structure to > > > record how much more memory this domain needs. Admin is required to > > > fill in that value manually. In the mean time we revert the change in > > > QEMU and declare QEMU with that change buggy. > > > > QEMU 2.3.0 was released with that change in it, so it is not quite > > possible to revert it. Also I think it is the right change for QEMU. > > > > It has security implications. Here is my reply copied from my mail to > Ian: > > I'm considering removing xc_domain_setmaxmem needs regardless of this > bug because that's going to cause problem in QEMU upstream stubdom with > strict XSM policy and deprivileged QEMU (may not have privilege to call > setmaxmem). QEMU running in the stubdom should be able to set the maxmem for its target domain, but not for the others. > The security implication as it is now is big enough. One malicious guest > that controls QEMU has a vector to DoS hypervisor by setting its own > max_pages to -1; Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen side? Could you please give me a bit more info on this security issue? > > > > > No response to this so far. But in fact I consider this the most viable > > > solution. > > > > > > It's a simple enough solution that is achievable within 4.6 time frame. > > > It doesn't prevent us from doing useful work in the future > > > (disaggregated architecture with stricter security policy). It provides > > > a way to work around buggy QEMU (admin sets that value to prevent QEMU > > > from bumping memory limit). It's orgthogonal to migration v2 which means > > > it won't be blocked by migration v2 or block migration v2. > > > > > > I tend to go with solution 5. Speak up if you don't agree with my > > > analysis or you think I miss some aspects. > > > > I disagree > > > > > > > For long term we need to: > > > > > > 1. Establish libxl as the arbitrator how much pages a domain can have. > > > Anything else doing stuff behind arbitrator's back is considered > > > buggy. This principle probably apply to other aspects of a domain as > > > well. > > > > I disagree that libxl should be the arbitrator of a property that is > > stored, maintained and enforced by Xen. Xen should be the arbitrator. > > > > But it would be a burden for Xen once the arbitration goes beyond the > issue we discuss here. It certainly doesn't have as much as information > libxl has. In the long run we would still end up having libxl doing most > of the work so we might as well establish the principle now. I would like to see some concrete examples of how libxl is adding value in this context instead of just adding layers of indirection. > > > > > 2. Work out a solution communicate between QEMU and libxl. This can be > > > expanded to cover other components in a Xen setup, but currently we > > > only have QEMU. > > > > Even if QEMU called into libxl to change maxmem, I don't think that > > libxl should store maxmem anywhere. It is already stored in Xen. > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 11:39 ` Stefano Stabellini @ 2015-06-08 12:14 ` Andrew Cooper 2015-06-08 13:01 ` Stefano Stabellini 2015-06-08 13:10 ` Wei Liu 1 sibling, 1 reply; 38+ messages in thread From: Andrew Cooper @ 2015-06-08 12:14 UTC (permalink / raw) To: Stefano Stabellini, Wei Liu Cc: George Dunlap, Ian Jackson, Ian Campbell, dslutz, xen-devel On 08/06/15 12:39, Stefano Stabellini wrote: > On Fri, 5 Jun 2015, Wei Liu wrote: >> On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote: >>> On Fri, 5 Jun 2015, Wei Liu wrote: >>>> Hi all >>>> >>>> This bug is now considered a blocker for 4.6 release. >>>> >>>> The premises of the problem remain the same (George's translated >>>> version): >>>> >>>> 1. QEMU may need extra pages from Xen to implement option ROMS, and so at >>>> the moment it calls set_max_mem() to increase max_pages so that it can >>>> allocate more pages to the guest. libxl doesn't know what max_pages a >>>> domain needs prior to qemu start-up. >>>> >>>> 2. Libxl doesn't know max_pages even after qemu start-up, because there >>>> is no mechanism to communicate between qemu and libxl. >>> I might not know what is the right design for the overall solution, but >>> I do know that libxl shouldn't have its own state tracking for >>> max_pages, because max_pages is kept, maintained and enforced by Xen. >>> >>> Ian might still remember, but at the beginning of the xl/libxl project, >>> we had few simple design principles. One of which was that we should not >>> have two places where we keep track of the same thing. If Xen keeps >>> track of something, libxl should avoid it. >>> >>> In this specific case, libxl can ask Xen at any time what max_pages is >>> for the domain, so I don't think that libxl should store it or have its >>> own tracking for it. >>> >>> Even if QEMU called into libxl to change max_pages, I don't think that >>> libxl should store max_pages anywhere. It is already stored in Xen and >>> can be retrieved at any time. >>> >> I think you're talking about keep track of that record permanently. I >> only care about getting the right value at the right time and transfer >> it to the other end. Getting the value whenever needed is OK. >> >>>> 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages. >>>> Those pages are only accounted for in the hypervisor. Libxl >>>> (currently) does not extract that value from the hypervisor. >>>> >>>> Several solutions were proposed: >>>> >>>> 1. Add a new record type in libxc migration stream and call setmaxmem >>>> in the middle of xc migration stream. >>>> >>>> Main objections are calling xc_domain_setmaxmem in the middle of xc >>>> migration stream is layer violation. Also this prevents us from >>>> disaggregating domain construction to a less privileged domain. >>> It seems to me that max_pages is one of the memory properties of the >>> domain, so it should be saved and restored together with the rest of >>> memory. >>> >> #1 was actually referring to Don's patch. When processing libxc record >> in the middle of the stream, we should not alter the size of memory. >> It's not safe because we don't know whether that record comes early >> enough before we exceed the limit. >> >> The only safe way of doing it is to mandate that specific record at >> the beginning of libxc stream, which might have other implications. > I see, that makes sense > > >>>> 2. Use libxl toolstack save restore blob to tranmit max pages >>>> information to remote end. >>>> >>>> This is considered a bodge and has been proven not to work because >>>> toolstack blob restore happens after xc_domain_restore. >>> Saving and restoring max_pages in libxl seems to me like a layering >>> violation. I would avoid 2. 3. 4. and 5. >>> >> No, not really. If we follow the principle of "libxl is the >> arbitrator". It needs to have thorough information on every aspect of >> the domain and set up limit. > I am not sure I buy this "libxl is the arbitrator" concept. I am not > seeing libxl as adding much value in this context. > > >>>> 3. Add a libxl layer that wraps necessary information, take over >>>> Andrew's work on libxl migration v2. Having a libxl layer that's not >>>> part of migration v2 is a waste of effort. >>>> >>>> There are several obstacles for libxl migration v2 at the moment. Libxl >>>> layer in migration v2 still has unresolved issues. It has >>>> inter-dependency with Remus / COLO. >>>> >>>> Most importantly it doesn't inherently solve the problem. It still >>>> requires the current libxl JSON blob to contain information about max >>>> pages (or information used to derive max pages). >>>> >>>> Andrew, correct me if I'm wrong. >>>> >>>> 4. Add a none user configurable field in current libxl JSON structure to >>>> record max pages information. >>>> >>>> This is not desirable. All fields in libxl JSON should be user >>>> configurable. >>>> >>>> 5. Add a user configurable field in current libxl JSON structure to >>>> record how much more memory this domain needs. Admin is required to >>>> fill in that value manually. In the mean time we revert the change in >>>> QEMU and declare QEMU with that change buggy. >>> QEMU 2.3.0 was released with that change in it, so it is not quite >>> possible to revert it. Also I think it is the right change for QEMU. >>> >> It has security implications. Here is my reply copied from my mail to >> Ian: >> >> I'm considering removing xc_domain_setmaxmem needs regardless of this >> bug because that's going to cause problem in QEMU upstream stubdom with >> strict XSM policy and deprivileged QEMU (may not have privilege to call >> setmaxmem). > QEMU running in the stubdom should be able to set the maxmem for its > target domain, but not for the others. At the moment, set_max_mem is the only method the toolstack has of putting a hard upper bound on a domains memory usage (along with a few others like the shadow mem size, and PoD cache.) In a disaggregated case, no deprivileged entity should be able to play with this limit. Being able to do so renders the security moot, as a compromised stubdom can force a host OOM. ~Andrew ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 12:14 ` Andrew Cooper @ 2015-06-08 13:01 ` Stefano Stabellini 2015-06-08 13:33 ` Jan Beulich 0 siblings, 1 reply; 38+ messages in thread From: Stefano Stabellini @ 2015-06-08 13:01 UTC (permalink / raw) To: Andrew Cooper Cc: Wei Liu, Ian Campbell, Stefano Stabellini, George Dunlap, Ian Jackson, dslutz, xen-devel On Mon, 8 Jun 2015, Andrew Cooper wrote: > On 08/06/15 12:39, Stefano Stabellini wrote: > > On Fri, 5 Jun 2015, Wei Liu wrote: > >> On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote: > >>> On Fri, 5 Jun 2015, Wei Liu wrote: > >>>> Hi all > >>>> > >>>> This bug is now considered a blocker for 4.6 release. > >>>> > >>>> The premises of the problem remain the same (George's translated > >>>> version): > >>>> > >>>> 1. QEMU may need extra pages from Xen to implement option ROMS, and so at > >>>> the moment it calls set_max_mem() to increase max_pages so that it can > >>>> allocate more pages to the guest. libxl doesn't know what max_pages a > >>>> domain needs prior to qemu start-up. > >>>> > >>>> 2. Libxl doesn't know max_pages even after qemu start-up, because there > >>>> is no mechanism to communicate between qemu and libxl. > >>> I might not know what is the right design for the overall solution, but > >>> I do know that libxl shouldn't have its own state tracking for > >>> max_pages, because max_pages is kept, maintained and enforced by Xen. > >>> > >>> Ian might still remember, but at the beginning of the xl/libxl project, > >>> we had few simple design principles. One of which was that we should not > >>> have two places where we keep track of the same thing. If Xen keeps > >>> track of something, libxl should avoid it. > >>> > >>> In this specific case, libxl can ask Xen at any time what max_pages is > >>> for the domain, so I don't think that libxl should store it or have its > >>> own tracking for it. > >>> > >>> Even if QEMU called into libxl to change max_pages, I don't think that > >>> libxl should store max_pages anywhere. It is already stored in Xen and > >>> can be retrieved at any time. > >>> > >> I think you're talking about keep track of that record permanently. I > >> only care about getting the right value at the right time and transfer > >> it to the other end. Getting the value whenever needed is OK. > >> > >>>> 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages. > >>>> Those pages are only accounted for in the hypervisor. Libxl > >>>> (currently) does not extract that value from the hypervisor. > >>>> > >>>> Several solutions were proposed: > >>>> > >>>> 1. Add a new record type in libxc migration stream and call setmaxmem > >>>> in the middle of xc migration stream. > >>>> > >>>> Main objections are calling xc_domain_setmaxmem in the middle of xc > >>>> migration stream is layer violation. Also this prevents us from > >>>> disaggregating domain construction to a less privileged domain. > >>> It seems to me that max_pages is one of the memory properties of the > >>> domain, so it should be saved and restored together with the rest of > >>> memory. > >>> > >> #1 was actually referring to Don's patch. When processing libxc record > >> in the middle of the stream, we should not alter the size of memory. > >> It's not safe because we don't know whether that record comes early > >> enough before we exceed the limit. > >> > >> The only safe way of doing it is to mandate that specific record at > >> the beginning of libxc stream, which might have other implications. > > I see, that makes sense > > > > > >>>> 2. Use libxl toolstack save restore blob to tranmit max pages > >>>> information to remote end. > >>>> > >>>> This is considered a bodge and has been proven not to work because > >>>> toolstack blob restore happens after xc_domain_restore. > >>> Saving and restoring max_pages in libxl seems to me like a layering > >>> violation. I would avoid 2. 3. 4. and 5. > >>> > >> No, not really. If we follow the principle of "libxl is the > >> arbitrator". It needs to have thorough information on every aspect of > >> the domain and set up limit. > > I am not sure I buy this "libxl is the arbitrator" concept. I am not > > seeing libxl as adding much value in this context. > > > > > >>>> 3. Add a libxl layer that wraps necessary information, take over > >>>> Andrew's work on libxl migration v2. Having a libxl layer that's not > >>>> part of migration v2 is a waste of effort. > >>>> > >>>> There are several obstacles for libxl migration v2 at the moment. Libxl > >>>> layer in migration v2 still has unresolved issues. It has > >>>> inter-dependency with Remus / COLO. > >>>> > >>>> Most importantly it doesn't inherently solve the problem. It still > >>>> requires the current libxl JSON blob to contain information about max > >>>> pages (or information used to derive max pages). > >>>> > >>>> Andrew, correct me if I'm wrong. > >>>> > >>>> 4. Add a none user configurable field in current libxl JSON structure to > >>>> record max pages information. > >>>> > >>>> This is not desirable. All fields in libxl JSON should be user > >>>> configurable. > >>>> > >>>> 5. Add a user configurable field in current libxl JSON structure to > >>>> record how much more memory this domain needs. Admin is required to > >>>> fill in that value manually. In the mean time we revert the change in > >>>> QEMU and declare QEMU with that change buggy. > >>> QEMU 2.3.0 was released with that change in it, so it is not quite > >>> possible to revert it. Also I think it is the right change for QEMU. > >>> > >> It has security implications. Here is my reply copied from my mail to > >> Ian: > >> > >> I'm considering removing xc_domain_setmaxmem needs regardless of this > >> bug because that's going to cause problem in QEMU upstream stubdom with > >> strict XSM policy and deprivileged QEMU (may not have privilege to call > >> setmaxmem). > > QEMU running in the stubdom should be able to set the maxmem for its > > target domain, but not for the others. > > At the moment, set_max_mem is the only method the toolstack has of > putting a hard upper bound on a domains memory usage (along with a few > others like the shadow mem size, and PoD cache.) > > In a disaggregated case, no deprivileged entity should be able to play > with this limit. Being able to do so renders the security moot, as a > compromised stubdom can force a host OOM. You have a good point but I wouldn't say that it renders the security moot, as one needs to break into the stubdom first, and then it still only gets as far as OOMing the host, not compromising other people's data. Without stubdom, with QEMU deprivileged in dom0, this wouldn't be a problem because QEMU can set max_mem before dropping privileges. The set_max_mem calls happen before unpausing the guest. Maybe something similar could be arranged for stubdoms too. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 13:01 ` Stefano Stabellini @ 2015-06-08 13:33 ` Jan Beulich 0 siblings, 0 replies; 38+ messages in thread From: Jan Beulich @ 2015-06-08 13:33 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, IanJackson, dslutz, xen-devel >>> On 08.06.15 at 15:01, <stefano.stabellini@eu.citrix.com> wrote: > On Mon, 8 Jun 2015, Andrew Cooper wrote: >> At the moment, set_max_mem is the only method the toolstack has of >> putting a hard upper bound on a domains memory usage (along with a few >> others like the shadow mem size, and PoD cache.) >> >> In a disaggregated case, no deprivileged entity should be able to play >> with this limit. Being able to do so renders the security moot, as a >> compromised stubdom can force a host OOM. > > You have a good point but I wouldn't say that it renders the security > moot, as one needs to break into the stubdom first, and then it still > only gets as far as OOMing the host, not compromising other people's > data. OOMing the host is quite likely to compromise other people's (i.e. VM's) data, due to Xen having to fail any request to it that involves memory allocation. I'm sure there are code paths here that end in domain_crash(). Jan ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 11:39 ` Stefano Stabellini 2015-06-08 12:14 ` Andrew Cooper @ 2015-06-08 13:10 ` Wei Liu 2015-06-08 13:27 ` Stefano Stabellini 1 sibling, 1 reply; 38+ messages in thread From: Wei Liu @ 2015-06-08 13:10 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, Jun 08, 2015 at 12:39:52PM +0100, Stefano Stabellini wrote: [...] > > > > 3. Add a libxl layer that wraps necessary information, take over > > > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > > > part of migration v2 is a waste of effort. > > > > > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > > > layer in migration v2 still has unresolved issues. It has > > > > inter-dependency with Remus / COLO. > > > > > > > > Most importantly it doesn't inherently solve the problem. It still > > > > requires the current libxl JSON blob to contain information about max > > > > pages (or information used to derive max pages). > > > > > > > > Andrew, correct me if I'm wrong. > > > > > > > > 4. Add a none user configurable field in current libxl JSON structure to > > > > record max pages information. > > > > > > > > This is not desirable. All fields in libxl JSON should be user > > > > configurable. > > > > > > > > 5. Add a user configurable field in current libxl JSON structure to > > > > record how much more memory this domain needs. Admin is required to > > > > fill in that value manually. In the mean time we revert the change in > > > > QEMU and declare QEMU with that change buggy. > > > > > > QEMU 2.3.0 was released with that change in it, so it is not quite > > > possible to revert it. Also I think it is the right change for QEMU. > > > > > > > It has security implications. Here is my reply copied from my mail to > > Ian: > > > > I'm considering removing xc_domain_setmaxmem needs regardless of this > > bug because that's going to cause problem in QEMU upstream stubdom with > > strict XSM policy and deprivileged QEMU (may not have privilege to call > > setmaxmem). > > QEMU running in the stubdom should be able to set the maxmem for its > target domain, but not for the others. > Right, but this is still not safe. I will explain below. > > > The security implication as it is now is big enough. One malicious guest > > that controls QEMU has a vector to DoS hypervisor by setting its own > > max_pages to -1; > > Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen > side? Could you please give me a bit more info on this security issue? > Consider now we have a malicious guest. It has gained control of QEMU. It then calls xc_domain_setmaxmem to set it's own max_pages inside hypervisor to be -1 and start calling populate_physmap inside the guest kernel. This is going to make hypervisor OOM. XEN_DOMCTL_max_mem only sets the limit of domain at the moment. I don't think there is sensible way of distinguishing a valid setmaxmem call from a malicious setmaxmem call from QEMU. It's untrusted code base after all. > > > > > > > > No response to this so far. But in fact I consider this the most viable > > > > solution. > > > > > > > > It's a simple enough solution that is achievable within 4.6 time frame. > > > > It doesn't prevent us from doing useful work in the future > > > > (disaggregated architecture with stricter security policy). It provides > > > > a way to work around buggy QEMU (admin sets that value to prevent QEMU > > > > from bumping memory limit). It's orgthogonal to migration v2 which means > > > > it won't be blocked by migration v2 or block migration v2. > > > > > > > > I tend to go with solution 5. Speak up if you don't agree with my > > > > analysis or you think I miss some aspects. > > > > > > I disagree > > > > > > > > > > For long term we need to: > > > > > > > > 1. Establish libxl as the arbitrator how much pages a domain can have. > > > > Anything else doing stuff behind arbitrator's back is considered > > > > buggy. This principle probably apply to other aspects of a domain as > > > > well. > > > > > > I disagree that libxl should be the arbitrator of a property that is > > > stored, maintained and enforced by Xen. Xen should be the arbitrator. > > > > > > > But it would be a burden for Xen once the arbitration goes beyond the > > issue we discuss here. It certainly doesn't have as much as information > > libxl has. In the long run we would still end up having libxl doing most > > of the work so we might as well establish the principle now. > > I would like to see some concrete examples of how libxl is adding value > in this context instead of just adding layers of indirection. > No, not layer of indirection. The fact that initial max_pages being calculated in libxl suggests that libxl already has that responsibility to arbitrate that value. Wei. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 13:10 ` Wei Liu @ 2015-06-08 13:27 ` Stefano Stabellini 2015-06-08 13:32 ` Wei Liu 0 siblings, 1 reply; 38+ messages in thread From: Stefano Stabellini @ 2015-06-08 13:27 UTC (permalink / raw) To: Wei Liu Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, 8 Jun 2015, Wei Liu wrote: > On Mon, Jun 08, 2015 at 12:39:52PM +0100, Stefano Stabellini wrote: > [...] > > > > > 3. Add a libxl layer that wraps necessary information, take over > > > > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > > > > part of migration v2 is a waste of effort. > > > > > > > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > > > > layer in migration v2 still has unresolved issues. It has > > > > > inter-dependency with Remus / COLO. > > > > > > > > > > Most importantly it doesn't inherently solve the problem. It still > > > > > requires the current libxl JSON blob to contain information about max > > > > > pages (or information used to derive max pages). > > > > > > > > > > Andrew, correct me if I'm wrong. > > > > > > > > > > 4. Add a none user configurable field in current libxl JSON structure to > > > > > record max pages information. > > > > > > > > > > This is not desirable. All fields in libxl JSON should be user > > > > > configurable. > > > > > > > > > > 5. Add a user configurable field in current libxl JSON structure to > > > > > record how much more memory this domain needs. Admin is required to > > > > > fill in that value manually. In the mean time we revert the change in > > > > > QEMU and declare QEMU with that change buggy. > > > > > > > > QEMU 2.3.0 was released with that change in it, so it is not quite > > > > possible to revert it. Also I think it is the right change for QEMU. > > > > > > > > > > It has security implications. Here is my reply copied from my mail to > > > Ian: > > > > > > I'm considering removing xc_domain_setmaxmem needs regardless of this > > > bug because that's going to cause problem in QEMU upstream stubdom with > > > strict XSM policy and deprivileged QEMU (may not have privilege to call > > > setmaxmem). > > > > QEMU running in the stubdom should be able to set the maxmem for its > > target domain, but not for the others. > > > > Right, but this is still not safe. I will explain below. > > > > > > The security implication as it is now is big enough. One malicious guest > > > that controls QEMU has a vector to DoS hypervisor by setting its own > > > max_pages to -1; > > > > Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen > > side? Could you please give me a bit more info on this security issue? > > > > Consider now we have a malicious guest. It has gained control of QEMU. > It then calls xc_domain_setmaxmem to set it's own max_pages inside > hypervisor to be -1 and start calling populate_physmap inside the guest > kernel. This is going to make hypervisor OOM. > > XEN_DOMCTL_max_mem only sets the limit of domain at the moment. I don't > think there is sensible way of distinguishing a valid setmaxmem call > from a malicious setmaxmem call from QEMU. It's untrusted code base > after all. Valid maxmem happens before device-model/$DOMID/state is set to running. > > > > > > > > > > > No response to this so far. But in fact I consider this the most viable > > > > > solution. > > > > > > > > > > It's a simple enough solution that is achievable within 4.6 time frame. > > > > > It doesn't prevent us from doing useful work in the future > > > > > (disaggregated architecture with stricter security policy). It provides > > > > > a way to work around buggy QEMU (admin sets that value to prevent QEMU > > > > > from bumping memory limit). It's orgthogonal to migration v2 which means > > > > > it won't be blocked by migration v2 or block migration v2. > > > > > > > > > > I tend to go with solution 5. Speak up if you don't agree with my > > > > > analysis or you think I miss some aspects. > > > > > > > > I disagree > > > > > > > > > > > > > For long term we need to: > > > > > > > > > > 1. Establish libxl as the arbitrator how much pages a domain can have. > > > > > Anything else doing stuff behind arbitrator's back is considered > > > > > buggy. This principle probably apply to other aspects of a domain as > > > > > well. > > > > > > > > I disagree that libxl should be the arbitrator of a property that is > > > > stored, maintained and enforced by Xen. Xen should be the arbitrator. > > > > > > > > > > But it would be a burden for Xen once the arbitration goes beyond the > > > issue we discuss here. It certainly doesn't have as much as information > > > libxl has. In the long run we would still end up having libxl doing most > > > of the work so we might as well establish the principle now. > > > > I would like to see some concrete examples of how libxl is adding value > > in this context instead of just adding layers of indirection. > > > > No, not layer of indirection. The fact that initial max_pages being > calculated in libxl suggests that libxl already has that responsibility > to arbitrate that value. > > Wei. > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 13:27 ` Stefano Stabellini @ 2015-06-08 13:32 ` Wei Liu 2015-06-08 13:38 ` Stefano Stabellini 0 siblings, 1 reply; 38+ messages in thread From: Wei Liu @ 2015-06-08 13:32 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, Jun 08, 2015 at 02:27:11PM +0100, Stefano Stabellini wrote: > On Mon, 8 Jun 2015, Wei Liu wrote: > > On Mon, Jun 08, 2015 at 12:39:52PM +0100, Stefano Stabellini wrote: > > [...] > > > > > > 3. Add a libxl layer that wraps necessary information, take over > > > > > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > > > > > part of migration v2 is a waste of effort. > > > > > > > > > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > > > > > layer in migration v2 still has unresolved issues. It has > > > > > > inter-dependency with Remus / COLO. > > > > > > > > > > > > Most importantly it doesn't inherently solve the problem. It still > > > > > > requires the current libxl JSON blob to contain information about max > > > > > > pages (or information used to derive max pages). > > > > > > > > > > > > Andrew, correct me if I'm wrong. > > > > > > > > > > > > 4. Add a none user configurable field in current libxl JSON structure to > > > > > > record max pages information. > > > > > > > > > > > > This is not desirable. All fields in libxl JSON should be user > > > > > > configurable. > > > > > > > > > > > > 5. Add a user configurable field in current libxl JSON structure to > > > > > > record how much more memory this domain needs. Admin is required to > > > > > > fill in that value manually. In the mean time we revert the change in > > > > > > QEMU and declare QEMU with that change buggy. > > > > > > > > > > QEMU 2.3.0 was released with that change in it, so it is not quite > > > > > possible to revert it. Also I think it is the right change for QEMU. > > > > > > > > > > > > > It has security implications. Here is my reply copied from my mail to > > > > Ian: > > > > > > > > I'm considering removing xc_domain_setmaxmem needs regardless of this > > > > bug because that's going to cause problem in QEMU upstream stubdom with > > > > strict XSM policy and deprivileged QEMU (may not have privilege to call > > > > setmaxmem). > > > > > > QEMU running in the stubdom should be able to set the maxmem for its > > > target domain, but not for the others. > > > > > > > Right, but this is still not safe. I will explain below. > > > > > > > > > The security implication as it is now is big enough. One malicious guest > > > > that controls QEMU has a vector to DoS hypervisor by setting its own > > > > max_pages to -1; > > > > > > Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen > > > side? Could you please give me a bit more info on this security issue? > > > > > > > Consider now we have a malicious guest. It has gained control of QEMU. > > It then calls xc_domain_setmaxmem to set it's own max_pages inside > > hypervisor to be -1 and start calling populate_physmap inside the guest > > kernel. This is going to make hypervisor OOM. > > > > XEN_DOMCTL_max_mem only sets the limit of domain at the moment. I don't > > think there is sensible way of distinguishing a valid setmaxmem call > > from a malicious setmaxmem call from QEMU. It's untrusted code base > > after all. > > Valid maxmem happens before device-model/$DOMID/state is set to running. > But Xen wouldn't know about that. It couldn't read xenstore. Also device-mode/$domid/state is writable by QEMU so we can't trust the content as indicator either. Wei. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 13:32 ` Wei Liu @ 2015-06-08 13:38 ` Stefano Stabellini 2015-06-08 13:44 ` Andrew Cooper 0 siblings, 1 reply; 38+ messages in thread From: Stefano Stabellini @ 2015-06-08 13:38 UTC (permalink / raw) To: Wei Liu Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, 8 Jun 2015, Wei Liu wrote: > On Mon, Jun 08, 2015 at 02:27:11PM +0100, Stefano Stabellini wrote: > > On Mon, 8 Jun 2015, Wei Liu wrote: > > > On Mon, Jun 08, 2015 at 12:39:52PM +0100, Stefano Stabellini wrote: > > > [...] > > > > > > > 3. Add a libxl layer that wraps necessary information, take over > > > > > > > Andrew's work on libxl migration v2. Having a libxl layer that's not > > > > > > > part of migration v2 is a waste of effort. > > > > > > > > > > > > > > There are several obstacles for libxl migration v2 at the moment. Libxl > > > > > > > layer in migration v2 still has unresolved issues. It has > > > > > > > inter-dependency with Remus / COLO. > > > > > > > > > > > > > > Most importantly it doesn't inherently solve the problem. It still > > > > > > > requires the current libxl JSON blob to contain information about max > > > > > > > pages (or information used to derive max pages). > > > > > > > > > > > > > > Andrew, correct me if I'm wrong. > > > > > > > > > > > > > > 4. Add a none user configurable field in current libxl JSON structure to > > > > > > > record max pages information. > > > > > > > > > > > > > > This is not desirable. All fields in libxl JSON should be user > > > > > > > configurable. > > > > > > > > > > > > > > 5. Add a user configurable field in current libxl JSON structure to > > > > > > > record how much more memory this domain needs. Admin is required to > > > > > > > fill in that value manually. In the mean time we revert the change in > > > > > > > QEMU and declare QEMU with that change buggy. > > > > > > > > > > > > QEMU 2.3.0 was released with that change in it, so it is not quite > > > > > > possible to revert it. Also I think it is the right change for QEMU. > > > > > > > > > > > > > > > > It has security implications. Here is my reply copied from my mail to > > > > > Ian: > > > > > > > > > > I'm considering removing xc_domain_setmaxmem needs regardless of this > > > > > bug because that's going to cause problem in QEMU upstream stubdom with > > > > > strict XSM policy and deprivileged QEMU (may not have privilege to call > > > > > setmaxmem). > > > > > > > > QEMU running in the stubdom should be able to set the maxmem for its > > > > target domain, but not for the others. > > > > > > > > > > Right, but this is still not safe. I will explain below. > > > > > > > > > > > > The security implication as it is now is big enough. One malicious guest > > > > > that controls QEMU has a vector to DoS hypervisor by setting its own > > > > > max_pages to -1; > > > > > > > > Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen > > > > side? Could you please give me a bit more info on this security issue? > > > > > > > > > > Consider now we have a malicious guest. It has gained control of QEMU. > > > It then calls xc_domain_setmaxmem to set it's own max_pages inside > > > hypervisor to be -1 and start calling populate_physmap inside the guest > > > kernel. This is going to make hypervisor OOM. > > > > > > XEN_DOMCTL_max_mem only sets the limit of domain at the moment. I don't > > > think there is sensible way of distinguishing a valid setmaxmem call > > > from a malicious setmaxmem call from QEMU. It's untrusted code base > > > after all. > > > > Valid maxmem happens before device-model/$DOMID/state is set to running. > > > > But Xen wouldn't know about that. It couldn't read xenstore. The way I would do it would be to drop the max_mem write priviledge from the stubdom at the time QEMU writes to device-mode/$domid/state, in a similar way to QEMU in dom0 dropping from root to unprivileged uid. Alternatively, as libxl is watching for changes to device-model/$DOMID/state, libxl could be the one that takes maxmem writing privileges away from the stubdom at that time. > Also device-mode/$domid/state is writable by QEMU so we can't trust > the content as indicator either. We can because the write happens before we unpause the guest ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 13:38 ` Stefano Stabellini @ 2015-06-08 13:44 ` Andrew Cooper 2015-06-08 13:45 ` Stefano Stabellini 0 siblings, 1 reply; 38+ messages in thread From: Andrew Cooper @ 2015-06-08 13:44 UTC (permalink / raw) To: Stefano Stabellini, Wei Liu Cc: George Dunlap, Ian Jackson, Ian Campbell, dslutz, xen-devel On 08/06/15 14:38, Stefano Stabellini wrote: >> Also device-mode/$domid/state is writable by QEMU so we can't trust >> > the content as indicator either. > We can because the write happens before we unpause the guest Only when creating the domain fresh. On resume, the guest has possibly had the chance to code-inject via the qemu save format. There are many CVEs in this area, and I am not willing to be all of them are fixed. In XenServer, even loading VM state from the save file happens in the deprivilelged environment. ~Andrew ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 13:44 ` Andrew Cooper @ 2015-06-08 13:45 ` Stefano Stabellini 0 siblings, 0 replies; 38+ messages in thread From: Stefano Stabellini @ 2015-06-08 13:45 UTC (permalink / raw) To: Andrew Cooper Cc: Wei Liu, Ian Campbell, Stefano Stabellini, George Dunlap, Ian Jackson, dslutz, xen-devel On Mon, 8 Jun 2015, Andrew Cooper wrote: > On 08/06/15 14:38, Stefano Stabellini wrote: > >> Also device-mode/$domid/state is writable by QEMU so we can't trust > >> > the content as indicator either. > > We can because the write happens before we unpause the guest > > Only when creating the domain fresh. On resume, the guest has possibly > had the chance to code-inject via the qemu save format. There are many > CVEs in this area, and I am not willing to be all of them are fixed. > > In XenServer, even loading VM state from the save file happens in the > deprivilelged environment. QEMU doesn't do any maxmem changes at restore time. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 17:10 ` Stefano Stabellini 2015-06-05 18:10 ` Wei Liu @ 2015-06-05 18:49 ` Ian Campbell 2015-06-08 11:40 ` Stefano Stabellini 2015-06-08 14:14 ` George Dunlap 2 siblings, 1 reply; 38+ messages in thread From: Ian Campbell @ 2015-06-05 18:49 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, 2015-06-05 at 18:10 +0100, Stefano Stabellini wrote: > On Fri, 5 Jun 2015, Wei Liu wrote: > > Hi all > > > > This bug is now considered a blocker for 4.6 release. > > > > The premises of the problem remain the same (George's translated > > version): > > > > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at > > the moment it calls set_max_mem() to increase max_pages so that it can > > allocate more pages to the guest. libxl doesn't know what max_pages a > > domain needs prior to qemu start-up. > > > > 2. Libxl doesn't know max_pages even after qemu start-up, because there > > is no mechanism to communicate between qemu and libxl. > > I might not know what is the right design for the overall solution, but > I do know that libxl shouldn't have its own state tracking for > max_pages, because max_pages is kept, maintained and enforced by Xen. > > Ian might still remember, but at the beginning of the xl/libxl project, > we had few simple design principles. One of which was that we should not > have two places where we keep track of the same thing. If Xen keeps > track of something, libxl should avoid it. This isn't about libxl tracking something duplicating information in Xen. It is about who gets to choose what that value is, which is not the same as who stores that value. So this is about libxl being the owner of what the current maxmem value is. It can so this by using setmaxmem and getmaxmem to set and retrieve the value with no state in libxl. > I disagree that libxl should be the arbitrator of a property that is > stored, maintained and enforced by Xen. Xen should be the arbitrator. That's not what "arbitrator" means, an arbitrator decides what the value should be, but that doesn't necessarily imply that it either stores, maintains nor enforces that value. > Even if QEMU called into libxl to change maxmem, I don't think that > libxl should store maxmem anywhere. It is already stored in Xen. I don't think anyone suggested otherwise, did they? What locking is there around QEMU's read/modify/write of the maxmem value? What happens if someone else also modifies maxmem at the same time? Ian. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 18:49 ` Ian Campbell @ 2015-06-08 11:40 ` Stefano Stabellini 2015-06-08 12:11 ` Ian Campbell 0 siblings, 1 reply; 38+ messages in thread From: Stefano Stabellini @ 2015-06-08 11:40 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Fri, 5 Jun 2015, Ian Campbell wrote: > On Fri, 2015-06-05 at 18:10 +0100, Stefano Stabellini wrote: > > On Fri, 5 Jun 2015, Wei Liu wrote: > > > Hi all > > > > > > This bug is now considered a blocker for 4.6 release. > > > > > > The premises of the problem remain the same (George's translated > > > version): > > > > > > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at > > > the moment it calls set_max_mem() to increase max_pages so that it can > > > allocate more pages to the guest. libxl doesn't know what max_pages a > > > domain needs prior to qemu start-up. > > > > > > 2. Libxl doesn't know max_pages even after qemu start-up, because there > > > is no mechanism to communicate between qemu and libxl. > > > > I might not know what is the right design for the overall solution, but > > I do know that libxl shouldn't have its own state tracking for > > max_pages, because max_pages is kept, maintained and enforced by Xen. > > > > Ian might still remember, but at the beginning of the xl/libxl project, > > we had few simple design principles. One of which was that we should not > > have two places where we keep track of the same thing. If Xen keeps > > track of something, libxl should avoid it. > > This isn't about libxl tracking something duplicating information in > Xen. It is about who gets to choose what that value is, which is not the > same as who stores that value. > > So this is about libxl being the owner of what the current maxmem value > is. It can so this by using setmaxmem and getmaxmem to set and retrieve > the value with no state in libxl. > > > I disagree that libxl should be the arbitrator of a property that is > > stored, maintained and enforced by Xen. Xen should be the arbitrator. > > That's not what "arbitrator" means, an arbitrator decides what the value > should be, but that doesn't necessarily imply that it either stores, > maintains nor enforces that value. The way you describe it, it looks like some kind of host wide memory policy manager, that also doesn't belong to xl/libxl, the same way as other memory management tools were never accepted into xl/libxl but kept to separate daemons, like xenballoond or squeezed. Let's step away from this specific issue for a second. If it is not an host way policy manager but a per-VM layer on top of libxc, what value is this indirection actually adding? > > Even if QEMU called into libxl to change maxmem, I don't think that > > libxl should store maxmem anywhere. It is already stored in Xen. > > I don't think anyone suggested otherwise, did they? > > What locking is there around QEMU's read/modify/write of the maxmem > value? What happens if someone else also modifies maxmem at the same > time? It only happens at start of day before the domain is unpaused. At the time I couldn't come up with a scenario where it would be an issue, unless the admin is purposely trying to shut himself in the foot. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 11:40 ` Stefano Stabellini @ 2015-06-08 12:11 ` Ian Campbell 2015-06-08 13:22 ` Stefano Stabellini 2015-06-08 14:53 ` Konrad Rzeszutek Wilk 0 siblings, 2 replies; 38+ messages in thread From: Ian Campbell @ 2015-06-08 12:11 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote: > > > I disagree that libxl should be the arbitrator of a property that is > > > stored, maintained and enforced by Xen. Xen should be the arbitrator. > > > > That's not what "arbitrator" means, an arbitrator decides what the value > > should be, but that doesn't necessarily imply that it either stores, > > maintains nor enforces that value. > > The way you describe it, it looks like some kind of host wide memory > policy manager, Not necessarily. We need to make a distinction between the entity which enforces maxpages, that (or those) which decide on what maxpages should be and tells Xen. Xen is the enforcer. It is also therefore by necessity the holder of the canonical value for the maxpages, since it has to be. Xen cannot be the second, it simply doesn't have the necessary information to make a decision about a domain's memory limit, it just blindly does whatever it is told by the last thing to tell it something. There are lots of ways the desired value of maxpages can be decided on, given a bunch of entities all with some level of responsibility for the maxpages of a given domain: 1. One single entity which decides on the limit and tells Xen. I think this is the xapi model (not squeezed/balloond, they just ask xapi to do things AFAIK, so are a red-herring). The unwary might think "libxl" was in this category, but it isn't since it is in fact multiple instances of the library. 2. A group of entities which cooperate as a distributed system such that any one of them can recalculate (from from the shared state, or by intra entity communication) the current desired value of max pages and propagate it to Xen. Prior to QEMU starting to play with the max pages the multiple libxl's on a system fell into this case, coordinating via xenstore transactions and libxl userdata. 3. A group of entities which operate in isolation by only ever increasing or descreasing the max pages according to their own requirements, without reference to anyone else. When QEMU entered the fray, and with the various libxl fixes since, you might think we are implementing this model, but we aren't because the hypervisor interface is only "set", not "increment/decrement" and so there is a racy read/modify/write cycle in every entity now. 4. Xen exposes multiple "maxpages" variables corresponding to the various entities which have a need to control this (or multiple per entity). Xen adds all of those up to come up with the actual maxpages. 5. I thought there wasw one more but I've forgotten what it was. We used to have #2, we now have a broken version of #3 (broken because it is racy due to the xen interface used). Fixing #2 would be a case of adding a qmp command to allow libxl to ask qmp what its extra memory needs are at any given point (start or day, post hotplug event etc, or even of every memset). Fixing #3 would be a case of adding a new hypercall interface to allow the amount of memory to be changed by a delta instead of just set and having everyone use it. You'd need to decide on migration whether to propagate the current at stop and copy time and having everyone migrate their requirements vs having everyone reregister their requirements on the far end. #4 is a new hypercall interface, which everyone would have to use. I happen to think this is a pretty silly design. > the same way as other memory management tools were never accepted into > xl/libxl but kept to separate daemons, like xenballoond or squeezed. I think squeezed and friends are a red-herring here, since they are either a client of one or more entities or in the more distributed ones are just one party among many. > Let's step away from this specific issue for a second. If it is not an > host way policy manager but a per-VM layer on top of libxc, what value > is this indirection actually adding? Given the raciness of libxc -- it is adding correctness. > > What locking is there around QEMU's read/modify/write of the maxmem > > value? What happens if someone else also modifies maxmem at the > > same time? > > It only happens at start of day before the domain is unpaused. At the > time I couldn't come up with a scenario where it would be an issue, > unless the admin is purposely trying to shut himself in the foot. I don't think there is anything preventing a call to the toolstack to set the amount of memory for a domain right after the domid exists. In a system with something like balloond or squeezed I can well imagine them doing something when a new domain was started resulting in opening of this race condition. (even if something is "ask libxl" not "call xc_setmaxpages"). Ian. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 12:11 ` Ian Campbell @ 2015-06-08 13:22 ` Stefano Stabellini 2015-06-08 13:52 ` Ian Campbell 2015-06-08 14:20 ` George Dunlap 2015-06-08 14:53 ` Konrad Rzeszutek Wilk 1 sibling, 2 replies; 38+ messages in thread From: Stefano Stabellini @ 2015-06-08 13:22 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, 8 Jun 2015, Ian Campbell wrote: > On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote: > > > > > I disagree that libxl should be the arbitrator of a property that is > > > > stored, maintained and enforced by Xen. Xen should be the arbitrator. > > > > > > That's not what "arbitrator" means, an arbitrator decides what the value > > > should be, but that doesn't necessarily imply that it either stores, > > > maintains nor enforces that value. > > > > The way you describe it, it looks like some kind of host wide memory > > policy manager, > > Not necessarily. > > We need to make a distinction between the entity which enforces > maxpages, that (or those) which decide on what maxpages should be and > tells Xen. > > Xen is the enforcer. It is also therefore by necessity the holder of the > canonical value for the maxpages, since it has to be. > > Xen cannot be the second, it simply doesn't have the necessary > information to make a decision about a domain's memory limit, it just > blindly does whatever it is told by the last thing to tell it something. > > There are lots of ways the desired value of maxpages can be decided on, > given a bunch of entities all with some level of responsibility for the > maxpages of a given domain: > > 1. One single entity which decides on the limit and tells Xen. I > think this is the xapi model (not squeezed/balloond, they just > ask xapi to do things AFAIK, so are a red-herring). The unwary > might think "libxl" was in this category, but it isn't since it > is in fact multiple instances of the library. > 2. A group of entities which cooperate as a distributed system such > that any one of them can recalculate (from from the shared > state, or by intra entity communication) the current desired > value of max pages and propagate it to Xen. Prior to QEMU > starting to play with the max pages the multiple libxl's on a > system fell into this case, coordinating via xenstore > transactions and libxl userdata. The current libxl/xl code doesn't use xenstore or userdata to coordinate maxmem values. Xenstore is used to store the memory "target" for the in-guest balloon driver. Userdata is used to store a few QEMU properties. Maxmem is set at runtime by libxl is two cases: - when the user calls xl mem-max xl mem-max -> libxl_domain_setmaxmem -> xc_domain_setmaxmem libxl_domain_setmaxmem does a xenstore read to check the target value, not the maxmem value. It doesn't allow the user to lower maxmem below the current target value. Arguably this doesn't make too much sense. - when libxl_set_memory_target is called with enforce = 1, that happens when the user executes xl mem-set There is no coordination as far as I can tell. Libxl is just exposing xc_domain_setmaxmem straight to the user. There is no locking, no synchronization, nothing. Maybe this is where our different views come from: I am not seeing libxl doing any arbitrating at all, at least today. > 3. A group of entities which operate in isolation by only ever > increasing or descreasing the max pages according to their own > requirements, without reference to anyone else. When QEMU > entered the fray, and with the various libxl fixes since, you > might think we are implementing this model, but we aren't > because the hypervisor interface is only "set", not > "increment/decrement" and so there is a racy read/modify/write > cycle in every entity now. I don't think this is true: QEMU only sets maxmem at domain creation, before "xl create" even returns. I think that is safe. We had an email exchange at the time, I explained this behaviour and the general opinon was that it is acceptable. I don't understand why it is not anymore. > 4. Xen exposes multiple "maxpages" variables corresponding to the > various entities which have a need to control this (or multiple > per entity). Xen adds all of those up to come up with the actual > maxpages. > 5. I thought there wasw one more but I've forgotten what it was. > > We used to have #2, we now have a broken version of #3 (broken because > it is racy due to the xen interface used). I don't think we had #2, we always had a working version of #3 and we still have, except for migration being broken. I think libxc just needs to properly save/restore maxmem. I understand that given the current interfaces it is difficult to do, but rethinking the world because we cannot save/restore maxmem in a simple manner doesn't seem the right answer to me. > Fixing #2 would be a case of adding a qmp command to allow libxl to ask > qmp what its extra memory needs are at any given point (start or day, > post hotplug event etc, or even of every memset). > > Fixing #3 would be a case of adding a new hypercall interface to allow > the amount of memory to be changed by a delta instead of just set and > having everyone use it. You'd need to decide on migration whether to > propagate the current at stop and copy time and having everyone migrate > their requirements vs having everyone reregister their requirements on > the far end. > > #4 is a new hypercall interface, which everyone would have to use. I > happen to think this is a pretty silly design. > > > the same way as other memory management tools were never accepted into > > xl/libxl but kept to separate daemons, like xenballoond or squeezed. > > I think squeezed and friends are a red-herring here, since they are > either a client of one or more entities or in the more distributed ones > are just one party among many. > > > Let's step away from this specific issue for a second. If it is not an > > host way policy manager but a per-VM layer on top of libxc, what value > > is this indirection actually adding? > > Given the raciness of libxc -- it is adding correctness. > > > > What locking is there around QEMU's read/modify/write of the maxmem > > > value? What happens if someone else also modifies maxmem at the > > > same time? > > > > It only happens at start of day before the domain is unpaused. At the > > time I couldn't come up with a scenario where it would be an issue, > > unless the admin is purposely trying to shut himself in the foot. > > I don't think there is anything preventing a call to the toolstack to > set the amount of memory for a domain right after the domid exists. > > In a system with something like balloond or squeezed I can well imagine > them doing something when a new domain was started resulting in opening > of this race condition. (even if something is "ask libxl" not "call > xc_setmaxpages"). > > Ian. > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 13:22 ` Stefano Stabellini @ 2015-06-08 13:52 ` Ian Campbell 2015-06-08 14:20 ` George Dunlap 1 sibling, 0 replies; 38+ messages in thread From: Ian Campbell @ 2015-06-08 13:52 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, 2015-06-08 at 14:22 +0100, Stefano Stabellini wrote: > On Mon, 8 Jun 2015, Ian Campbell wrote: > > On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote: > > > > > > > I disagree that libxl should be the arbitrator of a property that is > > > > > stored, maintained and enforced by Xen. Xen should be the arbitrator. > > > > > > > > That's not what "arbitrator" means, an arbitrator decides what the value > > > > should be, but that doesn't necessarily imply that it either stores, > > > > maintains nor enforces that value. > > > > > > The way you describe it, it looks like some kind of host wide memory > > > policy manager, > > > > Not necessarily. > > > > We need to make a distinction between the entity which enforces > > maxpages, that (or those) which decide on what maxpages should be and > > tells Xen. > > > > Xen is the enforcer. It is also therefore by necessity the holder of the > > canonical value for the maxpages, since it has to be. > > > > Xen cannot be the second, it simply doesn't have the necessary > > information to make a decision about a domain's memory limit, it just > > blindly does whatever it is told by the last thing to tell it something. > > > > There are lots of ways the desired value of maxpages can be decided on, > > given a bunch of entities all with some level of responsibility for the > > maxpages of a given domain: > > > > 1. One single entity which decides on the limit and tells Xen. I > > think this is the xapi model (not squeezed/balloond, they just > > ask xapi to do things AFAIK, so are a red-herring). The unwary > > might think "libxl" was in this category, but it isn't since it > > is in fact multiple instances of the library. > > 2. A group of entities which cooperate as a distributed system such > > that any one of them can recalculate (from from the shared > > state, or by intra entity communication) the current desired > > value of max pages and propagate it to Xen. Prior to QEMU > > starting to play with the max pages the multiple libxl's on a > > system fell into this case, coordinating via xenstore > > transactions and libxl userdata. > > The current libxl/xl code doesn't use xenstore or userdata to coordinate > maxmem values. Xenstore is used to store the memory "target" for the > in-guest balloon driver. Userdata is used to store a few QEMU > properties. > > Maxmem is set at runtime by libxl is two cases: > > - when the user calls xl mem-max > xl mem-max -> libxl_domain_setmaxmem -> xc_domain_setmaxmem > > libxl_domain_setmaxmem does a xenstore read to check the target value, > not the maxmem value. It doesn't allow the user to lower maxmem below > the current target value. Arguably this doesn't make too much sense. > > - when libxl_set_memory_target is called with enforce = 1, that happens > when the user executes xl mem-set > > > There is no coordination as far as I can tell. Libxl is just exposing > xc_domain_setmaxmem straight to the user. There is no locking, no > synchronization, nothing. Maybe this is where our different views come > from: I am not seeing libxl doing any arbitrating at all, at least today. I think this is consistent with what I describe as #2, except that as it happens there is no need for communication because other than the new value for the "user requested max memory size" all the other factors are hardcoded constants (LIBXL_MAXMEM_CONSTANT and friends), so there wasn't actually any need to communicate in order to be able to recalculate the new correct value for max pages. If for example we supported multiple versions of libxl in parallel and they might have different constants then this would be broken. Luckily we don't support that. > > 3. A group of entities which operate in isolation by only ever > > increasing or descreasing the max pages according to their own > > requirements, without reference to anyone else. When QEMU > > entered the fray, and with the various libxl fixes since, you > > might think we are implementing this model, but we aren't > > because the hypervisor interface is only "set", not > > "increment/decrement" and so there is a racy read/modify/write > > cycle in every entity now. > > I don't think this is true: QEMU only sets maxmem at domain creation, > before "xl create" even returns. I think that is safe. We had an email > exchange at the time, I explained this behaviour and the general opinon > was that it is acceptable. I don't understand why it is not anymore. At the time we didn't appreciate the race inherent in this, I think. Or else the argument is complex enough that it is going to need explaining again (and again). In which case it really ought to be clearly written down somewhere. The third option is that we ended up here by a series of plausible looking changes (the frog in boiling water effect). Anyway, without locking the current scheme is not #3 as defined above, due to the R/M/W race, it is just something a bit like #3 (but broken). Maybe there is an argument around the locking in libxl being sufficient at start of day, I'm not sure, it certainly doesn't look like libxl_set_memory_target(enforce=1) does anything to interlock against domain creation, so if it were called half way I'm not sure it would do the right thing. IOW. I think libxl_set_memory_target is now racy against parallel invocations vs libxl_domain_setmaxmem or during creation. I think/hope the xs transaction makes libxl_set_memory_target safe against itself, but I'm not sure if the transaction fails and we go around again we might use a stale ptr.max_memkb (because the lost transaction might have been another libxl_set_memory_target). > > 4. Xen exposes multiple "maxpages" variables corresponding to the > > various entities which have a need to control this (or multiple > > per entity). Xen adds all of those up to come up with the actual > > maxpages. > > 5. I thought there wasw one more but I've forgotten what it was. > > > > We used to have #2, we now have a broken version of #3 (broken because > > it is racy due to the xen interface used). > > I don't think we had #2, we always had a working version of #3 and we > still have, except for migration being broken. Before QEMU got involved #2 and #3 were for most practical purposes the same, I suppose (given the trivial nop coordination in #2 described above). Adding QEMU strictly at start of day into the mix didn't inherently break that (if we assume the libxl creation lock sufficient, which I'm not sure of), it was the following changes to libxl to use a R/M/W for updates which is where things start to get much more fishy IMHO. > I think libxc just needs to properly save/restore maxmem. I understand > that given the current interfaces it is difficult to do, but rethinking > the world because we cannot save/restore maxmem in a simple manner > doesn't seem the right answer to me. I think save/restore is just what has caused us to notice that the current model has issues, you are correct that we could apply a band aid to (perhaps) fix save/restore with the current model. I don't think that changes the fact that the model has issues though. Ian. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 13:22 ` Stefano Stabellini 2015-06-08 13:52 ` Ian Campbell @ 2015-06-08 14:20 ` George Dunlap 2015-06-08 15:01 ` Don Slutz 1 sibling, 1 reply; 38+ messages in thread From: George Dunlap @ 2015-06-08 14:20 UTC (permalink / raw) To: Stefano Stabellini, Ian Campbell Cc: Ian Jackson, Andrew Cooper, Wei Liu, dslutz, xen-devel On 06/08/2015 02:22 PM, Stefano Stabellini wrote: >> 3. A group of entities which operate in isolation by only ever >> increasing or descreasing the max pages according to their own >> requirements, without reference to anyone else. When QEMU >> entered the fray, and with the various libxl fixes since, you >> might think we are implementing this model, but we aren't >> because the hypervisor interface is only "set", not >> "increment/decrement" and so there is a racy read/modify/write >> cycle in every entity now. > > I don't think this is true: QEMU only sets maxmem at domain creation, > before "xl create" even returns. I think that is safe. We had an email > exchange at the time, I explained this behaviour and the general opinon > was that it is acceptable. I don't understand why it is not anymore. Well for one, nobody on the hypervisor side seems to have been brought in -- I definitely would have objected, and it sounds like AndyC would have objected too. I think we need to go back one level further. So first, let's make a distinction between *pages*, which are actual host RAM assigned to the guest and put in its p2m table, and *memory*, which is a virtualization construct (i.e., virtual RAM and video memory for virtual graphics cards). The hypervisor only cares about *pages*. It allocates pages to a domain, it puts them in the p2m. That's all it knows. The purpose of max_pages in the hypervisor is to make sure that no guest can allocate more host memory (pages) than it is allowed to have. How many pages is a particular guest allowed to have? Well pages are used for a number of purposes: * To implement virtual RAM in the guest * To implement video ram for virtual devices in qemu * To implement virtual ROMs * For magic "shared pages" used behind-the-scenes (not visible to the guest) (Feel free to add anything I missed.) max_pages in the hypervisor must be set to the sum of all the pages the domain is allowed to have. So the first point is this: Xen doesn't have a clue about any of those. It doesn't know how much virtual RAM a guest has, how much video RAM it has, how many virtual ROMs, how many magic shared pages, or anything. All it knows are what pages are in the p2m table. So although Xen certainly *enforces* max_pages, it is (at the moment) in no position to *decide* what max_pages should be. At the moment, in fact, nobody is. There is no single place that has a clear picture into how virtual RAM, guest devices, guest ROMs, and "magic pages" convert into actual number of pages. I think that's a bug. And at the moment, pages in the p2m are allocated by a number of entities: * In the libxc domain builder. * In the guest balloon driver * And now, in qemu, to allocate extra memory for virtual ROMs. Did I miss anything? For the first two, it's libxl that sets maxmem, based in its calculation of the size of virtual RAM plus various other bits that will be needed. Having qemu *also* set maxmem was always the wrong thing to do, IMHO. In theory, from the interface perspective, what libxl promises to provide is virtual RAM. When you say "memory=8192" in a domain config, that means (or should mean) 8192MiB of virtual RAM, exclusive of video RAM, virtual ROMs, and magic pages. Then when you say "xl mem-set 4096", it should again be aiming at giving the VM the equivalent of 4096MiB of virtual RAM, exclusive of video RAM, &c &c. We already have the problem that the balloon driver at the moment doesn't actually know how big the guest RAM is, nor , but is being told to make a balloon exactly big enough to bring the total RAM down to a specific target. I think we do need to have some place in the middle that actually knows how much memory is actually needed for the different sub-systems, so it can calculate and set maxmem appropriately. libxl is the obvious place. What about this: * Libxl has a maximum amount of RAM that qemu is *allowed* to use to set up virtual ROMs, video ram for virtual devices, &c * At start-of-day, it sets maxpages to PAGES(virtual RAM)+PAGES(magic) + max_qemu_pages * Qemu allocates as many pages as it needs for option ROMS, and writes the amount that it actually did use into a special node in xenstore. * When the domain is unpaused, libxl will set maxpages to PAGES(virtual RAM) + PAGES(magic) + actual_qemu_pages that it gets from xenstore. I think also that probably libxl, rather than setting a target amount of memory the balloon driver is supposed to aim at, should set the target size of the balloon. Once qemu tells it how many pages are actually being used for virtual devices, We could, in theory, expose all this information in xenstore such that *either* libxl or qemu would be able to calculate max_pages based on the numbers that were written there. And that would work if we could enforce a lock-step between the toolstack and qemu, as we can between Xen and the toolstack. But I think setting anything like this in stone is a really bad idea; which unfortulately excludes the idea of putting it in qemu. -George ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 14:20 ` George Dunlap @ 2015-06-08 15:01 ` Don Slutz 2015-06-08 15:37 ` George Dunlap 0 siblings, 1 reply; 38+ messages in thread From: Don Slutz @ 2015-06-08 15:01 UTC (permalink / raw) To: George Dunlap, Stefano Stabellini, Ian Campbell Cc: Ian Jackson, Andrew Cooper, Wei Liu, xen-devel@lists.xen.org On 06/08/15 10:20, George Dunlap wrote: > On 06/08/2015 02:22 PM, Stefano Stabellini wrote: >>> 3. A group of entities which operate in isolation by only ever >>> increasing or descreasing the max pages according to their own >>> requirements, without reference to anyone else. When QEMU >>> entered the fray, and with the various libxl fixes since, you >>> might think we are implementing this model, but we aren't >>> because the hypervisor interface is only "set", not >>> "increment/decrement" and so there is a racy read/modify/write >>> cycle in every entity now. >> >> I don't think this is true: QEMU only sets maxmem at domain creation, >> before "xl create" even returns. I think that is safe. We had an email >> exchange at the time, I explained this behaviour and the general opinon >> was that it is acceptable. I don't understand why it is not anymore. > > Well for one, nobody on the hypervisor side seems to have been brought > in -- I definitely would have objected, and it sounds like AndyC would > have objected too. > > I think we need to go back one level further. > > So first, let's make a distinction between *pages*, which are actual > host RAM assigned to the guest and put in its p2m table, and *memory*, > which is a virtualization construct (i.e., virtual RAM and video memory > for virtual graphics cards). > > The hypervisor only cares about *pages*. It allocates pages to a > domain, it puts them in the p2m. That's all it knows. > > The purpose of max_pages in the hypervisor is to make sure that no guest > can allocate more host memory (pages) than it is allowed to have. > > How many pages is a particular guest allowed to have? > > Well pages are used for a number of purposes: > * To implement virtual RAM in the guest > * To implement video ram for virtual devices in qemu > * To implement virtual ROMs > * For magic "shared pages" used behind-the-scenes (not visible to the > guest) > > (Feel free to add anything I missed.) > > max_pages in the hypervisor must be set to the sum of all the pages the > domain is allowed to have. > > So the first point is this: Xen doesn't have a clue about any of those. > It doesn't know how much virtual RAM a guest has, how much video RAM it > has, how many virtual ROMs, how many magic shared pages, or anything. > All it knows are what pages are in the p2m table. > > So although Xen certainly *enforces* max_pages, it is (at the moment) in > no position to *decide* what max_pages should be. > > At the moment, in fact, nobody is. There is no single place that has a > clear picture into how virtual RAM, guest devices, guest ROMs, and > "magic pages" convert into actual number of pages. I think that's a bug. > > And at the moment, pages in the p2m are allocated by a number of entities: > * In the libxc domain builder. > * In the guest balloon driver > * And now, in qemu, to allocate extra memory for virtual ROMs. This is not correct. QEMU and hvmloader both allocate pages for their use. LIBXL_MAXMEM_CONSTANT allows QEMU and hvmloader to allocate some pages. The QEMU change only comes into play after LIBXL_MAXMEM_CONSTANT has been reached. > > Did I miss anything? > > For the first two, it's libxl that sets maxmem, based in its calculation > of the size of virtual RAM plus various other bits that will be needed. > Having qemu *also* set maxmem was always the wrong thing to do, IMHO. > It does it for all 3 (4?) because it adds LIBXL_MAXMEM_CONSTANT. > In theory, from the interface perspective, what libxl promises to > provide is virtual RAM. When you say "memory=8192" in a domain config, > that means (or should mean) 8192MiB of virtual RAM, exclusive of video > RAM, virtual ROMs, and magic pages. Then when you say "xl mem-set > 4096", it should again be aiming at giving the VM the equivalent of > 4096MiB of virtual RAM, exclusive of video RAM, &c &c. Not what is currently done. virtual video RAM is subtracted from "memory=". > > We already have the problem that the balloon driver at the moment > doesn't actually know how big the guest RAM is, nor , but is being told > to make a balloon exactly big enough to bring the total RAM down to a > specific target. > > I think we do need to have some place in the middle that actually knows > how much memory is actually needed for the different sub-systems, so it > can calculate and set maxmem appropriately. libxl is the obvious place. Maybe. So you want libxl to know the detail of balloon overhead? How about the different sizes of all possible Option ROMs in all QEMU version? What about hvmloader usage of memory? > > What about this: > * Libxl has a maximum amount of RAM that qemu is *allowed* to use to set > up virtual ROMs, video ram for virtual devices, &c > * At start-of-day, it sets maxpages to PAGES(virtual RAM)+PAGES(magic) + > max_qemu_pages > * Qemu allocates as many pages as it needs for option ROMS, and writes > the amount that it actually did use into a special node in xenstore. > * When the domain is unpaused, libxl will set maxpages to PAGES(virtual > RAM) + PAGES(magic) + actual_qemu_pages that it gets from xenstore. > I think this does match What Wei Liu said: On 06/05/15 15:06, Wei Liu wrote:> On Fri, Jun 05, 2015 at 06:13:01PM +0100, Stefano Stabellini wrote: >> On Fri, 5 Jun 2015, Ian Campbell wrote: >>> On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote: ... >>>> 5. Add a user configurable field in current libxl JSON structure to >>>> record how much more memory this domain needs. Admin is required to >>>> fill in that value manually. In the mean time we revert the change in >>>> QEMU and declare QEMU with that change buggy. >>>> >>>> No response to this so far. But in fact I consider this the most viable >>>> solution. >>> >>> I initially thought that this was just #4 in a silly hat and was >>> therefore no more acceptable than that. >>> >>> But actually I think you are suggesting that users should have to >>> manually request additional RAM for option roms via some new interface >>> and that the old thing in qemu should be deprecated and removed? >>> >>> How would a user know what value to use here? Just "a bigger one till it >>> works"? That's, well, not super... >> >> This should not be a user configurable field. In fact it only depends on >> the QEMU version in use. > > That field is generic so that we can use it to add some extra pages to > domain. Using it to cover QEMU option roms would be one use case. It's > not very nice, but it's straight-forward. > > Wei. > I think also that probably libxl, rather than setting a target amount of > memory the balloon driver is supposed to aim at, should set the target > size of the balloon. Once qemu tells it how many pages are actually > being used for virtual devices, > > We could, in theory, expose all this information in xenstore such that > *either* libxl or qemu would be able to calculate max_pages based on the > numbers that were written there. And that would work if we could > enforce a lock-step between the toolstack and qemu, as we can between > Xen and the toolstack. But I think setting anything like this in stone > is a really bad idea; which unfortulately excludes the idea of putting > it in qemu. > -Don Slutz > -George > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 15:01 ` Don Slutz @ 2015-06-08 15:37 ` George Dunlap 2015-06-08 16:06 ` Don Slutz 2015-06-09 10:14 ` Stefano Stabellini 0 siblings, 2 replies; 38+ messages in thread From: George Dunlap @ 2015-06-08 15:37 UTC (permalink / raw) To: Don Slutz, Stefano Stabellini, Ian Campbell Cc: Ian Jackson, Andrew Cooper, Wei Liu, xen-devel@lists.xen.org On 06/08/2015 04:01 PM, Don Slutz wrote: > On 06/08/15 10:20, George Dunlap wrote: >> And at the moment, pages in the p2m are allocated by a number of entities: >> * In the libxc domain builder. >> * In the guest balloon driver >> * And now, in qemu, to allocate extra memory for virtual ROMs. > > This is not correct. QEMU and hvmloader both allocate pages for their > use. LIBXL_MAXMEM_CONSTANT allows QEMU and hvmloader to allocate some > pages. The QEMU change only comes into play after LIBXL_MAXMEM_CONSTANT > has been reached. Thanks -- so the correct statement here is (in time order): Pages in the p2m are allocated by a number of entities: * In the libxc domain builder * In qemu * In hvmloader * In the guest balloon driver >> For the first two, it's libxl that sets maxmem, based in its calculation >> of the size of virtual RAM plus various other bits that will be needed. >> Having qemu *also* set maxmem was always the wrong thing to do, IMHO. >> > > It does it for all 3 (4?) because it adds LIBXL_MAXMEM_CONSTANT. So the correct statement is: In the past, libxl has set maxmem for all of those, based on its calculation of virtual RAM plus various other bits that might be needed (including pages needed by qemu or hvmloader). The change as of qemu $WHATEVER is that now qemu also sets it if it finds that libxl didn't give it enough "slack". That was always the wrong thing to do, IMHO. >> In theory, from the interface perspective, what libxl promises to >> provide is virtual RAM. When you say "memory=8192" in a domain config, >> that means (or should mean) 8192MiB of virtual RAM, exclusive of video >> RAM, virtual ROMs, and magic pages. Then when you say "xl mem-set >> 4096", it should again be aiming at giving the VM the equivalent of >> 4096MiB of virtual RAM, exclusive of video RAM, &c &c. > > > Not what is currently done. virtual video RAM is subtracted from "memory=". Right. After I sent this, it occurred to me that there were two sensible interpretations of "memory=". The first is, "This is how much virtual RAM to give the guest. Please allocate non-RAM pages in addition to this." The second is, "This is the total amount of host RAM I want the guest to use. Please take non-RAM pages from this total amount." In reality we apparently do neither of these. :-) I think both break the "principle of least surprise" in different ways, but I suspect that admins on the whole would rather have the second interpretation, as I think it makes their lives a bit easier. >> We already have the problem that the balloon driver at the moment >> doesn't actually know how big the guest RAM is, nor , but is being told >> to make a balloon exactly big enough to bring the total RAM down to a >> specific target. >> >> I think we do need to have some place in the middle that actually knows >> how much memory is actually needed for the different sub-systems, so it >> can calculate and set maxmem appropriately. libxl is the obvious place. > > Maybe. So you want libxl to know the detail of balloon overhead? How > about the different sizes of all possible Option ROMs in all QEMU > version? What about hvmloader usage of memory? I'm not sure what you mean by "balloon overhead", but if you mean "guest pages wasted keeping track of pages which have been ballooned out", then no, that's not what I mean. Neither libxl nor the balloon driver keep track of that at the moment. I think that qemu needs to tell libxl how much memory it is using for all of its needs -- including option ROMs. (See my example below.) For older qemus we can just make some assumptions like we always have. I do think it would make sense to have the hvmloader amount listed somewhere explicitly. I'm not sure how often hvmloader may need to change the amount it uses for itself. >> What about this: >> * Libxl has a maximum amount of RAM that qemu is *allowed* to use to set >> up virtual ROMs, video ram for virtual devices, &c >> * At start-of-day, it sets maxpages to PAGES(virtual RAM)+PAGES(magic) + >> max_qemu_pages >> * Qemu allocates as many pages as it needs for option ROMS, and writes >> the amount that it actually did use into a special node in xenstore. >> * When the domain is unpaused, libxl will set maxpages to PAGES(virtual >> RAM) + PAGES(magic) + actual_qemu_pages that it gets from xenstore. >> > > I think this does match What Wei Liu said: The suggestion you quote below is that the *user* should have to put in some number in the config file, not that qemu should write the number into xenstore. The key distinction of my suggestion was to set maxpages purposely high, wait for qemu to use what it needs, then to reduce it down to what was needed. -George ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 15:37 ` George Dunlap @ 2015-06-08 16:06 ` Don Slutz 2015-06-09 10:00 ` George Dunlap 2015-06-09 10:14 ` Stefano Stabellini 1 sibling, 1 reply; 38+ messages in thread From: Don Slutz @ 2015-06-08 16:06 UTC (permalink / raw) To: George Dunlap, Stefano Stabellini, Ian Campbell Cc: Ian Jackson, Andrew Cooper, Wei Liu, xen-devel@lists.xen.org On 06/08/15 11:37, George Dunlap wrote: > On 06/08/2015 04:01 PM, Don Slutz wrote: >> On 06/08/15 10:20, George Dunlap wrote: >>> And at the moment, pages in the p2m are allocated by a number of entities: >>> * In the libxc domain builder. >>> * In the guest balloon driver >>> * And now, in qemu, to allocate extra memory for virtual ROMs. >> >> This is not correct. QEMU and hvmloader both allocate pages for their >> use. LIBXL_MAXMEM_CONSTANT allows QEMU and hvmloader to allocate some >> pages. The QEMU change only comes into play after LIBXL_MAXMEM_CONSTANT >> has been reached. > > Thanks -- so the correct statement here is (in time order): > > Pages in the p2m are allocated by a number of entities: > * In the libxc domain builder > * In qemu > * In hvmloader > * In the guest balloon driver > That is my understanding. As Ian C pointed out there is a file: docs/misc/libxl_memory.txt That attempts to talk about this. >>> For the first two, it's libxl that sets maxmem, based in its calculation >>> of the size of virtual RAM plus various other bits that will be needed. >>> Having qemu *also* set maxmem was always the wrong thing to do, IMHO. >>> >> >> It does it for all 3 (4?) because it adds LIBXL_MAXMEM_CONSTANT. > > So the correct statement is: > > In the past, libxl has set maxmem for all of those, based on its > calculation of virtual RAM plus various other bits that might be needed > (including pages needed by qemu or hvmloader). > > The change as of qemu $WHATEVER is that now qemu also sets it if it > finds that libxl didn't give it enough "slack". That was always the > wrong thing to do, IMHO. > Ok. >>> In theory, from the interface perspective, what libxl promises to >>> provide is virtual RAM. When you say "memory=8192" in a domain config, >>> that means (or should mean) 8192MiB of virtual RAM, exclusive of video >>> RAM, virtual ROMs, and magic pages. Then when you say "xl mem-set >>> 4096", it should again be aiming at giving the VM the equivalent of >>> 4096MiB of virtual RAM, exclusive of video RAM, &c &c. >> >> >> Not what is currently done. virtual video RAM is subtracted from "memory=". > > Right. > > After I sent this, it occurred to me that there were two sensible > interpretations of "memory=". The first is, "This is how much virtual > RAM to give the guest. Please allocate non-RAM pages in addition to > this." The second is, "This is the total amount of host RAM I want the > guest to use. Please take non-RAM pages from this total amount." > > In reality we apparently do neither of these. :-) > > I think both break the "principle of least surprise" in different ways, > but I suspect that admins on the whole would rather have the second > interpretation, as I think it makes their lives a bit easier. > Before I knew as much about this as I currently do, I had assumed that second interpretation was what libxl did. Normally video RAM is the largest amount and so the smaller delta (LIBXL_MAXMEM_CONSTANT 1MiB and LIBXL_HVM_EXTRA_MEMORY 2MiB) just was not noticed. There is also shadow memory, which needs to be in the above. >>> We already have the problem that the balloon driver at the moment >>> doesn't actually know how big the guest RAM is, nor , but is being told >>> to make a balloon exactly big enough to bring the total RAM down to a >>> specific target. >>> >>> I think we do need to have some place in the middle that actually knows >>> how much memory is actually needed for the different sub-systems, so it >>> can calculate and set maxmem appropriately. libxl is the obvious place. >> >> Maybe. So you want libxl to know the detail of balloon overhead? How >> about the different sizes of all possible Option ROMs in all QEMU >> version? What about hvmloader usage of memory? > > I'm not sure what you mean by "balloon overhead", but if you mean "guest > pages wasted keeping track of pages which have been ballooned out", then > no, that's not what I mean. Neither libxl nor the balloon driver keep > track of that at the moment. > I was trying to refer to: NOTE: Because of the way ballooning works, the guest has to allocate memory to keep track of maxmem pages, regardless of how much memory it actually has available to it. A guest with maxmem=262144 and memory=8096 will report significantly less memory available for use than a system with maxmem=8096 memory=8096 due to the memory overhead of having to track the unused pages. (from xl.cfg man page). > I think that qemu needs to tell libxl how much memory it is using for > all of its needs -- including option ROMs. (See my example below.) For > older qemus we can just make some assumptions like we always have. > I am happy with this. Note: I think libxl could determine this number now without QEMU changes. However it does depend on no other thread changing a "staring" domain's memory while libxl is calculating this. > I do think it would make sense to have the hvmloader amount listed > somewhere explicitly. I'm not sure how often hvmloader may need to > change the amount it uses for itself. > hvmloader does yet a different method. If xc_domain_populate_physmap_exact() fails, it reduces guest RAM (if my memory is correct). >>> What about this: >>> * Libxl has a maximum amount of RAM that qemu is *allowed* to use to set >>> up virtual ROMs, video ram for virtual devices, &c >>> * At start-of-day, it sets maxpages to PAGES(virtual RAM)+PAGES(magic) + >>> max_qemu_pages >>> * Qemu allocates as many pages as it needs for option ROMS, and writes >>> the amount that it actually did use into a special node in xenstore. >>> * When the domain is unpaused, libxl will set maxpages to PAGES(virtual >>> RAM) + PAGES(magic) + actual_qemu_pages that it gets from xenstore. >>> >> >> I think this does match What Wei Liu said: > > The suggestion you quote below is that the *user* should have to put in > some number in the config file, not that qemu should write the number > into xenstore. > > The key distinction of my suggestion was to set maxpages purposely high, > wait for qemu to use what it needs, then to reduce it down to what was > needed. > Sorry, I did not get that. -Don Slutz > -George > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 16:06 ` Don Slutz @ 2015-06-09 10:00 ` George Dunlap 2015-06-09 10:17 ` Wei Liu 0 siblings, 1 reply; 38+ messages in thread From: George Dunlap @ 2015-06-09 10:00 UTC (permalink / raw) To: Don Slutz, Stefano Stabellini, Ian Campbell Cc: Ian Jackson, Andrew Cooper, Wei Liu, xen-devel@lists.xen.org On 06/08/2015 05:06 PM, Don Slutz wrote: > On 06/08/15 11:37, George Dunlap wrote: >>>> We already have the problem that the balloon driver at the moment >>>> doesn't actually know how big the guest RAM is, nor , but is being told >>>> to make a balloon exactly big enough to bring the total RAM down to a >>>> specific target. >>>> >>>> I think we do need to have some place in the middle that actually knows >>>> how much memory is actually needed for the different sub-systems, so it >>>> can calculate and set maxmem appropriately. libxl is the obvious place. >>> >>> Maybe. So you want libxl to know the detail of balloon overhead? How >>> about the different sizes of all possible Option ROMs in all QEMU >>> version? What about hvmloader usage of memory? >> >> I'm not sure what you mean by "balloon overhead", but if you mean "guest >> pages wasted keeping track of pages which have been ballooned out", then >> no, that's not what I mean. Neither libxl nor the balloon driver keep >> track of that at the moment. >> > > I was trying to refer to: > > NOTE: Because of the way ballooning works, the guest has to allocate > memory to keep track of maxmem pages, regardless of how much memory it > actually has available to it. A guest with maxmem=262144 and > memory=8096 will report significantly less memory available for use than > a system with maxmem=8096 memory=8096 due to the memory overhead of > having to track the unused pages. > > (from xl.cfg man page). Right -- that's what I guessed. As I said, at the moment the balloon driver is (in theory) given the target and asked to balloon down such that tot_pages that the guest uses would be equal to the tot_pages it would use if it were given that amount of memory. It's got nothing to do with what a user process sees from inside the guest (which is what this note is referring to). >> I think that qemu needs to tell libxl how much memory it is using for >> all of its needs -- including option ROMs. (See my example below.) For >> older qemus we can just make some assumptions like we always have. >> > > I am happy with this. Note: I think libxl could determine this number > now without QEMU changes. However it does depend on no other thread > changing a "staring" domain's memory while libxl is calculating this. > >> I do think it would make sense to have the hvmloader amount listed >> somewhere explicitly. I'm not sure how often hvmloader may need to >> change the amount it uses for itself. >> > > hvmloader does yet a different method. If > xc_domain_populate_physmap_exact() fails, it reduces guest RAM (if my > memory is correct). This makes me wonder if we could make qemu do the same thing. -George ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-09 10:00 ` George Dunlap @ 2015-06-09 10:17 ` Wei Liu 0 siblings, 0 replies; 38+ messages in thread From: Wei Liu @ 2015-06-09 10:17 UTC (permalink / raw) To: George Dunlap Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper, Ian Jackson, Don Slutz, xen-devel@lists.xen.org On Tue, Jun 09, 2015 at 11:00:12AM +0100, George Dunlap wrote: > >> I think that qemu needs to tell libxl how much memory it is using for > >> all of its needs -- including option ROMs. (See my example below.) For > >> older qemus we can just make some assumptions like we always have. > >> > > > > I am happy with this. Note: I think libxl could determine this number > > now without QEMU changes. However it does depend on no other thread > > changing a "staring" domain's memory while libxl is calculating this. > > > >> I do think it would make sense to have the hvmloader amount listed > >> somewhere explicitly. I'm not sure how often hvmloader may need to > >> change the amount it uses for itself. > >> > > > > hvmloader does yet a different method. If > > xc_domain_populate_physmap_exact() fails, it reduces guest RAM (if my > > memory is correct). > > This makes me wonder if we could make qemu do the same thing. > I don't think so, because QEMU can't adjust hvm_info. > -George ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 15:37 ` George Dunlap 2015-06-08 16:06 ` Don Slutz @ 2015-06-09 10:14 ` Stefano Stabellini 2015-06-09 11:20 ` George Dunlap 2015-06-09 12:45 ` Ian Campbell 1 sibling, 2 replies; 38+ messages in thread From: Stefano Stabellini @ 2015-06-09 10:14 UTC (permalink / raw) To: George Dunlap Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper, Ian Jackson, Don Slutz, xen-devel@lists.xen.org On Mon, 8 Jun 2015, George Dunlap wrote: > I think that qemu needs to tell libxl how much memory it is using for > all of its needs -- including option ROMs. (See my example below.) For > older qemus we can just make some assumptions like we always have. I'll just note that I have still not seen any evidence that telling libxl would improve things. It seems to me that we are just adding one more layer of indirection just to solve the migration issue elsewhere. I haven't seen convincing examples that things would be better by telling libxl yet, except that we would be able to fix the migration problem. I don't think that the current solution is inherently racy. If we are really interested in an honest evaluation of the current solution in terms of races, I would be happy to do so. When I committed it, I didn't do it without thinking. I don't think that libxl should be in the middle of this, because I don't think it has anything to add. That said, as I am not reading any new arguments and it doesn't look like the thread is going in the right direction from my POV, I don't think there is any need to repeat myself again, so I'll step away from this thread and let you decide what you think is the best way forward. Of course I'll evaluate any QEMU patches on their merits as always. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-09 10:14 ` Stefano Stabellini @ 2015-06-09 11:20 ` George Dunlap 2015-06-16 16:44 ` Stefano Stabellini 2015-06-09 12:45 ` Ian Campbell 1 sibling, 1 reply; 38+ messages in thread From: George Dunlap @ 2015-06-09 11:20 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, Ian Campbell, Andrew Cooper, Ian Jackson, Don Slutz, xen-devel@lists.xen.org On 06/09/2015 11:14 AM, Stefano Stabellini wrote: > On Mon, 8 Jun 2015, George Dunlap wrote: >> I think that qemu needs to tell libxl how much memory it is using for >> all of its needs -- including option ROMs. (See my example below.) For >> older qemus we can just make some assumptions like we always have. > > I'll just note that I have still not seen any evidence that telling > libxl would improve things. It seems to me that we are just adding one > more layer of indirection just to solve the migration issue elsewhere. > I haven't seen convincing examples that things would be better by telling > libxl yet, except that we would be able to fix the migration problem. So from a migration perspective, I agree there's not much difference between "libxl reads max_pages from Xen, inserts it into the migration stream, and sets it again on the other side" and "libxl has global knowledge of what max_pages should be, inserts it into the migration stream, and sets it again on the other side". Improvements of having it in libxl: 1. libxl is not an external interface; we can change and improve things as we go along. Whatever we put in qemu we have to support indefinitely. 2. We can begin to solve the "where is the memory" mess that is the current state of things. 3. In particular, we could conceivably actually change the interface to be consistent, so that "memory" means "the maximum amount of memory used by the guest including all overhead", rather than the random who knows what we have now. This will be impossible if we do it in qemu. Having more than one place change max_pages would be poor design even if we were still packaging everything together in the same release. Having an external entity with which we must be backwards-compatible change max_pages it is just a bad idea and always was. > When I committed it, I didn't > do it without thinking. I don't think that libxl should be in the middle > of this, because I don't think it has anything to add. Just to be clear, nobody thinks you did (at least, as far as I know). The problem is that all of us are so specialized: some people work almost entirely within qemu, others in the kernel, others in the toolstack, others in Xen. It's just natural for people who work mainly within one component to look at the problem from a particular perspective. But it causes exactly this sort of problem, where the left hand doesn't know what the right hand is doing, and the overall system is really uncoordinated. -George ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-09 11:20 ` George Dunlap @ 2015-06-16 16:44 ` Stefano Stabellini 0 siblings, 0 replies; 38+ messages in thread From: Stefano Stabellini @ 2015-06-16 16:44 UTC (permalink / raw) To: George Dunlap Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper, Ian Jackson, Don Slutz, xen-devel@lists.xen.org On Tue, 9 Jun 2015, George Dunlap wrote: > On 06/09/2015 11:14 AM, Stefano Stabellini wrote: > > On Mon, 8 Jun 2015, George Dunlap wrote: > >> I think that qemu needs to tell libxl how much memory it is using for > >> all of its needs -- including option ROMs. (See my example below.) For > >> older qemus we can just make some assumptions like we always have. > > > > I'll just note that I have still not seen any evidence that telling > > libxl would improve things. It seems to me that we are just adding one > > more layer of indirection just to solve the migration issue elsewhere. > > I haven't seen convincing examples that things would be better by telling > > libxl yet, except that we would be able to fix the migration problem. > > So from a migration perspective, I agree there's not much difference > between "libxl reads max_pages from Xen, inserts it into the migration > stream, and sets it again on the other side" and "libxl has global > knowledge of what max_pages should be, inserts it into the migration > stream, and sets it again on the other side". > > Improvements of having it in libxl: > > 1. libxl is not an external interface; we can change and improve things > as we go along. Actually libxl is an external interface, did you mean libxc? > Whatever we put in qemu we have to support indefinitely. The maxmem call is already in QEMU :-) > 2. We can begin to solve the "where is the memory" mess that is the > current state of things. > > 3. In particular, we could conceivably actually change the interface to > be consistent, so that "memory" means "the maximum amount of memory used > by the guest including all overhead", rather than the random who knows > what we have now. That's it, I agree. But the issue is that we don't have a solution or even the design of a solution. If we had, and the plan included reverting the QEMU commit, then fine, let's do it. > This will be impossible if we do it in qemu. > > Having more than one place change max_pages would be poor design even if > we were still packaging everything together in the same release. Having > an external entity with which we must be backwards-compatible change > max_pages it is just a bad idea and always was. I disagree, if it was such an obvious design mistake, more people would have picked up on it immediately when the patch was posted the first time around. This is blurry terrain at best. > > When I committed it, I didn't > > do it without thinking. I don't think that libxl should be in the middle > > of this, because I don't think it has anything to add. > > Just to be clear, nobody thinks you did (at least, as far as I know). > The problem is that all of us are so specialized: some people work > almost entirely within qemu, others in the kernel, others in the > toolstack, others in Xen. It's just natural for people who work mainly > within one component to look at the problem from a particular > perspective. But it causes exactly this sort of problem, where the left > hand doesn't know what the right hand is doing, and the overall system > is really uncoordinated. That is true. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-09 10:14 ` Stefano Stabellini 2015-06-09 11:20 ` George Dunlap @ 2015-06-09 12:45 ` Ian Campbell 2015-06-17 13:35 ` Stefano Stabellini 1 sibling, 1 reply; 38+ messages in thread From: Ian Campbell @ 2015-06-09 12:45 UTC (permalink / raw) To: Stefano Stabellini Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Don Slutz, xen-devel@lists.xen.org On Tue, 2015-06-09 at 11:14 +0100, Stefano Stabellini wrote: > I don't think that the current solution is inherently racy. If we are > really interested in an honest evaluation of the current solution in > terms of races, I would be happy to do so. Consider a domain with 1M of RAM (==target and maxmem for the sake of argument) and two simultaneous calls to libxl_set_memory_target, both with relative=0 and enforce=1, one with target=3 and the other with target=5. target=3 call target=5 call get ptr.max_memkb, gets 1 get ptr.max_memkb, gets 1 start transaction do xenstore stuff, seeing target=1, setting target=3 memorykb = ptr.max_memkb - current_target_memkb + new_target_memkb; memorykb = 1 - 1 + 3 xc_setmaxmem(3) transaction commit (success) Now target=maxmem=3 start transaction do xenstore stuff, seeing target=3, setting target=5 memorykb = ptr.max_memkb - current_target_memkb + new_target_memkb; memorykb = 1 - 3 + 5 xc_setmaxmem(3) transaction commit (success) Now target=5, maxmem=3. The obvious fix of moving the get ptr.max_memkb inside the transaction fails in a different way in the case where the first transaction commit fails and is retried after the second one, I think. Prior to 0c029c4da21 "libxl_set_memory_target: retain the same maxmem offset on top of the current target" this issue didn't exist because memorykb was just memorykb = new_target_memkb + videoram. BTW, I noticed some other (unrelated) dubious stuff in there while looking at this, in particular the setmaxmem is not rolled back if set_pod_target fails. Also the transiently "wrong" maxmem between the setmaxmem and a failed transaction getting redone might possibly give rise to some interesting cases, especially if anything else fails the second time around the loop. > When I committed it, I didn't do it without thinking. Likewise I thought about it when reviewing, but it seems we have both missed an important aspect. Ian. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-09 12:45 ` Ian Campbell @ 2015-06-17 13:35 ` Stefano Stabellini 0 siblings, 0 replies; 38+ messages in thread From: Stefano Stabellini @ 2015-06-17 13:35 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, Don Slutz, xen-devel@lists.xen.org On Tue, 9 Jun 2015, Ian Campbell wrote: > On Tue, 2015-06-09 at 11:14 +0100, Stefano Stabellini wrote: > > I don't think that the current solution is inherently racy. If we are > > really interested in an honest evaluation of the current solution in > > terms of races, I would be happy to do so. > > Consider a domain with 1M of RAM (==target and maxmem for the sake of > argument) and two simultaneous calls to libxl_set_memory_target, both > with relative=0 and enforce=1, one with target=3 and the other with > target=5. > > target=3 call target=5 call > > get ptr.max_memkb, gets 1 > > get ptr.max_memkb, gets 1 > start transaction > do xenstore stuff, seeing target=1, setting target=3 > memorykb = ptr.max_memkb - current_target_memkb + new_target_memkb; > memorykb = 1 - 1 + 3 > xc_setmaxmem(3) > transaction commit (success) > > Now target=maxmem=3 > > start transaction > do xenstore stuff, seeing target=3, setting target=5 > memorykb = ptr.max_memkb - current_target_memkb + new_target_memkb; > memorykb = 1 - 3 + 5 > xc_setmaxmem(3) > transaction commit (success) > > Now target=5, maxmem=3. Yes, I see the issue. Pretty nasty! > The obvious fix of moving the get ptr.max_memkb inside the transaction > fails in a different way in the case where the first transaction commit > fails and is retried after the second one, I think. The underlying problem is that xc_domain_setmaxmem only takes absolute values. Retrieving the old maxmem value and setting the new one cannot be done atomically, so they need to be protected by a lock or a transaction. You are right that moving get ptr.max_memkb inside the transaction would only fix the problem if we also properly rolled back the old maxmem value in case of failures. But I don't think it is actually possible because it could race against other libxl_set_memory_target calls. Am I wrong? This would also be a problem before 0c029c4da21. > Prior to 0c029c4da21 "libxl_set_memory_target: retain the same maxmem > offset on top of the current target" this issue didn't exist because > memorykb was just memorykb = new_target_memkb + videoram. It is true that reverting 0c029c4da21 would improve the situation. However I think that you found a problem here that goes beyond 0c029c4da21 :-( > BTW, I noticed some other (unrelated) dubious stuff in there while > looking at this, in particular the setmaxmem is not rolled back if > set_pod_target fails. Yes, you are right! It is also not rolled back in case the xenstore transaction fails with errno != EAGAIN, even before 0c029c4da21. > Also the transiently "wrong" maxmem between the > setmaxmem and a failed transaction getting redone might possibly give > rise to some interesting cases, especially if anything else fails the > second time around the loop. I am thinking that the while idea of an "enforce" option was a bad to begin with. Not only we should revert 0c029c4da21, but actually we should just get rid of "enforce" altogether? I am suggesting it because I don't think that reverting 0c029c4da21 would fix all the issues. In a pre-0c029c4da21 world: target=3 enforce=1 start transaction do xenstore stuff, seeing target=1, setting target=3 memorykb = new_target_memkb + videoram; xc_setmaxmem(3+videoram) transaction commit (fail errno=ESOMETHING) Now target=1, maxmem=3+videoram Unless we say that we don't care about leaving the system in a consistent state in case of failures != EAGAIN. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 12:11 ` Ian Campbell 2015-06-08 13:22 ` Stefano Stabellini @ 2015-06-08 14:53 ` Konrad Rzeszutek Wilk 2015-06-08 15:20 ` George Dunlap 1 sibling, 1 reply; 38+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-06-08 14:53 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, Jun 08, 2015 at 01:11:38PM +0100, Ian Campbell wrote: > On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote: > > > > > I disagree that libxl should be the arbitrator of a property that is > > > > stored, maintained and enforced by Xen. Xen should be the arbitrator. > > > > > > That's not what "arbitrator" means, an arbitrator decides what the value > > > should be, but that doesn't necessarily imply that it either stores, > > > maintains nor enforces that value. > > > > The way you describe it, it looks like some kind of host wide memory > > policy manager, > > Not necessarily. > > We need to make a distinction between the entity which enforces > maxpages, that (or those) which decide on what maxpages should be and > tells Xen. > > Xen is the enforcer. It is also therefore by necessity the holder of the > canonical value for the maxpages, since it has to be. > > Xen cannot be the second, it simply doesn't have the necessary > information to make a decision about a domain's memory limit, it just > blindly does whatever it is told by the last thing to tell it something. > > There are lots of ways the desired value of maxpages can be decided on, > given a bunch of entities all with some level of responsibility for the > maxpages of a given domain: > > 1. One single entity which decides on the limit and tells Xen. I > think this is the xapi model (not squeezed/balloond, they just > ask xapi to do things AFAIK, so are a red-herring). The unwary > might think "libxl" was in this category, but it isn't since it > is in fact multiple instances of the library. > 2. A group of entities which cooperate as a distributed system such > that any one of them can recalculate (from from the shared > state, or by intra entity communication) the current desired > value of max pages and propagate it to Xen. Prior to QEMU > starting to play with the max pages the multiple libxl's on a > system fell into this case, coordinating via xenstore > transactions and libxl userdata. > 3. A group of entities which operate in isolation by only ever > increasing or descreasing the max pages according to their own > requirements, without reference to anyone else. When QEMU > entered the fray, and with the various libxl fixes since, you > might think we are implementing this model, but we aren't > because the hypervisor interface is only "set", not > "increment/decrement" and so there is a racy read/modify/write > cycle in every entity now. > 4. Xen exposes multiple "maxpages" variables corresponding to the > various entities which have a need to control this (or multiple > per entity). Xen adds all of those up to come up with the actual > maxpages. > 5. I thought there wasw one more but I've forgotten what it was. claim hypercall kind of fits in this category. It sticks an stake in the ground so that the 'memory' available for a guest will be available when QEMU,libxl etc go up and down. And when libxl is done it unclaims the memory (which hopefully at this point has all been given to the guest). > > We used to have #2, we now have a broken version of #3 (broken because > it is racy due to the xen interface used). > > Fixing #2 would be a case of adding a qmp command to allow libxl to ask > qmp what its extra memory needs are at any given point (start or day, > post hotplug event etc, or even of every memset). > > Fixing #3 would be a case of adding a new hypercall interface to allow > the amount of memory to be changed by a delta instead of just set and > having everyone use it. You'd need to decide on migration whether to > propagate the current at stop and copy time and having everyone migrate > their requirements vs having everyone reregister their requirements on > the far end. > > #4 is a new hypercall interface, which everyone would have to use. I > happen to think this is a pretty silly design. > > > the same way as other memory management tools were never accepted into > > xl/libxl but kept to separate daemons, like xenballoond or squeezed. > > I think squeezed and friends are a red-herring here, since they are > either a client of one or more entities or in the more distributed ones > are just one party among many. > > > Let's step away from this specific issue for a second. If it is not an > > host way policy manager but a per-VM layer on top of libxc, what value > > is this indirection actually adding? > > Given the raciness of libxc -- it is adding correctness. > > > > What locking is there around QEMU's read/modify/write of the maxmem > > > value? What happens if someone else also modifies maxmem at the > > > same time? > > > > It only happens at start of day before the domain is unpaused. At the > > time I couldn't come up with a scenario where it would be an issue, > > unless the admin is purposely trying to shut himself in the foot. > > I don't think there is anything preventing a call to the toolstack to > set the amount of memory for a domain right after the domid exists. > > In a system with something like balloond or squeezed I can well imagine > them doing something when a new domain was started resulting in opening > of this race condition. (even if something is "ask libxl" not "call > xc_setmaxpages"). > > Ian. > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 14:53 ` Konrad Rzeszutek Wilk @ 2015-06-08 15:20 ` George Dunlap 2015-06-08 15:42 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 38+ messages in thread From: George Dunlap @ 2015-06-08 15:20 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, Ian Campbell Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson, dslutz, xen-devel On 06/08/2015 03:53 PM, Konrad Rzeszutek Wilk wrote: > On Mon, Jun 08, 2015 at 01:11:38PM +0100, Ian Campbell wrote: >> On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote: >> >>>>> I disagree that libxl should be the arbitrator of a property that is >>>>> stored, maintained and enforced by Xen. Xen should be the arbitrator. >>>> >>>> That's not what "arbitrator" means, an arbitrator decides what the value >>>> should be, but that doesn't necessarily imply that it either stores, >>>> maintains nor enforces that value. >>> >>> The way you describe it, it looks like some kind of host wide memory >>> policy manager, >> >> Not necessarily. >> >> We need to make a distinction between the entity which enforces >> maxpages, that (or those) which decide on what maxpages should be and >> tells Xen. >> >> Xen is the enforcer. It is also therefore by necessity the holder of the >> canonical value for the maxpages, since it has to be. >> >> Xen cannot be the second, it simply doesn't have the necessary >> information to make a decision about a domain's memory limit, it just >> blindly does whatever it is told by the last thing to tell it something. >> >> There are lots of ways the desired value of maxpages can be decided on, >> given a bunch of entities all with some level of responsibility for the >> maxpages of a given domain: >> >> 1. One single entity which decides on the limit and tells Xen. I >> think this is the xapi model (not squeezed/balloond, they just >> ask xapi to do things AFAIK, so are a red-herring). The unwary >> might think "libxl" was in this category, but it isn't since it >> is in fact multiple instances of the library. >> 2. A group of entities which cooperate as a distributed system such >> that any one of them can recalculate (from from the shared >> state, or by intra entity communication) the current desired >> value of max pages and propagate it to Xen. Prior to QEMU >> starting to play with the max pages the multiple libxl's on a >> system fell into this case, coordinating via xenstore >> transactions and libxl userdata. >> 3. A group of entities which operate in isolation by only ever >> increasing or descreasing the max pages according to their own >> requirements, without reference to anyone else. When QEMU >> entered the fray, and with the various libxl fixes since, you >> might think we are implementing this model, but we aren't >> because the hypervisor interface is only "set", not >> "increment/decrement" and so there is a racy read/modify/write >> cycle in every entity now. >> 4. Xen exposes multiple "maxpages" variables corresponding to the >> various entities which have a need to control this (or multiple >> per entity). Xen adds all of those up to come up with the actual >> maxpages. >> 5. I thought there wasw one more but I've forgotten what it was. > > claim hypercall kind of fits in this category. It sticks an stake in the ground > so that the 'memory' available for a guest will be available when QEMU,libxl > etc go up and down. > > And when libxl is done it unclaims the memory (which hopefully at this > point has all been given to the guest). Right -- in an ideal world maxpages and claim would be set at the same time, so that you wouldn't have to worry (for instance) about qemu failing because although it was *allowed* to have more memory to allocate ROMs, it wan't able to due to lack of available host memory. But deciding exactly how big either maxpages or the claim needs to be involves either having a central place where all the values are stored and calculated (1, 2, or 4), or having each of them just increment or decrement the total based on their own need (3), with no way of accounting for who asked for what. -George ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-08 15:20 ` George Dunlap @ 2015-06-08 15:42 ` Konrad Rzeszutek Wilk 0 siblings, 0 replies; 38+ messages in thread From: Konrad Rzeszutek Wilk @ 2015-06-08 15:42 UTC (permalink / raw) To: George Dunlap Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper, Ian Jackson, dslutz, xen-devel On Mon, Jun 08, 2015 at 04:20:48PM +0100, George Dunlap wrote: > On 06/08/2015 03:53 PM, Konrad Rzeszutek Wilk wrote: > > On Mon, Jun 08, 2015 at 01:11:38PM +0100, Ian Campbell wrote: > >> On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote: > >> > >>>>> I disagree that libxl should be the arbitrator of a property that is > >>>>> stored, maintained and enforced by Xen. Xen should be the arbitrator. > >>>> > >>>> That's not what "arbitrator" means, an arbitrator decides what the value > >>>> should be, but that doesn't necessarily imply that it either stores, > >>>> maintains nor enforces that value. > >>> > >>> The way you describe it, it looks like some kind of host wide memory > >>> policy manager, > >> > >> Not necessarily. > >> > >> We need to make a distinction between the entity which enforces > >> maxpages, that (or those) which decide on what maxpages should be and > >> tells Xen. > >> > >> Xen is the enforcer. It is also therefore by necessity the holder of the > >> canonical value for the maxpages, since it has to be. > >> > >> Xen cannot be the second, it simply doesn't have the necessary > >> information to make a decision about a domain's memory limit, it just > >> blindly does whatever it is told by the last thing to tell it something. > >> > >> There are lots of ways the desired value of maxpages can be decided on, > >> given a bunch of entities all with some level of responsibility for the > >> maxpages of a given domain: > >> > >> 1. One single entity which decides on the limit and tells Xen. I > >> think this is the xapi model (not squeezed/balloond, they just > >> ask xapi to do things AFAIK, so are a red-herring). The unwary > >> might think "libxl" was in this category, but it isn't since it > >> is in fact multiple instances of the library. > >> 2. A group of entities which cooperate as a distributed system such > >> that any one of them can recalculate (from from the shared > >> state, or by intra entity communication) the current desired > >> value of max pages and propagate it to Xen. Prior to QEMU > >> starting to play with the max pages the multiple libxl's on a > >> system fell into this case, coordinating via xenstore > >> transactions and libxl userdata. > >> 3. A group of entities which operate in isolation by only ever > >> increasing or descreasing the max pages according to their own > >> requirements, without reference to anyone else. When QEMU > >> entered the fray, and with the various libxl fixes since, you > >> might think we are implementing this model, but we aren't > >> because the hypervisor interface is only "set", not > >> "increment/decrement" and so there is a racy read/modify/write > >> cycle in every entity now. > >> 4. Xen exposes multiple "maxpages" variables corresponding to the > >> various entities which have a need to control this (or multiple > >> per entity). Xen adds all of those up to come up with the actual > >> maxpages. > >> 5. I thought there wasw one more but I've forgotten what it was. > > > > claim hypercall kind of fits in this category. It sticks an stake in the ground > > so that the 'memory' available for a guest will be available when QEMU,libxl > > etc go up and down. > > > > And when libxl is done it unclaims the memory (which hopefully at this > > point has all been given to the guest). > > Right -- in an ideal world maxpages and claim would be set at the same > time, so that you wouldn't have to worry (for instance) about qemu > failing because although it was *allowed* to have more memory to > allocate ROMs, it wan't able to due to lack of available host memory. > > But deciding exactly how big either maxpages or the claim needs to be > involves either having a central place where all the values are stored > and calculated (1, 2, or 4), or having each of them just increment or > decrement the total based on their own need (3), with no way of > accounting for who asked for what. Right. I was just pointing out that with claim call you have an semi atomic mechanism to deal with the amount of memory you need - so you don't have to worry about the race aspect (another guest taking away the memory you want to use). So what is the next step? (I concur BTW with your assesment - just not sure which group [1,2, or 4 vs 3] is the right direction). ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: QEMU bumping memory bug analysis 2015-06-05 17:10 ` Stefano Stabellini 2015-06-05 18:10 ` Wei Liu 2015-06-05 18:49 ` Ian Campbell @ 2015-06-08 14:14 ` George Dunlap 2 siblings, 0 replies; 38+ messages in thread From: George Dunlap @ 2015-06-08 14:14 UTC (permalink / raw) To: Stefano Stabellini, Wei Liu Cc: Andrew Cooper, Ian Jackson, Ian Campbell, dslutz, xen-devel On 06/05/2015 06:10 PM, Stefano Stabellini wrote: >> 5. Add a user configurable field in current libxl JSON structure to >> record how much more memory this domain needs. Admin is required to >> fill in that value manually. In the mean time we revert the change in >> QEMU and declare QEMU with that change buggy. > > QEMU 2.3.0 was released with that change in it, so it is not quite > possible to revert it. This is at least one reason it should never have been in qemu in the first place, and why we should deprecate that as soon as possible. -George ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2015-06-17 13:35 UTC | newest] Thread overview: 38+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-06-05 16:43 QEMU bumping memory bug analysis Wei Liu 2015-06-05 16:58 ` Ian Campbell 2015-06-05 17:13 ` Stefano Stabellini 2015-06-05 19:06 ` Wei Liu 2015-06-05 17:17 ` Andrew Cooper 2015-06-05 17:39 ` Wei Liu 2015-06-05 17:10 ` Stefano Stabellini 2015-06-05 18:10 ` Wei Liu 2015-06-08 11:39 ` Stefano Stabellini 2015-06-08 12:14 ` Andrew Cooper 2015-06-08 13:01 ` Stefano Stabellini 2015-06-08 13:33 ` Jan Beulich 2015-06-08 13:10 ` Wei Liu 2015-06-08 13:27 ` Stefano Stabellini 2015-06-08 13:32 ` Wei Liu 2015-06-08 13:38 ` Stefano Stabellini 2015-06-08 13:44 ` Andrew Cooper 2015-06-08 13:45 ` Stefano Stabellini 2015-06-05 18:49 ` Ian Campbell 2015-06-08 11:40 ` Stefano Stabellini 2015-06-08 12:11 ` Ian Campbell 2015-06-08 13:22 ` Stefano Stabellini 2015-06-08 13:52 ` Ian Campbell 2015-06-08 14:20 ` George Dunlap 2015-06-08 15:01 ` Don Slutz 2015-06-08 15:37 ` George Dunlap 2015-06-08 16:06 ` Don Slutz 2015-06-09 10:00 ` George Dunlap 2015-06-09 10:17 ` Wei Liu 2015-06-09 10:14 ` Stefano Stabellini 2015-06-09 11:20 ` George Dunlap 2015-06-16 16:44 ` Stefano Stabellini 2015-06-09 12:45 ` Ian Campbell 2015-06-17 13:35 ` Stefano Stabellini 2015-06-08 14:53 ` Konrad Rzeszutek Wilk 2015-06-08 15:20 ` George Dunlap 2015-06-08 15:42 ` Konrad Rzeszutek Wilk 2015-06-08 14:14 ` George Dunlap
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.