* Re: regarding the x86_64 zero-based percpu patches
From: Tejun Heo @ 2009-01-07 12:13 UTC
To: Ingo Molnar
Cc: travis, Rusty Russell, Linux Kernel Mailing List, H. Peter Anvin,
    Andrew Morton, Eric Biederman, steiner, Hugh Dickins

(cc'ing people from the original thread and LKML as it seems to require
actual discussion.)

Hello, this thread started with me asking for help regarding the
zero-based percpu patches; the initial message is quoted below.

Ingo Molnar wrote:
> * Tejun Heo <tj@kernel.org> wrote:
>
>> Hello, Mike, Ingo.
>>
>> I was working on something which requires better dynamic per-cpu
>> performance and had been implementing it myself, but I ran into the
>> strange gcc stack protector ABI limitation; with Rusty's hint and
>> some googling I found out that Mike had already done the heavy
>> lifting.
>>
>> I read the "x86_64: Optimize percpu accesses" thread from July last
>> year, and it looks like it got stuck on a toolchain problem which
>> showed up as two symptoms (is one of the two resolved?):
>>
>> * Notifier call chain corruption
>>
>> * Stack overflow with the default stack size
>>
>> From the cpu_alloc thread from November, it seems Mike is quite
>> preoccupied, so I'm willing to give it a shot as it's blocking stuff
>> I have in queue. The problem is that I'm having trouble finding some
>> information:
>>
>> 1. Mike seems to have split the patch but hasn't posted the pieces.
>>
>> 2. Ingo's x86/percpu-zerobased branch doesn't contain any revision
>>    that isn't in the current upstream. Maybe the commits got lost
>>    during merges?
>>
>> 3. What failed, what got fixed, and how to reproduce the problem.
>>
>> So, can you please help me a bit? I'll be happy to forward-port the
>> patches if they have bit-rotted.
>
> hm, i zapped them two days ago, because they collided with Rusty's
> ongoing percpu-alloc work in his tree. Mike should be able to tell you
> what the plans are for the resurrection of those patches.

IIUC, Rusty is somewhat leaning toward limiting the per-cpu area and
using a static allocator. (right?) As I was trying to do more stuff
per-cpu (not putting a lot of stuff into the per-cpu area, but even
with small things a limited per-cpu area poses scalability problems),
cpu_alloc seems to fit the bill better.

Anyways, I think it's worthwhile to listen to what people have in mind
regarding how per-cpu stuff should proceed.

Thanks.

--
tejun
* Re: regarding the x86_64 zero-based percpu patches
From: Rusty Russell @ 2009-01-10 6:46 UTC
To: Tejun Heo
Cc: Ingo Molnar, travis, Linux Kernel Mailing List, H. Peter Anvin,
    Andrew Morton, Eric Biederman, steiner, Hugh Dickins, Christoph Lameter

On Wednesday 07 January 2009 22:43:25 Tejun Heo wrote:
> IIUC, Rusty is somewhat leaning toward limiting the per-cpu area and
> using a static allocator. (right?)

Not quite. Six years ago I didn't do "proper" dynamic per-cpu because
of this lack-of-expanding problem. I expected that I (or someone else)
would fix that and the current temporary solution would be replaced.
But Christoph showed that even in a limited form it can be used for
more than static per-cpu vars and such vars in modules. (It's also in
dire need of a cleanup, since there have been several abortive changes
made in the last few years.)

> As I was trying to do more stuff per-cpu (not putting a lot of stuff
> into the per-cpu area, but even with small things a limited per-cpu
> area poses scalability problems), cpu_alloc seems to fit the bill
> better.

Unfortunately cpu_alloc didn't solve this problem either.

We need to grow the areas, but for NUMA layouts it's non-trivial. I
don't like the idea of remapping: one TLB entry per page per cpu is
going to suck. Finding pages which are "congruent" with the original
percpu pages is more promising, but it will almost certainly need to
elbow pages out of the way to have a chance of succeeding on a real
system.

> Anyways, I think it's worthwhile to listen to what people have in
> mind regarding how per-cpu stuff should proceed.

Absolutely.

Thanks,
Rusty.
* Re: regarding the x86_64 zero-based percpu patches
From: Christoph Lameter @ 2009-01-12 17:23 UTC
To: Rusty Russell
Cc: Tejun Heo, Ingo Molnar, travis, Linux Kernel Mailing List,
    H. Peter Anvin, Andrew Morton, Eric Biederman, steiner, Hugh Dickins

On Sat, 10 Jan 2009, Rusty Russell wrote:
> Unfortunately cpu_alloc didn't solve this problem either.
>
> We need to grow the areas, but for NUMA layouts it's non-trivial. I
> don't like the idea of remapping: one TLB entry per page per cpu is
> going to suck. Finding pages which are "congruent" with the original
> percpu pages is more promising, but it will almost certainly need to
> elbow pages out of the way to have a chance of succeeding on a real
> system.

An allocation automatically falls back to the nearest node on NUMA;
cpu_to_node() gives you the current node.

There are 2M TLB entries on x86_64. If we really get into a high-usage
scenario then the 2M entry makes sense. Average server memory sizes
are likely already way beyond 10G per box. The higher that goes, the
more reasonable the 2M TLB entry becomes.
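Node-local fallback needs no special machinery in the allocator; a
minimal sketch of grabbing a page close to a given cpu (GFP flags and
error handling pared down):

#include <linux/gfp.h>
#include <linux/topology.h>

/*
 * Allocate one page as close as possible to the cpu that will use it.
 * alloc_pages_node() falls back to the nearest node with free memory
 * if the requested node is exhausted.
 */
static struct page *alloc_percpu_page(int cpu)
{
        return alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
}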
* Re: regarding the x86_64 zero-based percpu patches
From: Eric W. Biederman @ 2009-01-12 17:44 UTC
To: Christoph Lameter
Cc: Rusty Russell, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Christoph Lameter <cl@linux-foundation.org> writes:
> On Sat, 10 Jan 2009, Rusty Russell wrote:
>> We need to grow the areas, but for NUMA layouts it's non-trivial. I
>> don't like the idea of remapping: one TLB entry per page per cpu is
>> going to suck. [...]
>
> An allocation automatically falls back to the nearest node on NUMA;
> cpu_to_node() gives you the current node.
>
> There are 2M TLB entries on x86_64. If we really get into a high-usage
> scenario then the 2M entry makes sense. Average server memory sizes
> are likely already way beyond 10G per box. The higher that goes, the
> more reasonable the 2M TLB entry becomes.

2M of per cpu data doesn't make sense, and likely indicates a design
flaw somewhere. It just doesn't make sense to have large amounts of
data allocated per cpu.

The most common user of per cpu data I am aware of is allocating one
word per cpu for counters.

What would be better is simply to:
- Require a lock to access another cpu's per cpu data.
- Do large-page allocations for the per cpu data.

At which point we could grow the per cpu data by simply reallocating
it on each cpu and updating the register that holds the base pointer.

Eric
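To make the reallocation idea concrete, a rough sketch under the
assumption that all cpus are quiesced (e.g. via stop_machine()) while
the copy runs; pcpu_base[] and set_percpu_base() are hypothetical
stand-ins, not an existing API:

#include <linux/slab.h>
#include <linux/cpumask.h>
#include <linux/string.h>
#include <linux/topology.h>

/* hypothetical: per-cpu base pointers, one per possible cpu */
extern void *pcpu_base[NR_CPUS];
/* hypothetical: rewrite the base register used for per cpu
 * addressing on the given cpu (the %gs base on x86_64) */
extern void set_percpu_base(int cpu, void *base);

static int grow_percpu_areas(size_t old_size, size_t new_size)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                void *new = kmalloc_node(new_size, GFP_KERNEL,
                                         cpu_to_node(cpu));
                if (!new)
                        return -ENOMEM; /* a real version would unwind */
                memcpy(new, pcpu_base[cpu], old_size);
                memset(new + old_size, 0, new_size - old_size);
                kfree(pcpu_base[cpu]);
                pcpu_base[cpu] = new;
                set_percpu_base(cpu, new);
        }
        return 0;
}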
* Re: regarding the x86_64 zero-based percpu patches
From: Christoph Lameter @ 2009-01-12 19:00 UTC
To: Eric W. Biederman
Cc: Rusty Russell, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Mon, 12 Jan 2009, Eric W. Biederman wrote:
> 2M of per cpu data doesn't make sense, and likely indicates a design
> flaw somewhere. It just doesn't make sense to have large amounts of
> data allocated per cpu.

Some data is not small. MIB data is allocated per cpu, etc.

> What would be better is simply to:
> - Require a lock to access another cpu's per cpu data.
> - Do large-page allocations for the per cpu data.
>
> At which point we could grow the per cpu data by simply reallocating
> it on each cpu and updating the register that holds the base pointer.

If per cpu data areas have no fixed address, then you cannot use list
operations on per cpu data, nor can the address of per cpu variables
be stored anywhere. But maybe that is okay?
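To make the list problem concrete, a small made-up example (not from
the thread) of the kind of code that pins the area in place:

#include <linux/list.h>
#include <linux/percpu.h>

/* assume INIT_LIST_HEAD() has been run on each cpu's copy */
static DEFINE_PER_CPU(struct list_head, local_work);

/*
 * list_add() stores the absolute address of the per-cpu list head in
 * the new node's ->prev and ->next.  If the per cpu area is later
 * reallocated and copied, every queued node still points at the old
 * location -- exactly the "no fixed address" problem above.
 */
static void queue_local(struct list_head *node)
{
        list_add(node, &get_cpu_var(local_work));
        put_cpu_var(local_work);
}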
* Re: regarding the x86_64 zero-based percpu patches
From: Tejun Heo @ 2009-01-13 0:33 UTC
To: Eric W. Biederman
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Hello, Eric.

Eric W. Biederman wrote:
>> There are 2M TLB entries on x86_64. If we really get into a
>> high-usage scenario then the 2M entry makes sense. Average server
>> memory sizes are likely already way beyond 10G per box. The higher
>> that goes, the more reasonable the 2M TLB entry becomes.
>
> 2M of per cpu data doesn't make sense, and likely indicates a design
> flaw somewhere. It just doesn't make sense to have large amounts of
> data allocated per cpu.

Why? On almost all large machines I've seen or heard of, memory size
scales way better than the number of cpus. Whether a certain usage
makes sense is surely debatable, but I can't imagine that all the use
cases where a 2MB percpu TLB entry could be useful would be senseless.

> The most common user of per cpu data I am aware of is allocating one
> word per cpu for counters.
>
> What would be better is simply to:
> - Require a lock to access another cpu's per cpu data.
> - Do large-page allocations for the per cpu data.
>
> At which point we could grow the per cpu data by simply reallocating
> it on each cpu and updating the register that holds the base pointer.

I don't think moving live objects is such a good idea, for the
following reasons.

1. Programming convenience is usually much more important than people
   think it is, even in the kernel. I think it's very likely that
   we'll have an unending stream of small feature requirements which
   step just outside the supported bounds, and ever-smarter
   workarounds, until the restriction is finally removed years later.

2. Moving live objects is inherently dangerous, and it won't happen
   often. Thinking about the possible subtle bugs is scary.

Thanks.

--
tejun
* Re: regarding the x86_64 zero-based percpu patches
From: Eric W. Biederman @ 2009-01-13 3:01 UTC
To: Tejun Heo
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Tejun Heo <tj@kernel.org> writes:
>> 2M of per cpu data doesn't make sense, and likely indicates a design
>> flaw somewhere. It just doesn't make sense to have large amounts of
>> data allocated per cpu.
>
> Why? On almost all large machines I've seen or heard of, memory size
> scales way better than the number of cpus. Whether a certain usage
> makes sense is surely debatable, but I can't imagine that all the use
> cases where a 2MB percpu TLB entry could be useful would be senseless.

Right, there are cases where you could hit 2MB, but they aren't likely
to be that common.

In particular, the common case is to allocate a single word of per cpu
data per allocation request. Getting to 2MB with 8-byte requests takes
262144 distinct allocations, which is a lot more than I expect to be
common any time soon.

So I figure reserving a 2MB TLB entry is not likely what we want in
the common case.

> I don't think moving live objects is such a good idea, for the
> following reasons.
>
> 1. Programming convenience is usually much more important than people
>    think it is, even in the kernel. [...]
>
> 2. Moving live objects is inherently dangerous, and it won't happen
>    often. Thinking about the possible subtle bugs is scary.

But the question is: what is per cpu memory? Per cpu memory is
something we can access quickly without creating cross-cpu cache line
contention.

Accessing that memory from other cpus implies we create that
contention, and that will be the slow path. We need cross-cpu access
for the rollup of statistics, but we clearly don't want to do it
often.

So I expect that most places where we want to store a pointer to per
cpu data will be bugs.

Per cpu memory is not something we ever want to use lightly, so as
long as the rules are clear we should be ok. And simply removing the
ability to take the address of per cpu data would make it impossible
to point into it. So I think it is worth a look, to see if we can move
live per cpu data, as that noticeably simplifies the problem of
growing a per cpu area in the rare case when we need to.

Eric
* Re: regarding the x86_64 zero-based percpu patches
From: Tejun Heo @ 2009-01-13 3:14 UTC
To: Eric W. Biederman
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Hello, Eric.

Eric W. Biederman wrote:
> Right, there are cases where you could hit 2MB, but they aren't
> likely to be that common.
> [...]
> So I figure reserving a 2MB TLB entry is not likely what we want in
> the common case.

Yeap, it probably won't hit 2MB in common cases, but it still needs to
scale for uncommon cases, and if the 4K TLB pressure becomes too high
for those, promoting to a 2MB TLB entry makes sense. IIUC, that's what
Christoph Lameter is intending to do (haven't looked at the code yet
tho).

> But the question is: what is per cpu memory? Per cpu memory is
> something we can access quickly without creating cross-cpu cache line
> contention.
>
> [...]
>
> Per cpu memory is not something we ever want to use lightly, so as
> long as the rules are clear we should be ok. And simply removing the
> ability to take the address of per cpu data would make it impossible
> to point into it. So I think it is worth a look, to see if we can
> move live per cpu data, as that noticeably simplifies the problem of
> growing a per cpu area in the rare case when we need to.

I don't know. I think it's a dangerous thing which can be avoided. If
there's no other solution, then we might have to live with it, but I
don't see the winning benefit of such a design over per-cpu virtual
mapping.

Thanks.

--
tejun
* Re: regarding the x86_64 zero-based percpu patches
From: Eric W. Biederman @ 2009-01-13 4:07 UTC
To: Tejun Heo
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Tejun Heo <tj@kernel.org> writes:
> I don't know. I think it's a dangerous thing which can be avoided. If
> there's no other solution, then we might have to live with it, but I
> don't see the winning benefit of such a design over per-cpu virtual
> mapping.

It isn't incompatible with a per-cpu virtual mapping. It allows the
possibility of each cpu reusing the same chunk of virtual address
space for per cpu memory.

On x86_64 and other architectures with enough address space bits, it
allows us to share the large pages that we use for the normal memory
mapping with the ones for per cpu access.

I definitely think the work of combining the pda and the percpu areas
into a common area is worthwhile.

I think it would be nice if the percpu area could grow rather than
being a fixed size at boot time, but I'm not particularly convinced it
has to.

Eric
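For flavor, this is roughly what a combined, zero-based pda/percpu
access looks like on x86_64 -- a simplified sketch of the
segment-relative read (register-sized variables only; the real
patches differ in detail):

/*
 * Simplified sketch of a zero-based, %gs-relative per-cpu read.
 * Each cpu runs with its %gs base pointing at its own area, so the
 * same zero-based offset resolves to a different per-cpu location on
 * each cpu, with no address arithmetic and no extra TLB footprint.
 */
#define percpu_read(var)                        \
({                                              \
        typeof(var) ret__;                      \
        asm("mov %%gs:%1, %0"                   \
            : "=r" (ret__)                      \
            : "m" (var));                       \
        ret__;                                  \
})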
* Re: regarding the x86_64 zero-based percpu patches
From: Tejun Heo @ 2009-01-14 3:58 UTC
To: Eric W. Biederman
Cc: Christoph Lameter, Rusty Russell, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

Hello, Eric.

Eric W. Biederman wrote:
> It isn't incompatible with a per-cpu virtual mapping. It allows the
> possibility of each cpu reusing the same chunk of virtual address
> space for per cpu memory.
>
> On x86_64 and other architectures with enough address space bits, it
> allows us to share the large pages that we use for the normal memory
> mapping with the ones for per cpu access.
>
> I definitely think the work of combining the pda and the percpu areas
> into a common area is worthwhile.

Yeah, it's gonna be necessary regardless of which way we go.

> I think it would be nice if the percpu area could grow rather than
> being a fixed size at boot time, but I'm not particularly convinced
> it has to.

The main problem is that the area needs to be congruent, which
basically mandates that it be contiguous. The three alternatives on
the table are... (a sketch of the second follows this list)

1. Just reserve memory from the get-go. Simplest. No additional TLB
   pressure, but memory is likely to be wasted and, more importantly,
   scalability suffers.

2. Reserve address space and map memory as necessary. We can be much
   more generous about reserving address space, especially on 64bit
   machines, and can probably mostly forget about the scalability
   issue there. However, getting things just right for
   address-space-constrained 32bit might not be too easy, but then
   again nothing really is scalable on 32bit these days, so we can
   probably live with a boot-time parameter or something. Another
   issue is added TLB pressure, as it's likely to consume 4K TLB
   entries in addition to the default kernel mapping's 2M TLB entries.
   The TLB pressure could mostly be avoided if the percpu area were
   sufficiently large to justify 2MB page allocation, but it isn't.

3. Do realloc(). This doesn't impose scalability issues or add to TLB
   pressure, but it does constrain how the percpu variables can be
   used and introduces the possibility of scary, once-in-a-blue-moon,
   never-reproducible bugs. Maybe that possibility can be reduced by
   putting some restrictions on the interface, but I don't know. It
   still scares me.

Hmm... IIUC, the biggest drawback of #2 is the added TLB pressure,
right? What if we reserve percpu allocations in 2MB chunks? ie. use
4k mappings but always allocate the percpu pages from aligned 2MB
chunks. That way it won't waste 2MB per cpu, and although it will use
additional 4K TLB entries, it will free up 2MB TLB entries.

Thanks.

--
tejun
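A rough shape for alternative 2, with a hypothetical map_range()
standing in for the real page-table plumbing (everything here is a
sketch, not an existing interface):

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/topology.h>

/* hypothetical: install nr pages at vaddr in the kernel page tables */
extern int map_range(void *vaddr, struct page **pages, int nr);

/*
 * Populate [off, off + size) of each cpu's reserved address range
 * with node-local 4K pages.  Congruency holds because every cpu uses
 * the same offset into its own reserved region.
 */
static int pcpu_populate(void **cpu_base, size_t off, size_t size)
{
        int cpu, i, nr = PAGE_ALIGN(size) >> PAGE_SHIFT;

        for_each_possible_cpu(cpu) {
                for (i = 0; i < nr; i++) {
                        struct page *page;

                        page = alloc_pages_node(cpu_to_node(cpu),
                                                GFP_KERNEL, 0);
                        if (!page)
                                return -ENOMEM; /* real code would unwind */
                        if (map_range(cpu_base[cpu] + off + i * PAGE_SIZE,
                                      &page, 1))
                                return -ENOMEM;
                }
        }
        return 0;
}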
* Re: regarding the x86_64 zero-based percpu patches
From: Rusty Russell @ 2009-01-15 1:47 UTC
To: Tejun Heo
Cc: Eric W. Biederman, Christoph Lameter, Ingo Molnar, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Wednesday 14 January 2009 14:28:56 Tejun Heo wrote:
> The main problem is that the area needs to be congruent, which
> basically mandates that it be contiguous.

I want to explore this assumption a little.

Logically, yes: if 50% of pages are free and we have 4096 cpus, the
chance that a page is free on all CPUs is 1 in 2^4095. But maybe such
systems are fine with 2M pages for per-cpu areas at boot?

And can page-mobility tricks help us make the odds reasonable here?
Only allowing movable pages in our expansion-of-percpu area?

Thanks,
Rusty.
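Spelling the arithmetic out (one way to read it): if a fraction f of
page frames is free on each cpu independently, one particular
congruent frame is free on all N cpus with probability f^N; fixing a
free frame on the first cpu still needs the matching frame free on the
other N-1, which at f = 1/2 and N = 4096 gives the 1 in 2^4095 above.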
* Re: regarding the x86_64 zero-based percpu patches
From: Rusty Russell @ 2009-01-15 1:49 UTC
To: Eric W. Biederman
Cc: Tejun Heo, Christoph Lameter, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Tuesday 13 January 2009 14:37:38 Eric W. Biederman wrote:
> It isn't incompatible with a per-cpu virtual mapping. It allows the
> possibility of each cpu reusing the same chunk of virtual address
> space for per cpu memory.

This can be done (IA64 does it today), but it's not generically
useful. You can use it to frob a few simple values, but it means you
can't store any pointers, and that just doesn't fly in general kernel
code.

> I think it would be nice if the percpu area could grow rather than
> being a fixed size at boot time, but I'm not particularly convinced
> it has to.

I used to be convinced it had to grow, but Christoph showed otherwise.
Nonetheless, it's an annoying restriction which is going to bite us in
the ass repeatedly as coders use per_cpu on random sizes.

Rusty.
* Re: regarding the x86_64 zero-based percpu patches
From: Christoph Lameter @ 2009-01-15 20:26 UTC
To: Rusty Russell
Cc: Eric W. Biederman, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Thu, 15 Jan 2009, Rusty Russell wrote:
> On Tuesday 13 January 2009 14:37:38 Eric W. Biederman wrote:
> > It isn't incompatible with a per-cpu virtual mapping. It allows the
> > possibility of each cpu reusing the same chunk of virtual address
> > space for per cpu memory.
>
> This can be done (IA64 does it today), but it's not generically
> useful. You can use it to frob a few simple values, but it means you
> can't store any pointers, and that just doesn't fly in general kernel
> code.

Well, if we can have some surety that we are not going to store
pointers to percpu data anywhere, then this would work.

> I used to be convinced it had to grow, but Christoph showed
> otherwise. Nonetheless, it's an annoying restriction which is going
> to bite us in the ass repeatedly as coders use per_cpu on random
> sizes.

Not exactly. I implemented a minimal version that had only limited
use. I was fully intending to add further bloat at the end to make
the percpu areas dynamically extendable. Most of the early cpu_alloc
patchsets already include that code.
* Re: regarding the x86_64 zero-based percpu patches
From: Rusty Russell @ 2009-01-15 1:34 UTC
To: Eric W. Biederman
Cc: Christoph Lameter, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Tuesday 13 January 2009 04:14:58 Eric W. Biederman wrote:
> 2M of per cpu data doesn't make sense, and likely indicates a design
> flaw somewhere. It just doesn't make sense to have large amounts of
> data allocated per cpu.
>
> The most common user of per cpu data I am aware of is allocating one
> word per cpu for counters.

This is why I did a brief audit. Here it is:

With x86/32 allyesconfig (trimmed a little, until it booted under kvm)
we have 37148 bytes of static percpu data, and 117228 bytes of dynamic
percpu data.

File and line                       Number   Size   Total
net/ipv4/af_inet.c:1287                 21   2048   43008
net/ipv4/af_inet.c:1290                 21   2048   43008
kernel/workqueue.c:819                  72    128    9216
net/ipv4/af_inet.c:1287                 48    128    6144
net/ipv4/af_inet.c:1290                 48    128    6144
net/ipv4/route.c:3258                    1   4096    4096
include/linux/genhd.h:271               72     40    2880
lib/percpu_counter.c:77                194      4     776
net/ipv4/af_inet.c:1287                  1    288     288
net/ipv4/af_inet.c:1290                  1    288     288
net/ipv4/af_inet.c:1287                  1    256     256
net/ipv4/af_inet.c:1290                  1    256     256
net/core/neighbour.c:1424                4     44     176
kernel/kexec.c:1143                      1    176     176
net/ipv4/af_inet.c:1287                  1    104     104
net/ipv4/af_inet.c:1290                  1    104     104
arch/x86/.../acpi-cpufreq.c:528         96      1      96
arch/x86/acpi/cstate.c:153               1     64      64
net/.../nf_conntrack_core.c:1209         1     60      60

Others: 178

This is why my patch series adds "big_percpu_alloc" (basically
identical to the current code) for the bigger/unbounded users.

I don't think moving per-cpu areas is going to fly. We do put complex
data structures in there, and you're going to need preempt_disable()
on all per-cpu ops on many archs to make it work (assuming you use
stop_machine to do the realloc). Even a rough audit quickly becomes
overwhelming: 20 of the first 1/4 of the DECLARE_PER_CPUs are
non-movable data structures.

Cheers,
Rusty.
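The split Rusty describes might look something like this at the API
level -- illustrative signatures only, not his actual patch:

#include <linux/types.h>

/* illustrative: bounded allocations served from the static per-cpu
 * area, addressed with the fast zero-based scheme */
extern void *small_percpu_alloc(size_t size, size_t align);
/* illustrative: bigger or unbounded users fall back to today's
 * pointer-indirection alloc_percpu()-style scheme */
extern void *big_percpu_alloc(size_t size, size_t align);

#define PCPU_SMALL_LIMIT 128    /* illustrative threshold, see table */

static void *my_percpu_alloc(size_t size, size_t align)
{
        if (size <= PCPU_SMALL_LIMIT)
                return small_percpu_alloc(size, align);
        return big_percpu_alloc(size, align);
}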
* Re: regarding the x86_64 zero-based percpu patches
From: Ingo Molnar @ 2009-01-15 13:55 UTC
To: Rusty Russell
Cc: Eric W. Biederman, Christoph Lameter, Tejun Heo, travis, Linux
    Kernel Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

* Rusty Russell <rusty@rustcorp.com.au> wrote:

> This is why I did a brief audit. Here it is:
>
> With x86/32 allyesconfig (trimmed a little, until it booted under
> kvm) we have 37148 bytes of static percpu data, and 117228 bytes of
> dynamic percpu data.
>
> [...]
>
> This is why my patch series adds "big_percpu_alloc" (basically
> identical to the current code) for the bigger/unbounded users.
>
> I don't think moving per-cpu areas is going to fly. We do put complex
> data structures in there, and you're going to need preempt_disable()
> on all per-cpu ops on many archs to make it work (assuming you use
> stop_machine to do the realloc). Even a rough audit quickly becomes
> overwhelming: 20 of the first 1/4 of the DECLARE_PER_CPUs are
> non-movable data structures.

Why do we have to move them? Even on an allyesconfig the total ~150K
size seems to be peanuts - compared to the ~+4MB CONFIG_MAXSMP
.data/.bss bloat. I must be missing something ...

	Ingo
* Re: regarding the x86_64 zero-based percpu patches
From: Christoph Lameter @ 2009-01-15 20:27 UTC
To: Rusty Russell
Cc: Eric W. Biederman, Tejun Heo, Ingo Molnar, travis, Linux Kernel
    Mailing List, H. Peter Anvin, Andrew Morton, steiner, Hugh Dickins

On Thu, 15 Jan 2009, Rusty Russell wrote:
> I don't think moving per-cpu areas is going to fly. We do put complex
> data structures in there, and you're going to need preempt_disable()
> on all per-cpu ops on many archs to make it work (assuming you use
> stop_machine to do the realloc). Even a rough audit quickly becomes
> overwhelming: 20 of the first 1/4 of the DECLARE_PER_CPUs are
> non-movable data structures.

Ok, then let's go for dynamically growing per-cpu areas using 2M
virtual mappings... At least on 64 bit that should be fine.