From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4C35A2B3.9060606@domain.hid>
Date: Thu, 08 Jul 2010 12:04:35 +0200
From: Gilles Chanteperdrix
MIME-Version: 1.0
References: <4C34438D.9020905@domain.hid> <4C34EF76.2040602@domain.hid>
 <4C3508E1.7090100@domain.hid> <1278578261.1810.67.camel@domain.hid>
 <4C359326.1090509@domain.hid> <1278581479.1810.111.camel@domain.hid>
 <4C359BE7.6080608@domain.hid> <1278583089.1810.131.camel@domain.hid>
In-Reply-To: <1278583089.1810.131.camel@domain.hid>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai-help] native: A 32k stack is not always a 'reasonable' size
List-Id: Help regarding installation and common use of Xenomai
To: Philippe Gerum
Cc: xenomai-help

Philippe Gerum wrote:
> On Thu, 2010-07-08 at 11:35 +0200, Gilles Chanteperdrix wrote:
>> Philippe Gerum wrote:
>>> On Thu, 2010-07-08 at 10:58 +0200, Gilles Chanteperdrix wrote:
>>>> Philippe Gerum wrote:
>>>>> On Thu, 2010-07-08 at 01:08 +0200, Gilles Chanteperdrix wrote:
>>>>>> Peter Soetens wrote:
>>>>>>> On Wed, Jul 7, 2010 at 11:19 PM, Gilles Chanteperdrix
>>>>>>> wrote:
>>>>>>>> Peter Soetens wrote:
>>>>>>>>> On Wed, Jul 7, 2010 at 11:06 AM, Gilles Chanteperdrix
>>>>>>>>> wrote:
>>>>>>>>>> Peter Soetens wrote:
>>>>>>>>>>> At least, not for Orocos applications. We've had hard to debug
>>>>>>>>>>> application segfaults that used just a 'little' bit more than 32k. We
>>>>>>>>>>> had to raise the stack size to 128k to get reliably through our
>>>>>>>>>>> application startup. I stem from the old 'mlockall ate my RAM'
>>>>>>>>>>> generation where we typically reduced stack sizes in order to have
>>>>>>>>>>> some crumbles left for the heap. But 32k wasn't really what we were
>>>>>>>>>>> aiming for.
>>>>>>>>>>>
>>>>>>>>>>> Maybe we should explicitly document the 32k limit and its limitations
>>>>>>>>>>> for certain applications...?
>>>>>>>>>> Again, things have been fixed in 2.5.3 with regard to stack sizes, could
>>>>>>>>>> you check that you have the same behaviour?
>>>>>>>>> I think we had, but I'm uncertain right now.
>>>>>>>>>
>>>>>>>>>> As for 32 KiB, it is only a default stack size, it is only reasonable in
>>>>>>>>>> the sense that 2 MiB is unreasonable on a low-end system. 32 KiB was
>>>>>>>>>> picked because it allows printf to work. Now, whatever stack size we
>>>>>>>>>> choose, there will be applications which need more, this does not really
>>>>>>>>>> make the default unreasonable.
>>>>>>>>> I knew you would say that. It deserves an entry in the FAQ or some
>>>>>>>>> troubleshooting document though.
>>>>>>>> It is documented. For instance, rt_task_create says:
>>>>>>>> stksize The size of the stack (in bytes) for the new task. If
>>>>>>>> zero is passed, a reasonable pre-defined size will be substituted.
>>>>>>>>
>>>>>>>> What else can we say? Documenting that this size is 32 KiB would be
>>>>>>>> wrong, because we do not want applications to rely on a particular
>>>>>>>> value, in case we want to change it. And the fact that if your stack is
>>>>>>>> too small, you will get problems is kind of obvious. For anyone having
>>>>>>>> played with stack sizes with Linux or any proprietary RTOS, at least.
>>>>>>> And what about new RTOS/Xenomai users?
>>>>>>>
>>>>>>> You have to take the user perspective here. The problem with stack
>>>>>>> overflows is that they occur when the development of a program has
>>>>>>> progressed a while and applications reached a certain level of
>>>>>>> complexity (otherwise the overflow wouldn't have happened in the first
>>>>>>> place). So it suddenly starts to segfault (from time to time). What he
>>>>>>> does is this: he fires up the debugger to get a backtrace, sees
>>>>>>> trouble and wrongly assumes that gdb can't really handle these Xenomai
>>>>>>> threads and tries to eliminate causes of the crashes..
>>>>>> Last time I tried, debugging a stack overflow with gdb was possible. You
>>>>>> can print the stack pointer and compare the value with the contents of
>>>>>> /proc/pid/maps.
>>>>>>
>>>>>>> The user comes
>>>>>>> quickly to the conclusion that 'putting it all together' causes the
>>>>>>> crash (the single unit tests pass) and is looking for a software
>>>>>>> integration problem. In reality, it's the stack.
>>>>>>>
>>>>>>> If you've been through all this and then came to the correct
>>>>>>> conclusion the same day, you've been burnt before, or are the
>>>>>>> exception.
>>>>>>>
>>>>>>> In my view, 32k is a premature optimization. At least, it shows the
>>>>>>> side effects of one.
>>>>>> I guess you run Xenomai on one of these big irons, do you? Because if
>>>>>> you ran on a low-end machine, you would have understood why we cannot
>>>>>> keep the 2MB default limit. 32 KiB looks already like a pretty large
>>>>>> limit, so, maybe there is a problem in your application?
>>>>>>
>>>>>> The I-pipe patch for ARM detects stack overflows, I guess we can modify
>>>>>> the kernel to do the same thing on all architectures.
>>>>>>
>>>>> Peter made a good point considering the various braindamage outcomes a
>>>>> stack smashing issue could trigger. I'm unsure whether anyone can
>>>>> immediately suspect a stack overflow to be the cause of any random
>>>>> application behavior; typically, that issue could cause a branch to any
>>>>> random IP value on x86 since the return address is living on the stack
>>>>> and could get trashed, but not necessarily on architectures with
>>>>> branch-and-link registers. In the former case, GDB is of little help,
>>>>> except for single-stepping until the offending statement is reached and
>>>>> we can observe the trashing live, which means that we actually did the
>>>>> work of spotting the issue manually.
>>>>>
>>>>> It turns out that people with large applications and lots of contexts
>>>>> often end up naked in the cold most of the time when facing those
>>>>> things, and the only option left to them is to go backward on the
>>>>> integration path, in order to find a possibly faulty component. Before
>>>>> people can reasonably compare %sp values, they need some help to narrow
>>>>> the search, otherwise, it's hopeless.
>>>>>
>>>>> To this end, maybe an option would be to enable gcc's
>>>>> -fstack-protector[-all] -fstack-check when the debug switch is given to
>>>>> the configure script, provided the compiler in use supports this.
>>>>>
>>>>> Granted, a stack overflow is not identical to a smashing, but quite
>>>>> often the stack memory unduly consumed by a thread belongs to some other
>>>>> memory object, and therefore usually gets trashed when that object is
>>>>> modified. At least, enabling some canary word checking in that case may
>>>>> help.
>>>> I do not think so. The glibc maps an unreadable/unwritable page below
>>>> the stack. So, what you get is a segmentation fault. Unless, of course,
>>>> you overflow more than one page. But we can map more than one page by
>>>> using pthread_attr_setguardsize, if one page is not enough.
>>> The page guard is restricted to MMU-enabled systems, we have two out of
>>> six of our architectures running without MMU. In this case, the only
>>> option left that may work is the stack protector based on the canary
>>> word checking.
>>>
>>> Relying on pthread_attr_setguardsize() when available will trigger the
>>> same amount of uncertainty as we have now with setting the minimum
>>> stack size. Which guard value would be a sane default? One, two, four
>>> pages?
>>>
>>>> We can detect the stack overflow in kernel-space, there it is easy to
>>>> detect, the problem is that x86 users, which are the ones more likely to
>>>> be hit by a stack overflow, may not be watching the console, so may not
>>>> see the message.
>>>>
>>> Kernel-space is another issue, people writing applications in kernel
>>> space are mostly on their own these days, and others implementing
>>> drivers are expected to always consider stack space as a scarce resource
>>> anyway. But helping with solving userland problems seems to be the most
>>> urgent thing to do, since common practices in that environment may
>>> conflict badly with real-time restrictions and requirements.
>> I mean detecting the user-space stack overflows when handling user-space
>> page faults in kernel-space. But granted, that also only works for
>> systems with an MMU. The following piece of code does it in the I-pipe
>> patch for ARM with FCSE enabled:
>>
>> +	down_read(&mm->mmap_sem);
>> +	if (find_vma(mm, addr) == find_vma(mm, regs->ARM_sp))
>> +		printk(KERN_INFO "FCSE: process %u(%s) probably overflowed stack at 0x%08lx.\n",
>> +		       current->pid, current->comm, regs->ARM_pc);
>> +	up_read(&mm->mmap_sem);
>>
>
> My understanding is that such code detects faulty references within the
> _valid_ address space, typically when hitting a page guard area. But I
> guess that this won't work when treading on stack memory outside of the
> address space, e.g. below the red zone for instance, will it? AFAIU,
> those things may happen when the heading space of preposterously large
> stack-based objects are addressed.

Yes, exactly, but that would have been enough to detect Peter's problem.

I thought gcc had an option to yell when the objects on stack grow
beyond some size, but I cannot find it.

-- 
Gilles.
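
Editor's illustration: the gdb advice quoted earlier in the thread (print
the stack pointer and compare it with /proc/pid/maps) can also be done from
inside the program. The sketch below is only an illustration for Linux,
not code from Xenomai or the I-pipe patch. The names mapping_of() and
stack_headroom() are hypothetical helpers invented here, and the address of
a local variable stands in for the real stack pointer.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: find the /proc/self/maps entry that contains addr.
 * Returns 1 and fills *start and *end on success, 0 if addr is not mapped
 * or the file cannot be read. Linux only.
 */
int mapping_of(const void *addr, uintptr_t *start, uintptr_t *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[1024];
    int found = 0;

    if (!f)
        return 0;

    while (fgets(line, sizeof(line), f)) {
        uintptr_t lo, hi;

        /* Each entry starts with "<start>-<end>" in hexadecimal. */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &lo, &hi) != 2)
            continue;
        if ((uintptr_t)addr >= lo && (uintptr_t)addr < hi) {
            *start = lo;
            *end = hi;
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

/*
 * Hypothetical helper: bytes between the caller's current stack position
 * and the lowest address of the mapping holding it, or -1 if the lookup
 * failed.
 */
long stack_headroom(void)
{
    char marker; /* its address approximates the stack pointer */
    uintptr_t lo, hi;

    if (!mapping_of(&marker, &lo, &hi))
        return -1;
    return (long)((uintptr_t)&marker - lo);
}
```

Calling stack_headroom() from the deepest call path of a task gives a rough
idea of how close it runs to the bottom of its stack mapping. The figure is
most meaningful for thread stacks of fixed size; the main thread's [stack]
mapping can still grow, so the value there is only a lower bound. On the
compile-time side, GCC's -Wframe-larger-than=N warns when a single
function's frame exceeds N bytes, which may be the kind of option alluded
to above.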