Hard to debug problem with ceph_erasure_code

* Hard to debug problem with ceph_erasure_code
@ 2016-03-31 17:10 Willem Jan Withagen
  2016-04-01  5:12 ` Mykola Golub
  0 siblings, 1 reply; 4+ messages in thread
From: Willem Jan Withagen @ 2016-03-31 17:10 UTC (permalink / raw)
  To: Ceph Development

Hi,

I have this problem that testing ceph_erasure_code sometimes crashes in:

	ceph_erasure_code --debug-osd 20 --plugin_exists jerasure

If I just run this in a while loop on the command line, it crashes 
only once every few hundred runs.
Running it in the test suite, it crashes just about every time.

The crash from the core is an invalid pointer in the assertion code I 
added to log/Entry.h

#7  0x000000000077984d in ceph::log::Entry::hint_size (this=0x80405cf00) at log/Entry.h:70
70            assert( *m_exp_len != -1 );
(gdb) l
65        }
66
67        // function improves estimate for expected size of message
68        void hint_size() {
69          if (m_exp_len != NULL) {
70            assert( *m_exp_len != -1 );
71            assert( 0 <= *m_exp_len );
72            assert( *m_exp_len <= 100000 );
73            size_t size = m_streambuf.size();
74            if (size > __atomic_load_n(m_exp_len, __ATOMIC_RELAXED)) {
(gdb) p m_exp_len
$1 = (size_t *) 0x8045ec5c0
(gdb) p &m_exp_len
$2 = (size_t **) 0x80405cf90

The address in m_exp_len, 0x8045ec5c0, is outside of the heap, so 
dereferencing it is an illegal access.
And thus the program gets a SIGSEGV.
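
For reference, a minimal sketch of what a size-hint update like the one in
hint_size() could look like, assuming the hint is shared between log entries
and only ever raised. The name raise_size_hint is hypothetical, and
std::atomic is swapped in here for the __atomic_load_n builtin shown in the
listing. Note that a NULL check like this one does nothing against a stale
non-NULL pointer, which is the failure seen in the core.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical helper, not the Ceph code: raise a shared size estimate to
// `size` if `size` is larger than what is stored. A null hint is ignored.
// A dangling (non-null but invalid) pointer is NOT detected here.
inline void raise_size_hint(std::atomic<std::size_t>* hint, std::size_t size) {
  if (hint == nullptr) {
    return;
  }
  std::size_t seen = hint->load(std::memory_order_relaxed);
  // compare_exchange_weak reloads `seen` on failure, so the loop stops as
  // soon as the stored value is at least `size` or our store succeeded.
  while (size > seen &&
         !hint->compare_exchange_weak(seen, size, std::memory_order_relaxed)) {
  }
}
```

Since the estimate is a size_t, a check such as 0 <= *m_exp_len is always
true, and comparing against -1 compares against the largest size_t value.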

Now my problem is that I can run this under gdb and watch the memory, 
but then it rarely goes wrong.
Running it from 'make recheck' goes wrong just about every time, but it 
will be hard to run that through gdb and actually catch the code that is 
writing the illegal address into m_exp_len.
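
The command-line loop could also be driven by a small program that stops at
the first run killed by a signal, so the core of that run can be inspected.
This is only a sketch under assumptions: run_until_crash and CrashResult are
hypothetical names, it relies on POSIX fork/execvp/waitpid, and it does not
reproduce whatever environment 'make recheck' sets up.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <string>
#include <vector>

struct CrashResult {
  int run;     // 1-based index of the run that was killed, 0 if none
  int signal;  // signal number that killed it, 0 if none
};

// Hypothetical helper: run the command up to max_runs times and stop at the
// first run that is terminated by a signal (for example SIGSEGV).
CrashResult run_until_crash(const std::vector<std::string>& args, int max_runs) {
  if (args.empty()) {
    return {0, 0};
  }
  std::vector<char*> argv;
  for (const std::string& a : args) {
    argv.push_back(const_cast<char*>(a.c_str()));
  }
  argv.push_back(nullptr);

  for (int run = 1; run <= max_runs; ++run) {
    pid_t pid = fork();
    if (pid < 0) {
      return {0, 0};  // fork failed, give up
    }
    if (pid == 0) {
      execvp(argv[0], argv.data());
      _exit(127);  // only reached if exec failed
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
      return {0, 0};
    }
    if (WIFSIGNALED(status)) {
      return {run, WTERMSIG(status)};
    }
  }
  return {0, 0};
}
```

It would be called with the failing command split into arguments, for
example {"ceph_erasure_code", "--debug-osd", "20", "--plugin_exists",
"jerasure"}, and a run limit well above the few hundred runs it takes here.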

Does anybody have suggestions as how to track/debug this?

Thanx,
--WjW


* Re: Hard to debug problem with ceph_erasure_code
  2016-03-31 17:10 Hard to debug problem with ceph_erasure_code Willem Jan Withagen
@ 2016-04-01  5:12 ` Mykola Golub
  2016-04-01  9:34   ` Willem Jan Withagen
  0 siblings, 1 reply; 4+ messages in thread
From: Mykola Golub @ 2016-04-01  5:12 UTC (permalink / raw)
  To: Willem Jan Withagen; +Cc: Ceph Development

On Thu, Mar 31, 2016 at 07:10:45PM +0200, Willem Jan Withagen wrote:

> Does anybody have suggestions as how to track/debug this?

valgrind?

-- 
Mykola Golub


* Re: Hard to debug problem with ceph_erasure_code
  2016-04-01  5:12 ` Mykola Golub
@ 2016-04-01  9:34   ` Willem Jan Withagen
  2016-04-01 12:22     ` Willem Jan Withagen
  0 siblings, 1 reply; 4+ messages in thread
From: Willem Jan Withagen @ 2016-04-01  9:34 UTC (permalink / raw)
  To: Mykola Golub; +Cc: Ceph Development

On 1-4-2016 07:12, Mykola Golub wrote:
> On Thu, Mar 31, 2016 at 07:10:45PM +0200, Willem Jan Withagen wrote:
> 
>> Does anybody have suggestions as how to track/debug this?
> 
> valgrind?
> 

Yup, tried that one, but it is sort of hard to find an intermittent
erroneous write. I tried --track-addr=<addr of m_exp_len>, but most of
the time it is only written at exactly the code line where it is supposed
to be written. So no info there.

So perhaps I need a different set of tests?

On average I need about 600 runs to catch one SIGSEGV.

BTW: tried it on 2 FreeBSD systems, and on both the behaviour is
identical. So it has got to be the code. And since 65000 runs on Linux
give no errors, it is also specific to the combo
FreeBSD/Clang/FreeBSD-packages.

--WjW


* Re: Hard to debug problem with ceph_erasure_code
  2016-04-01  9:34   ` Willem Jan Withagen
@ 2016-04-01 12:22     ` Willem Jan Withagen
  0 siblings, 0 replies; 4+ messages in thread
From: Willem Jan Withagen @ 2016-04-01 12:22 UTC (permalink / raw)
  To: Mykola Golub; +Cc: Ceph Development

On 1-4-2016 11:34, Willem Jan Withagen wrote:
> On 1-4-2016 07:12, Mykola Golub wrote:
>> On Thu, Mar 31, 2016 at 07:10:45PM +0200, Willem Jan Withagen wrote:
>>
>>> Does anybody have suggestions as how to track/debug this?
>>
>> valgrind?
>>
> 
> Yup, tried that one, but it is sort of hard to find an intermittent
> erroneous write. I tried --track-addr=<addr of m_exp_len>, but most of
> the time it is only written at exactly the code line where it is supposed
> to be written. So no info there.
> 
> So perhaps I need a different set of tests?
> 
> On average I need about 600 runs to catch one SIGSEGV.
> 
> BTW: tried it on 2 FreeBSD systems, and on both the behaviour is
> identical. So it has got to be the code. And since 65000 runs on Linux
> give no errors, it is also specific to the combo
> FreeBSD/Clang/FreeBSD-packages.

And it gets even weirder.
Clang allows the use of AddressSanitizer
( https://github.com/google/sanitizers/wiki/AddressSanitizer )

So I've compiled all of ceph with
    -fsanitize=address -fno-omit-frame-pointer
And now I've already logged 50,000 runs without crashes.

Feels a lot like Schrödinger's cat....

--WjW


