* RHEL 6.5 shared library upgrade safety
@ 2014-08-18 11:57 Loic Dachary
  2014-08-18 12:11 ` Wido den Hollander
  2014-08-18 15:17 ` Sage Weil
  0 siblings, 2 replies; 6+ messages in thread
From: Loic Dachary @ 2014-08-18 11:57 UTC (permalink / raw)
  To: Ceph Development

Hi Ceph,

In RHEL 6.5, is the following scenario possible:

a) an OSD dlopens a shared library for erasure-code,
b) the shared library file is replaced while the OSD is running,
c) the OSD starts using the new file instead of the old one.

It seems unlikely but it would explain a weird stack trace at http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)

Cheers
-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: RHEL 6.5 shared library upgrade safety
  2014-08-18 11:57 RHEL 6.5 shared library upgrade safety Loic Dachary
@ 2014-08-18 12:11 ` Wido den Hollander
  2014-08-18 14:06   ` Loic Dachary
  2014-08-18 15:17 ` Sage Weil
  1 sibling, 1 reply; 6+ messages in thread
From: Wido den Hollander @ 2014-08-18 12:11 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

On 08/18/2014 01:57 PM, Loic Dachary wrote:
> Hi Ceph,
>
> In RHEL 6.5, is the following scenario possible:
>
> a) an OSD dlopens a shared library for erasure-code,
> b) the shared library file is replaced while the OSD is running,
> c) the OSD starts using the new file instead of the old one.
>
> It seems unlikely but it would explain a weird stack trace at http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)
>

Well, that could happen. I'm not 100% sure, but AFAIK when you replace a
library, parts of it might not be in memory yet.

See: 
http://stackoverflow.com/questions/7767325/replacing-shared-object-so-file-while-main-program-is-running
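
Whether a running process sees new bytes depends largely on how the file is replaced. The sketch below is only an illustration with a plain shared mmap, not the dynamic loader itself; the file name and contents are hypothetical. It contrasts replacing a file by rename (new inode) with overwriting it in place (same inode).

```python
import mmap
import os
import tempfile


def mapped_view_after_replace(in_place: bool) -> bytes:
    """Map a file, replace its contents, and return what the old mapping sees."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "libexample.so")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"OLD!" * 1024)

        fd = os.open(path, os.O_RDONLY)
        try:
            # Shared read-only mapping of the original inode.
            view = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
            if in_place:
                # Same inode: rewrite the bytes of the existing file.
                with open(path, "r+b") as f:
                    f.write(b"NEW!" * 1024)
            else:
                # New inode: write a temporary file, then rename it over the path.
                new_path = path + ".new"
                with open(new_path, "wb") as f:
                    f.write(b"NEW!" * 1024)
                os.replace(new_path, path)
            seen = view[:4]
            view.close()
            return seen
        finally:
            os.close(fd)


if __name__ == "__main__":
    print("rename   :", mapped_view_after_replace(in_place=False))
    print("in place :", mapped_view_after_replace(in_place=True))
```

With a rename, the existing mapping keeps pointing at the old inode. With an in-place overwrite, the mapping is backed by the modified file. dlopen() uses private mappings, for which visibility of later file changes is unspecified, so an in-place overwrite can leave a process with a mix of old and new pages.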

> Cheers
>


-- 
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


* Re: RHEL 6.5 shared library upgrade safety
  2014-08-18 12:11 ` Wido den Hollander
@ 2014-08-18 14:06   ` Loic Dachary
  0 siblings, 0 replies; 6+ messages in thread
From: Loic Dachary @ 2014-08-18 14:06 UTC (permalink / raw)
  To: Wido den Hollander, Ceph Development

Hi Wido,

On 18/08/2014 14:11, Wido den Hollander wrote:
> On 08/18/2014 01:57 PM, Loic Dachary wrote:
>> Hi Ceph,
>>
>> In RHEL 6.5, is the following scenario possible:
>>
>> a) an OSD dlopens a shared library for erasure-code,
>> b) the shared library file is replaced while the OSD is running,
>> c) the OSD starts using the new file instead of the old one.
>>
>> It seems unlikely but it would explain a weird stack trace at http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)
>>
> 
> Well, that could happen. I'm not 100% sure, but AFAIK when you replace a library, parts of it might not be in memory yet.
> 
> See: http://stackoverflow.com/questions/7767325/replacing-shared-object-so-file-while-main-program-is-running

As it turns out, the problem is simpler, but I still have no clue how it can happen.

http://tracker.ceph.com/issues/9153 shows

 ceph version 0.80.5-164-gcc4e625 (cc4e6258d67fb16d4a92c25078a0822a9849cd77)
 1: ceph-osd() [0x9b58c1]
 2: (()+0xf710) [0x7f06a3e24710]
 3: (memcpy()+0x15b) [0x7f06a2d4daab]
 4: (jerasure_matrix_dotprod()+0xc8) [0x7f067fd11618]
 5: (jerasure_matrix_encode()+0x75) [0x7f067fd11865]
 6: (ErasureCodeJerasureReedSolomonVandermonde::jerasure_encode(char**, char**, int)+0x21) [0x7f067fd294b1]
 7: (ErasureCodeJerasure::encode_chunks(std::set<int, std::less<int>, std::allocator<int> > const&, std::map<int, ceph::buffer::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::list> > >*)+0x607) [0x7f067fd2a807]

This means a firefly ceph-osd crashed while trying to use a jerasure plugin built from master, which is no surprise: the plugin API is incompatible, although the data encoding / decoding is compatible.
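
One generic way to turn this kind of crash into a clean error is a version handshake at load time. The sketch below is not how ceph-osd loads plugins; it uses Python modules as stand-ins for shared libraries, and `HOST_API_VERSION`, `PLUGIN_API_VERSION` and `load_plugin` are hypothetical names chosen for illustration.

```python
import importlib.util
import os
import tempfile

HOST_API_VERSION = "firefly"  # hypothetical identifier of the host's plugin API


class PluginVersionMismatch(RuntimeError):
    """Raised when a plugin was built against a different plugin API."""


def load_plugin(path: str, expected: str = HOST_API_VERSION):
    """Load a plugin module and refuse it unless its declared API version matches."""
    spec = importlib.util.spec_from_file_location("ec_plugin", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    declared = getattr(module, "PLUGIN_API_VERSION", None)
    if declared != expected:
        raise PluginVersionMismatch(
            f"{path}: plugin declares {declared!r}, host expects {expected!r}"
        )
    return module


def write_plugin(directory: str, name: str, version: str) -> str:
    """Demo helper: create a trivial plugin file declaring the given version."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write(f"PLUGIN_API_VERSION = {version!r}\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        load_plugin(write_plugin(d, "ok.py", "firefly"))
        try:
            load_plugin(write_plugin(d, "bad.py", "master"))
        except PluginVersionMismatch as e:
            print("refused:", e)
```

The check runs before any plugin entry point is called, so an incompatible plugin is rejected with an error rather than failing later inside an encode call.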

Cheers

>> Cheers
>>
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: RHEL 6.5 shared library upgrade safety
  2014-08-18 11:57 RHEL 6.5 shared library upgrade safety Loic Dachary
  2014-08-18 12:11 ` Wido den Hollander
@ 2014-08-18 15:17 ` Sage Weil
  2014-08-18 15:25   ` Loic Dachary
  1 sibling, 1 reply; 6+ messages in thread
From: Sage Weil @ 2014-08-18 15:17 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Mon, 18 Aug 2014, Loic Dachary wrote:
> Hi Ceph,
> 
> In RHEL 6.5, is the following scenario possible:
> 
> a) an OSD dlopens a shared library for erasure-code,
> b) the shared library file is replaced while the OSD is running,
> c) the OSD starts using the new file instead of the old one.
> 
> It seems unlikely but it would explain a weird stack trace at 
> http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)

I think this is possible and likely.  We had similar problems with the 
rados classes and eventually just made them load all available plugins on 
startup (and also on demand in case one is installed later).

The simplest thing is probably to do that here as well...
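
A rough sketch of the "load everything at startup" idea, assuming plugins live in one directory and match a `libec_*.so` pattern (both are assumptions here, as is the `preload_plugins` name). It uses `ctypes.CDLL`, which goes through the platform's dlopen() on POSIX systems, and is not the actual rados class loader.

```python
import ctypes
import glob
import os


def preload_plugins(directory: str, pattern: str = "libec_*.so"):
    """dlopen every matching library now and report which ones failed to load."""
    loaded, failed = {}, {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        try:
            # Keep the handle so the library stays mapped for the process lifetime.
            loaded[path] = ctypes.CDLL(path)
        except OSError as e:
            failed[path] = str(e)
    return loaded, failed


if __name__ == "__main__":
    # Hypothetical plugin directory; adjust to the real installation path.
    ok, bad = preload_plugins("/usr/lib64/ceph/erasure-code")
    print("loaded:", list(ok))
    print("failed:", bad)
```

Loading early means a library that is later replaced by rename is still served from the old inode. It narrows the window but does not by itself protect against a file that is overwritten in place.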

sage


* Re: RHEL 6.5 shared library upgrade safety
  2014-08-18 15:17 ` Sage Weil
@ 2014-08-18 15:25   ` Loic Dachary
  2014-08-18 15:32     ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Loic Dachary @ 2014-08-18 15:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development


On 18/08/2014 17:17, Sage Weil wrote:
> On Mon, 18 Aug 2014, Loic Dachary wrote:
>> Hi Ceph,
>>
>> In RHEL 6.5, is the following scenario possible:
>>
>> a) an OSD dlopens a shared library for erasure-code,
>> b) the shared library file is replaced while the OSD is running,
>> c) the OSD starts using the new file instead of the old one.
>>
>> It seems unlikely but it would explain a weird stack trace at 
>> http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)
> 
> I think this is possible and likely.  We had similar problems with the 
> rados classes and eventually just made them load all available plugins on 
> startup (and also on demand in case one is installed later).
> 
> The simplest thing is probably to do that here as well...

This will not solve the upgrade problem for Firefly daemons which are already running, unfortunately. Stopping the daemons while the package is being upgraded seems safer and more generic (see the other thread). Or are there issues with this approach?

Cheers

> 
> sage

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: RHEL 6.5 shared library upgrade safety
  2014-08-18 15:25   ` Loic Dachary
@ 2014-08-18 15:32     ` Sage Weil
  0 siblings, 0 replies; 6+ messages in thread
From: Sage Weil @ 2014-08-18 15:32 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Mon, 18 Aug 2014, Loic Dachary wrote:
> On 18/08/2014 17:17, Sage Weil wrote:
> > On Mon, 18 Aug 2014, Loic Dachary wrote:
> >> Hi Ceph,
> >>
> >> In RHEL 6.5, is the following scenario possible:
> >>
> >> a) an OSD dlopens a shared library for erasure-code,
> >> b) the shared library file is replaced while the OSD is running,
> >> c) the OSD starts using the new file instead of the old one.
> >>
> >> It seems unlikely but it would explain a weird stack trace at 
> >> http://tracker.ceph.com/issues/9153#note-5 so I'm double checking ;-)
> > 
> > I think this is possible and likely.  We had similar problems with the 
> > rados classes and eventually just made them load all available plugins on 
> > startup (and also on demand in case one is installed later).
> > 
> > The simplest thing is probably to do that here as well...
> 
> This will not solve the upgrade problem for Firefly daemons which are 
> already running, unfortunately. Stopping the daemons while the 
> package is being upgraded seems safer and more generic (see the other 
> thread). Or are there issues with this approach?

Operationally it is not something people want to do.  Usually admins 
upgrade and then do the restarts in a controlled way.  At least, that's 
what I've heard anecdotally.

FWIW the crash is also something that testing turns up but is unlikely to 
happen in production.  In testing, the workload is just starting when we 
start upgrading so the plugins haven't always loaded.  In production, it 
is unlikely that a user will be *just* starting to use the EC features 
right as they are also doing an upgrade.  Unless they forgot to restart 
daemons...
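
For the "forgot to restart" case, one possible check on Linux is to look for mapped libraries that /proc/<pid>/maps marks as "(deleted)". This is a minimal sketch; the function names and the sample path are hypothetical. It only catches files replaced by unlink or rename, not files overwritten in place.

```python
import os


def deleted_libraries(maps_text: str):
    """Return paths of mapped .so files that a /proc/<pid>/maps dump marks as deleted."""
    suffix = " (deleted)"
    found = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        fields = line.rstrip().split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        pathname = fields[5]
        if pathname.endswith(suffix):
            path = pathname[: -len(suffix)]
            if ".so" in os.path.basename(path):
                found.add(path)
    return sorted(found)


def stale_libraries_for_pid(pid: int):
    """Read the live maps file of a process and list its deleted libraries."""
    with open(f"/proc/{pid}/maps") as f:
        return deleted_libraries(f.read())


if __name__ == "__main__":
    print(stale_libraries_for_pid(os.getpid()))
```

A non-empty result for a daemon after a package upgrade suggests it is still running code from the old files and is a candidate for a controlled restart.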

sage
