* [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync @ 2014-09-19 9:49 Marc Schiffbauer 2014-09-19 14:48 ` Lars Ellenberg 0 siblings, 1 reply; 16+ messages in thread From: Marc Schiffbauer @ 2014-09-19 9:49 UTC (permalink / raw) To: drbd-dev Hi, about a year ago I encountered a problem with drbd: On long running re-syncs a refcounter overflow happens in the drbd module resulting in loss of network connection (and reconnect). I am running a linux kernel that is hardened with grsecurity and PaX. It has a feature to detect such recounter overflows (CONFIG_PAX_REFCOUNT) Now I encountered that same Probleem egain with a much newer kernel. There may be two causes that can trigger those cases: 1) real bug in a part of the kernel (drbd in that case) 2) false positive in PAX The developer of PAX had a look at this issue and assumes a real bug in drbd but asked me to ask the drbd developer for details. Please see [1]. Now today, with a newer kernel the issue looks like that: [63999.116870] PAX: refcount overflow detected in: drbd_r_ms03:6378, uid/euid: 0/0 [63999.116875] CPU: 0 PID: 6378 Comm: drbd_r_ms03 Not tainted 3.14.18-hardened-r2 #1 [63999.116876] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014 [63999.116878] task: ffff882f8b599010 ti: ffff882f8b599730 task.ti: ffff882f8b599730 [63999.116879] RIP: 0010:[<ffffffffa00663ca>] [<ffffffffa00663ca>] ffffffffa00663ca [63999.116882] RSP: 0000:ffffc90016483dd8 EFLAGS: 00000a02 [63999.116883] RAX: 0000000000000000 RBX: 000000027fd7cb00 RCX: ffff88306c17e6a0 [63999.116884] RDX: 0000000000000100 RSI: ffffffff818d2101 RDI: ffff882fb6c9d650 [63999.116884] RBP: ffff882f9c577010 R08: ffff88306c17e6a0 R09: ffff882fb6e76cc0 [63999.116885] R10: ffff882fb6e76cc0 R11: ffff882fb6e76cc0 R12: ffffc90016483e50 [63999.116886] R13: ffff882fb617f228 R14: ffff882f35028200 R15: ffff882fb617f000 [63999.116888] FS: 0000000000000000(0000) GS:ffff88307f200000(0000) knlGS:0000000000000000 [63999.116889] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63999.116889] CR2: 00000320a5073008 CR3: 000000000154d000 CR4: 00000000001607f0 [63999.116890] Stack: [63999.116891] ffff882f8b85d800 0000000000000018 0000000000000018 0000010000000018 [63999.116893] 0000000000000000 ffff882f8b85d800 0000000000000009 00000000000000d8 [63999.116895] 0000000000000018 0000000000000018 0000000000000000 ffffffffa0068134 [63999.116896] Call Trace: [63999.116907] [<ffffffffa0068134>] ? drbdd_init+0x147/0x1d7 [drbd] [63999.116913] [<ffffffffa006f617>] ? drbd_thread_setup+0x4e/0x117 [drbd] [63999.116917] [<ffffffffa006f5c9>] ? conn_destroy+0x86/0x86 [drbd] [63999.116922] [<ffffffff8107fbfc>] ? kthread+0xd5/0xdd [63999.116924] [<ffffffff8107fb27>] ? kthread_worker_fn+0xf9/0xf9 [63999.116929] [<ffffffff81535f74>] ? ret_from_fork+0x74/0xa0 [63999.116930] [<ffffffff8107fb27>] ? kthread_worker_fn+0xf9/0xf9 [63999.116931] Code: 48 89 de 4c 89 ff e8 c3 80 00 00 85 c0 0f 85 b2 00 00 00 8b 54 24 1c f0 41 01 97 d0 04 00 00 71 0a f0 41 29 97 d0 04 00 00 cd 04 <bb> 03 00 00 00 f0 41 ff 87 24 02 00 00 71 0a f0 41 ff 8f 24 02 and drbd itself says: [63999.116965] block drbd0: drbd_alloc_pages interrupted! [63999.116968] d-con ms03: error receiving RSDataRequest, e: -12 l: 0! [63999.116986] d-con ms03: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError ) [63999.117021] d-con ms03: asender terminated [63999.117025] d-con ms03: Terminating drbd_a_ms03 [63999.130575] d-con ms03: Connection closed [63999.130599] d-con ms03: conn( ProtocolError -> Unconnected ) [63999.130601] d-con ms03: receiver terminated [63999.130602] d-con ms03: Restarting receiver thread [63999.130603] d-con ms03: receiver (re)started [63999.130614] d-con ms03: conn( Unconnected -> WFConnection ) [64000.116691] d-con ms03: initial packet S crossed [64009.195530] d-con ms03: Handshake successful: Agreed network protocol version 101 [64009.195807] d-con ms03: Peer authenticated using 64 bytes HMAC [64009.195834] d-con ms03: conn( WFConnection -> WFReportParams ) [64009.195843] d-con ms03: Starting asender thread (from drbd_r_ms03 [6378]) [ ... and continues to sync ... ] Is this a real bug in drbd? Thanks -Marc [1] https://forums.grsecurity.net/viewtopic.php?f=3&t=3786&p=13558&hilit=REFCOUNT#p13558 -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-19 9:49 [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync Marc Schiffbauer @ 2014-09-19 14:48 ` Lars Ellenberg 2014-09-19 15:16 ` Marc Schiffbauer 0 siblings, 1 reply; 16+ messages in thread From: Lars Ellenberg @ 2014-09-19 14:48 UTC (permalink / raw) To: drbd-dev On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote: > Hi, > > about a year ago I encountered a problem with drbd: On long running > re-syncs a refcounter overflow happens in the drbd module resulting > in loss of network connection (and reconnect). > > I am running a linux kernel that is hardened with grsecurity and > PaX. It has a feature to detect such recounter overflows > (CONFIG_PAX_REFCOUNT) > > Now I encountered that same Probleem egain with a much newer kernel. > > There may be two causes that can trigger those cases: > 1) real bug in a part of the kernel (drbd in that case) > 2) false positive in PAX > > The developer of PAX had a look at this issue and assumes a real bug > in drbd but asked me to ask the drbd developer for details. > > Please see [1]. > > Now today, with a newer kernel the issue looks like that: > > [63999.116870] PAX: refcount overflow detected in: drbd_r_ms03:6378, uid/euid: 0/0 > [63999.116875] CPU: 0 PID: 6378 Comm: drbd_r_ms03 Not tainted 3.14.18-hardened-r2 #1 > [63999.116876] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014 > [63999.116878] task: ffff882f8b599010 ti: ffff882f8b599730 task.ti: ffff882f8b599730 > [63999.116879] RIP: 0010:[<ffffffffa00663ca>] [<ffffffffa00663ca>] ffffffffa00663ca > [63999.116882] RSP: 0000:ffffc90016483dd8 EFLAGS: 00000a02 > [63999.116883] RAX: 0000000000000000 RBX: 000000027fd7cb00 RCX: ffff88306c17e6a0 > [63999.116884] RDX: 0000000000000100 RSI: ffffffff818d2101 RDI: ffff882fb6c9d650 > [63999.116884] RBP: ffff882f9c577010 R08: ffff88306c17e6a0 R09: ffff882fb6e76cc0 > [63999.116885] R10: ffff882fb6e76cc0 R11: ffff882fb6e76cc0 R12: ffffc90016483e50 > [63999.116886] R13: ffff882fb617f228 R14: ffff882f35028200 R15: ffff882fb617f000 > [63999.116888] FS: 0000000000000000(0000) GS:ffff88307f200000(0000) knlGS:0000000000000000 > [63999.116889] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [63999.116889] CR2: 00000320a5073008 CR3: 000000000154d000 CR4: 00000000001607f0 > [63999.116890] Stack: > [63999.116891] ffff882f8b85d800 0000000000000018 0000000000000018 0000010000000018 > [63999.116893] 0000000000000000 ffff882f8b85d800 0000000000000009 00000000000000d8 > [63999.116895] 0000000000000018 0000000000000018 0000000000000000 ffffffffa0068134 > [63999.116896] Call Trace: > [63999.116907] [<ffffffffa0068134>] ? drbdd_init+0x147/0x1d7 [drbd] If you resolve that to a code line, I may be able to figure out what PAX is talking about. But from this stack trace alone, I have absolutely no idea what PAX is trying to say, which refcount could possibly be meant there, let alone why it could possibly overflow or. Ah, ok. Looking at [1], "PaX Team" says: .--- | after having looked at the drbd code a bit i think this could be a | real bug in drbd but only upstream can tell for sure so you'll have to | contact them. you can show them the following that i figured out so far: | | the refcount overflow was detected in | drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the | | atomic_add(len >> 9, &mdev->rs_sect_ev) Well, yes, why would it not overflow. It is *not* a refcount. It is an atomic counter. It is meant to overflow. | statement. rs_sect_ev is an atomic_t in struct drbd_conf declared in | drivers/block/drbd/drbd_int.h (i'll note here that i think the | rs_sect_in field is simiarly affected by this problem). | | based on the code, these two fields don't look like refcounts, nor are | they free-running counters or statistics either (the usual cases for | false positives). instead they're some sector counts that get reset on | certain events (the details of which i can't tell as i don't know the | drbd code). therefore my feeling is that these counts are not supposed | to overflow as they'd otherwise lead to incorrect calculations in | drbd_rs_should_slow_down and drbd_rs_controller (the latter reads | rs_sect_in into an unsigned int btw, this is mixing up signed/unsigned | integers, that can't be good...). But yes, it *is* a "free running counter". For IO that has to be accounted to the resyncer. They are reset whenever a new resync starts. Apparently you are syncing more than 2*41 byte. That's ok. Others do too. Having a lot of storage is no reason to drop the connection. | so what happened to you is that somehow rs_sect_ev reached 2G (that | corresponds to about 1TB of traffic between two counter resets or | 'events') and the signed overflow detection triggered on it (if that's | too unrealistic traffic for drbd then there was some other problem | calculating the sector counts that resulted in some big enough value to | trigger a signed overflow, though at the moment of the overflow 'len' | had a value of 8 only). in any case it looks that an atomic_t is not | enough to store real life sector counts and will have to be enlarged | probably (or the counters will have to be reset more frequently). `--- It is perfectly ok for rs_sect_ev to overflow, it is meant to overflow. It is used in some signed modulo 32bit calculation only. > [63999.116913] [<ffffffffa006f617>] ? drbd_thread_setup+0x4e/0x117 [drbd] > [63999.116917] [<ffffffffa006f5c9>] ? conn_destroy+0x86/0x86 [drbd] > [63999.116922] [<ffffffff8107fbfc>] ? kthread+0xd5/0xdd > [63999.116924] [<ffffffff8107fb27>] ? kthread_worker_fn+0xf9/0xf9 > [63999.116929] [<ffffffff81535f74>] ? ret_from_fork+0x74/0xa0 > [63999.116930] [<ffffffff8107fb27>] ? kthread_worker_fn+0xf9/0xf9 > [63999.116931] Code: 48 89 de 4c 89 ff e8 c3 80 00 00 85 c0 0f 85 b2 00 00 00 8b 54 24 1c f0 41 01 97 d0 04 00 00 71 0a f0 41 29 97 d0 04 00 00 cd 04 <bb> 03 00 00 00 f0 41 ff 87 24 02 00 00 71 0a f0 41 ff 8f 24 02 > > > and drbd itself says: > > [63999.116965] block drbd0: drbd_alloc_pages interrupted! Um, I'd guess it does say some things before that, too. Also, the other node will likely have something to say. Can you give some more context? Interrupted by whom? Is PAX delivering some signal? Which? Why would it do that? If so, stop it :-) > [63999.116968] d-con ms03: error receiving RSDataRequest, e: -12 l: 0! > [63999.116986] d-con ms03: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError ) > [63999.117021] d-con ms03: asender terminated > [63999.117025] d-con ms03: Terminating drbd_a_ms03 > [63999.130575] d-con ms03: Connection closed > [63999.130599] d-con ms03: conn( ProtocolError -> Unconnected ) > [63999.130601] d-con ms03: receiver terminated > [63999.130602] d-con ms03: Restarting receiver thread > [63999.130603] d-con ms03: receiver (re)started > [63999.130614] d-con ms03: conn( Unconnected -> WFConnection ) > [64000.116691] d-con ms03: initial packet S crossed > [64009.195530] d-con ms03: Handshake successful: Agreed network protocol version 101 > [64009.195807] d-con ms03: Peer authenticated using 64 bytes HMAC > [64009.195834] d-con ms03: conn( WFConnection -> WFReportParams ) > [64009.195843] d-con ms03: Starting asender thread (from drbd_r_ms03 [6378]) > [ ... and continues to sync ... ] > > > Is this a real bug in drbd? > > Thanks > -Marc > > > > [1] https://forums.grsecurity.net/viewtopic.php?f=3&t=3786&p=13558&hilit=REFCOUNT#p13558 > -- -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-19 14:48 ` Lars Ellenberg @ 2014-09-19 15:16 ` Marc Schiffbauer 2014-09-23 11:03 ` Lars Ellenberg 0 siblings, 1 reply; 16+ messages in thread From: Marc Schiffbauer @ 2014-09-19 15:16 UTC (permalink / raw) To: drbd-dev * Lars Ellenberg schrieb am 19.09.14 um 16:48 Uhr: >On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote: >> Hi, >> > >If you resolve that to a code line, >I may be able to figure out what PAX is talking about. > >But from this stack trace alone, I have absolutely no idea what PAX >is trying to say, which refcount could possibly be meant there, >let alone why it could possibly overflow or. > >Ah, ok. Looking at [1], "PaX Team" says: >.--- >| after having looked at the drbd code a bit i think this could be a >| real bug in drbd but only upstream can tell for sure so you'll have to >| contact them. you can show them the following that i figured out so far: >| >| the refcount overflow was detected in >| drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the >| >| atomic_add(len >> 9, &mdev->rs_sect_ev) > >Well, yes, why would it not overflow. >It is *not* a refcount. >It is an atomic counter. >It is meant to overflow. Ok, then I can report this back and it should be fixed in PaX as a false postive. Thanks for clarifying this. > >| statement. rs_sect_ev is an atomic_t in struct drbd_conf declared in >| drivers/block/drbd/drbd_int.h (i'll note here that i think the >| rs_sect_in field is simiarly affected by this problem). >| >| based on the code, these two fields don't look like refcounts, nor are >| they free-running counters or statistics either (the usual cases for >| false positives). instead they're some sector counts that get reset on >| certain events (the details of which i can't tell as i don't know the >| drbd code). therefore my feeling is that these counts are not supposed >| to overflow as they'd otherwise lead to incorrect calculations in >| drbd_rs_should_slow_down and drbd_rs_controller (the latter reads >| rs_sect_in into an unsigned int btw, this is mixing up signed/unsigned >| integers, that can't be good...). > >But yes, it *is* a "free running counter". >For IO that has to be accounted to the resyncer. >They are reset whenever a new resync starts. > >Apparently you are syncing more than 2*41 byte. >That's ok. Others do too. >Having a lot of storage is no reason to drop the connection. ack > >| so what happened to you is that somehow rs_sect_ev reached 2G (that >| corresponds to about 1TB of traffic between two counter resets or >| 'events') and the signed overflow detection triggered on it (if that's >| too unrealistic traffic for drbd then there was some other problem >| calculating the sector counts that resulted in some big enough value to >| trigger a signed overflow, though at the moment of the overflow 'len' >| had a value of 8 only). in any case it looks that an atomic_t is not >| enough to store real life sector counts and will have to be enlarged >| probably (or the counters will have to be reset more frequently). >`--- > >It is perfectly ok for rs_sect_ev to overflow, it is meant to overflow. >It is used in some signed modulo 32bit calculation only. > >> [63999.116913] [<ffffffffa006f617>] ? drbd_thread_setup+0x4e/0x117 [drbd] >> [63999.116917] [<ffffffffa006f5c9>] ? conn_destroy+0x86/0x86 [drbd] >> [63999.116922] [<ffffffff8107fbfc>] ? kthread+0xd5/0xdd >> [63999.116924] [<ffffffff8107fb27>] ? kthread_worker_fn+0xf9/0xf9 >> [63999.116929] [<ffffffff81535f74>] ? ret_from_fork+0x74/0xa0 >> [63999.116930] [<ffffffff8107fb27>] ? kthread_worker_fn+0xf9/0xf9 >> [63999.116931] Code: 48 89 de 4c 89 ff e8 c3 80 00 00 85 c0 0f 85 b2 00 00 00 8b 54 24 1c f0 41 01 97 d0 04 00 00 71 0a f0 41 29 97 d0 04 00 00 cd 04 <bb> 03 00 00 00 f0 41 ff 87 24 02 00 00 71 0a f0 41 ff 8f 24 02 >> >> >> and drbd itself says: >> >> [63999.116965] block drbd0: drbd_alloc_pages interrupted! > >Um, I'd guess it does say some things before that, too. >Also, the other node will likely have something to say. > >Can you give some more context? > >Interrupted by whom? by the PAX-System in the Kernel to protect the system becaus ethis *might* be an evil overflow. >Is PAX delivering some signal? >Which? I don't know, sorry :-/ >Why would it do that? see above >If so, stop it :-) OK ;) If you are still interested inmore context because you think there might be something wrong in the drbd code, then I can boot the previous kernel with PAX_REFCOUNT enabled to reproduce the proplem again. But it would be some effort to do zhis again, so I want to be sure its of any use for you now that it is clear that this must be a false positive. Thanks -Marc -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-19 15:16 ` Marc Schiffbauer @ 2014-09-23 11:03 ` Lars Ellenberg 2014-09-23 17:08 ` Marc Schiffbauer 2014-09-23 18:14 ` Marc Schiffbauer 0 siblings, 2 replies; 16+ messages in thread From: Lars Ellenberg @ 2014-09-23 11:03 UTC (permalink / raw) To: drbd-dev On Fri, Sep 19, 2014 at 05:16:53PM +0200, Marc Schiffbauer wrote: > * Lars Ellenberg schrieb am 19.09.14 um 16:48 Uhr: > >On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote: > >>Hi, > >> > > > >If you resolve that to a code line, > >I may be able to figure out what PAX is talking about. > > > >But from this stack trace alone, I have absolutely no idea what PAX > >is trying to say, which refcount could possibly be meant there, > >let alone why it could possibly overflow or. > > > >Ah, ok. Looking at [1], "PaX Team" says: > >.--- > >| after having looked at the drbd code a bit i think this could be a > >| real bug in drbd but only upstream can tell for sure so you'll have to > >| contact them. you can show them the following that i figured out so far: > >| > >| the refcount overflow was detected in > >| drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the > >| > >| atomic_add(len >> 9, &mdev->rs_sect_ev) > > > >Well, yes, why would it not overflow. > >It is *not* a refcount. > >It is an atomic counter. > >It is meant to overflow. > > Ok, then I can report this back and it should be fixed in PaX as a > false postive. Thanks for clarifying this. I still don't get why PAX is even sending a signal there. How could sending a signal possibly help against an overflowing counter? Ah well. I don't need to get everything ... Thanks, Lars ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-23 11:03 ` Lars Ellenberg @ 2014-09-23 17:08 ` Marc Schiffbauer 2014-09-24 10:04 ` Lars Ellenberg 2014-09-23 18:14 ` Marc Schiffbauer 1 sibling, 1 reply; 16+ messages in thread From: Marc Schiffbauer @ 2014-09-23 17:08 UTC (permalink / raw) To: drbd-dev * Lars Ellenberg schrieb am 23.09.14 um 13:03 Uhr: >On Fri, Sep 19, 2014 at 05:16:53PM +0200, Marc Schiffbauer wrote: >> * Lars Ellenberg schrieb am 19.09.14 um 16:48 Uhr: >> >On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote: >> >>Hi, >> >> >> > >> >If you resolve that to a code line, >> >I may be able to figure out what PAX is talking about. >> > >> >But from this stack trace alone, I have absolutely no idea what PAX >> >is trying to say, which refcount could possibly be meant there, >> >let alone why it could possibly overflow or. >> > >> >Ah, ok. Looking at [1], "PaX Team" says: >> >.--- >> >| after having looked at the drbd code a bit i think this could be a >> >| real bug in drbd but only upstream can tell for sure so you'll have to >> >| contact them. you can show them the following that i figured out so far: >> >| >> >| the refcount overflow was detected in >> >| drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the >> >| >> >| atomic_add(len >> 9, &mdev->rs_sect_ev) >> > >> >Well, yes, why would it not overflow. >> >It is *not* a refcount. >> >It is an atomic counter. >> >It is meant to overflow. >> >> Ok, then I can report this back and it should be fixed in PaX as a >> false postive. Thanks for clarifying this. > >I still don't get why PAX is even sending a signal there. >How could sending a signal possibly help against an overflowing counter? PAX is a patchset for the linux kernel that hardens the kernel in several ways. The REFCOUNT feature will detect and prevent overflowing of refcounters. This is a good introduction of what it does: http://wiki.gentoo.org/wiki/Hardened/PaX_Quickstart -Marc -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-23 17:08 ` Marc Schiffbauer @ 2014-09-24 10:04 ` Lars Ellenberg 0 siblings, 0 replies; 16+ messages in thread From: Lars Ellenberg @ 2014-09-24 10:04 UTC (permalink / raw) To: drbd-dev On Tue, Sep 23, 2014 at 07:08:24PM +0200, Marc Schiffbauer wrote: > * Lars Ellenberg schrieb am 23.09.14 um 13:03 Uhr: > >On Fri, Sep 19, 2014 at 05:16:53PM +0200, Marc Schiffbauer wrote: > >>* Lars Ellenberg schrieb am 19.09.14 um 16:48 Uhr: > >>>On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote: > >>>>Hi, > >>>> > >>> > >>>If you resolve that to a code line, > >>>I may be able to figure out what PAX is talking about. > >>> > >>>But from this stack trace alone, I have absolutely no idea what PAX > >>>is trying to say, which refcount could possibly be meant there, > >>>let alone why it could possibly overflow or. > >>> > >>>Ah, ok. Looking at [1], "PaX Team" says: > >>>.--- > >>>| after having looked at the drbd code a bit i think this could be a > >>>| real bug in drbd but only upstream can tell for sure so you'll have to > >>>| contact them. you can show them the following that i figured out so far: > >>>| > >>>| the refcount overflow was detected in > >>>| drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the > >>>| > >>>| atomic_add(len >> 9, &mdev->rs_sect_ev) > >>> > >>>Well, yes, why would it not overflow. > >>>It is *not* a refcount. > >>>It is an atomic counter. > >>>It is meant to overflow. > >> > >>Ok, then I can report this back and it should be fixed in PaX as a > >>false postive. Thanks for clarifying this. > > > >I still don't get why PAX is even sending a signal there. > >How could sending a signal possibly help against an overflowing counter? > > > PAX is a patchset for the linux kernel that hardens the kernel in > several ways. > > The REFCOUNT feature will detect and prevent overflowing of > refcounters. > > This is a good introduction of what it does: > http://wiki.gentoo.org/wiki/Hardened/PaX_Quickstart Yeah. Why would sending a signal to the current thread prevent a counter from overflowing. How could that possibly improve the overall situation, even assuming that the overflow would have bad side effects? Never mind. This won't be a fruitful discussion... -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-23 11:03 ` Lars Ellenberg 2014-09-23 17:08 ` Marc Schiffbauer @ 2014-09-23 18:14 ` Marc Schiffbauer 2014-09-24 10:14 ` Lars Ellenberg 1 sibling, 1 reply; 16+ messages in thread From: Marc Schiffbauer @ 2014-09-23 18:14 UTC (permalink / raw) To: drbd-dev * Lars Ellenberg schrieb am 23.09.14 um 13:03 Uhr: >On Fri, Sep 19, 2014 at 05:16:53PM +0200, Marc Schiffbauer wrote: >> * Lars Ellenberg schrieb am 19.09.14 um 16:48 Uhr: >> >On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote: >> >>Hi, >> >> >> > >> >If you resolve that to a code line, >> >I may be able to figure out what PAX is talking about. >> > >> >But from this stack trace alone, I have absolutely no idea what PAX >> >is trying to say, which refcount could possibly be meant there, >> >let alone why it could possibly overflow or. >> > >> >Ah, ok. Looking at [1], "PaX Team" says: >> >.--- >> >| after having looked at the drbd code a bit i think this could be a >> >| real bug in drbd but only upstream can tell for sure so you'll have to >> >| contact them. you can show them the following that i figured out so far: >> >| >> >| the refcount overflow was detected in >> >| drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the >> >| >> >| atomic_add(len >> 9, &mdev->rs_sect_ev) >> > >> >Well, yes, why would it not overflow. >> >It is *not* a refcount. >> >It is an atomic counter. >> >It is meant to overflow. Another question PaX-Team is asking: what about rs_sect_in? Thanks -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-23 18:14 ` Marc Schiffbauer @ 2014-09-24 10:14 ` Lars Ellenberg 2014-09-24 12:50 ` Lars Ellenberg 0 siblings, 1 reply; 16+ messages in thread From: Lars Ellenberg @ 2014-09-24 10:14 UTC (permalink / raw) To: drbd-dev On Tue, Sep 23, 2014 at 08:14:21PM +0200, Marc Schiffbauer wrote: > * Lars Ellenberg schrieb am 23.09.14 um 13:03 Uhr: > >On Fri, Sep 19, 2014 at 05:16:53PM +0200, Marc Schiffbauer wrote: > >>* Lars Ellenberg schrieb am 19.09.14 um 16:48 Uhr: > >>>On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote: > >>>>Hi, > >>>> > >>> > >>>If you resolve that to a code line, > >>>I may be able to figure out what PAX is talking about. > >>> > >>>But from this stack trace alone, I have absolutely no idea what PAX > >>>is trying to say, which refcount could possibly be meant there, > >>>let alone why it could possibly overflow or. > >>> > >>>Ah, ok. Looking at [1], "PaX Team" says: > >>>.--- > >>>| after having looked at the drbd code a bit i think this could be a > >>>| real bug in drbd but only upstream can tell for sure so you'll have to > >>>| contact them. you can show them the following that i figured out so far: > >>>| > >>>| the refcount overflow was detected in > >>>| drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the > >>>| > >>>| atomic_add(len >> 9, &mdev->rs_sect_ev) > >>> > >>>Well, yes, why would it not overflow. > >>>It is *not* a refcount. > >>>It is an atomic counter. > >>>It is meant to overflow. > > > Another question PaX-Team is asking: > > what about rs_sect_in? That usually should not overflow, as it is typically regularly (several times per second) reset to zero (and for other reasons). If you manage to transfer more 2 TiByte in subseconds via a single TCP connection, more power to you. Still, if it should overflow (for whatever reason), no real harm done. Arbitrarily sending a signal or terminating processes in that case would be the only actually disturbing thing. -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-24 10:14 ` Lars Ellenberg @ 2014-09-24 12:50 ` Lars Ellenberg 2014-09-24 15:57 ` PaX Team 0 siblings, 1 reply; 16+ messages in thread From: Lars Ellenberg @ 2014-09-24 12:50 UTC (permalink / raw) To: drbd-dev On Wed, Sep 24, 2014 at 12:14:51PM +0200, Lars Ellenberg wrote: > On Tue, Sep 23, 2014 at 08:14:21PM +0200, Marc Schiffbauer wrote: > > * Lars Ellenberg schrieb am 23.09.14 um 13:03 Uhr: > > >On Fri, Sep 19, 2014 at 05:16:53PM +0200, Marc Schiffbauer wrote: > > >>* Lars Ellenberg schrieb am 19.09.14 um 16:48 Uhr: > > >>>On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote: > > >>>>Hi, > > >>>> > > >>> > > >>>If you resolve that to a code line, > > >>>I may be able to figure out what PAX is talking about. > > >>> > > >>>But from this stack trace alone, I have absolutely no idea what PAX > > >>>is trying to say, which refcount could possibly be meant there, > > >>>let alone why it could possibly overflow or. > > >>> > > >>>Ah, ok. Looking at [1], "PaX Team" says: > > >>>.--- > > >>>| after having looked at the drbd code a bit i think this could be a > > >>>| real bug in drbd but only upstream can tell for sure so you'll have to > > >>>| contact them. you can show them the following that i figured out so far: > > >>>| > > >>>| the refcount overflow was detected in > > >>>| drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the > > >>>| > > >>>| atomic_add(len >> 9, &mdev->rs_sect_ev) > > >>> > > >>>Well, yes, why would it not overflow. > > >>>It is *not* a refcount. > > >>>It is an atomic counter. > > >>>It is meant to overflow. > > > > > > Another question PaX-Team is asking: > > > > what about rs_sect_in? > > That usually should not overflow, as it is typically regularly (several > times per second) reset to zero (and for other reasons). > > If you manage to transfer more 2 TiByte in subseconds via a single TCP > connection, more power to you. > > Still, if it should overflow (for whatever reason), no real harm done. > Arbitrarily sending a signal or terminating processes in that case would > be the only actually disturbing thing. Ok. So what PAX really is doing is redefine "atomic_add" and similar to basically become a no-op, if it would overflow. typedef struct { int counter } atomic_t; void atomic_add(int i, atomic_t *v) { v->counter += i; if (that_caused_a_counter_wrap_in_any_direction) { /* oops, overflow */ SCREAM("help me, overflow..."); v->counter -= i; } } If that *is* really an object refcount, and somewhere would be if (atomic_dec_and_test(that_count)) free(some_object); then ok, you have replace one bug with an error message and different bug. Might help with debugging. Not with much else. But really. Precautionary changing (x + y) to be silently identical to (x + 0), "just in case", will surely generally improve program flow... D'oh. Anyways, now that I know PAX is really just keeping that counter at a fixed value of INT_MAX in this case, and nothing else, what would have caused DRBD to disconnect/reconnect? Could that have been you? -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-24 12:50 ` Lars Ellenberg @ 2014-09-24 15:57 ` PaX Team 2014-09-24 16:31 ` Lars Ellenberg 0 siblings, 1 reply; 16+ messages in thread From: PaX Team @ 2014-09-24 15:57 UTC (permalink / raw) To: drbd-dev, Lars Ellenberg On 24 Sep 2014 at 14:50, Lars Ellenberg wrote: > So what PAX really is doing is redefine "atomic_add" and similar to > basically become a no-op, if it would overflow. correct. > typedef struct { int counter } atomic_t; > void atomic_add(int i, atomic_t *v) > { > v->counter += i; > if (that_caused_a_counter_wrap_in_any_direction) { > /* oops, overflow */ > SCREAM("help me, overflow..."); > v->counter -= i; first, to reduce the window of exploitation, the refcount operation is undone *before* the reaction (which is more than just a log btw). second, on many archs (pretty much all but x86) the operation isn't even committed to memory (hence doesn't need to be undone). > } > } > > If that *is* really an object refcount, > and somewhere would be > if (atomic_dec_and_test(that_count)) > free(some_object); > then ok, you have replace one bug > with an error message and different bug. > Might help with debugging. Not with much else. not correct ;). first of all, some stats based on 3.16.3 (not exact counts, e.g., only a few archs are included as that's what i had in my gtags database): func call-count ----------------------------- normal unchecked inc 2100 300 add 180 45 add_return 80 35 kref_get 380 n/a long_inc 60 50 long_add 30 20 so the overwhelming majority of atomic_t users is refcounts in fact, not stats and other exceptional cases where overflow is normal and acceptable. second, for the refcount case any code path which leaves the refcount unbalanced tends to be an exploitable security bug since it directly ends up in a use-after-free situation when the refcount wraps around 0 and back (or whatever the trigger value for freeing is, 0 is typical). stopping the refcount halfway means that such bugs are not exploitable. the price you pay is a memory leak but you already pay that without the overflow protection *and* get a nice root hole at the same time. clearly this is not a debugging aid but a very efficient runtime exploit prevention measure. in fact in practice this protection triggers on actual attacks only, the refcount leaking paths don't get executed nearly enough (we had some false positives that were only detected after some 200 days of uptime, unfortunately finding them by static analysis is a hard problem). so in short: this is not for debugging, this doesn't replace one bug with another, but it does prevent real life exploitation of refcount overflow bugs. > But really. > Precautionary changing (x + y) to be silently identical to (x + 0), > "just in case", will surely generally improve program flow... D'oh. 'just in case' you didn't know, signed overflow is undefined behaviour in C. > Anyways, now that I know PAX is really just keeping that counter > at a fixed value of INT_MAX in this case, and nothing else, > what would have caused DRBD to disconnect/reconnect? perhaps it's a consequence of the reaction from the kernel on the overflow which is equivalent to a SIGKILL with all that it implies (files and network connections get closed, etc). cheers, PaX Team ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-24 15:57 ` PaX Team @ 2014-09-24 16:31 ` Lars Ellenberg 2014-09-24 18:07 ` PaX Team 0 siblings, 1 reply; 16+ messages in thread From: Lars Ellenberg @ 2014-09-24 16:31 UTC (permalink / raw) To: PaX Team; +Cc: drbd-dev On Wed, Sep 24, 2014 at 05:57:42PM +0200, PaX Team wrote: > On 24 Sep 2014 at 14:50, Lars Ellenberg wrote: > > > So what PAX really is doing is redefine "atomic_add" and similar to > > basically become a no-op, if it would overflow. > > correct. > > Might help with debugging. Not with much else. > > not correct ;) > so in short: this is not for debugging, this doesn't replace one bug > with another, but it does prevent real life exploitation of refcount > overflow bugs. It won't make things "work". It probably makes things crash in less obscure ways, though, I give you that. > > Anyways, now that I know PAX is really just keeping that counter > > at a fixed value of INT_MAX in this case, and nothing else, > > what would have caused DRBD to disconnect/reconnect? > > perhaps it's a consequence of the reaction from the kernel on the overflow > which is equivalent to a SIGKILL with all that it implies (files and network > connections get closed, etc). That would be the result of the _ASM_EXTABLE()? or what causes that "reaction"? As the process in question in *this* case is a drbd kernel thread, it does not much care about that KILL. It notices, clears it, and lives on. But how would KILL'ing an innocent userland process improve the overall situation? Being a user land process, it cannot possibly be blamed for an in-kernel counter overflow, so why even kill it? Lars ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-24 16:31 ` Lars Ellenberg @ 2014-09-24 18:07 ` PaX Team 2014-09-24 21:50 ` Lars Ellenberg 0 siblings, 1 reply; 16+ messages in thread From: PaX Team @ 2014-09-24 18:07 UTC (permalink / raw) To: Lars Ellenberg; +Cc: drbd-dev On 24 Sep 2014 at 18:31, Lars Ellenberg wrote: > On Wed, Sep 24, 2014 at 05:57:42PM +0200, PaX Team wrote: > > so in short: this is not for debugging, this doesn't replace one bug > > with another, but it does prevent real life exploitation of refcount > > overflow bugs. > > It won't make things "work". It probably makes things crash in less > obscure ways, though, I give you that. not sure what you mean by making things 'work' but the goal of exploit prevention is to prevent the exploitation of (even unknown) bugs, not to find or fix them (of course these are (desirable) sideeffects once an attacker is unfortunate enough to try his exploit on a protected system. see more below about the reaction part. > > perhaps it's a consequence of the reaction from the kernel on the overflow > > which is equivalent to a SIGKILL with all that it implies (files and network > > connections get closed, etc). > > That would be the result of the _ASM_EXTABLE()? > or what causes that "reaction"? no, the extable mechanism is only used to re-enter the kernel in a known way to be able to report back on the detected refcount overflow. the actual reaction is in pax_report_refcount_overflow (you'll need a grsec or PaX tree to see its body, it's not in the upstream kernel). it basically logs details about the overflow (registers, process info, etc) then forces a SIGKILL into the task. you can see its output in the original report in this thread in fact, this is what enabled me to figure out which atomic variable was involved and start a discussion about this case (FYI, i've since turned both variables into the 'unchecked' type). > As the process in question in *this* case is a drbd kernel thread, it > does not much care about that KILL. It notices, clears it, and lives on. grsecurity handles kernel tasks too via gr_handle_kernel_exploit but for the refcount overflow detection we specifically chose to ignore them for two reasons. first, in the typical exploit scenario of these kinds of bugs it's a userland process in whose context the refcount overflow triggers. second, since this is an early detection (i.e., before any damage could have been done by an attack), the kernel state isn't corrupted yet and is thus recoverable, so it's not urgent to halt the system (which is otherwise necessary when unrecoverable state change occurs, think various forms of memory corruption, etc). > But how would KILL'ing an innocent userland process improve the overall > situation? Being a user land process, it cannot possibly be blamed for > an in-kernel counter overflow, so why even kill it? notwithstanding the very few false positives that arise due to our 'secure by default' choice in handling atomic_t accessors (i actually blame the kernel's lack of a proper abstraction layer on top atomic_t ;), an exploit is anything but an innocent userland process and the proper way to handle it is to kill it and also ban the user account (all this is a configurable choice in grsecurity). cheers, PaX Team ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-24 18:07 ` PaX Team @ 2014-09-24 21:50 ` Lars Ellenberg 2014-09-24 23:25 ` PaX Team 0 siblings, 1 reply; 16+ messages in thread From: Lars Ellenberg @ 2014-09-24 21:50 UTC (permalink / raw) To: PaX Team; +Cc: drbd-dev On Wed, Sep 24, 2014 at 08:07:11PM +0200, PaX Team wrote: > > > perhaps it's a consequence of the reaction from the kernel on the overflow > > > which is equivalent to a SIGKILL with all that it implies (files and network > > > connections get closed, etc). > > > > That would be the result of the _ASM_EXTABLE()? > > or what causes that "reaction"? > > no, the extable mechanism is only used to re-enter the kernel in a known > way to be able to report back on the detected refcount overflow. the actual > reaction is in pax_report_refcount_overflow Which is registered in the corresponding place in the exception table. So yes. > (you'll need a grsec or PaX tree I browsed some PaX patch instead. > to see its body, it's not in the upstream kernel). it basically logs details > about the overflow (registers, process info, etc) then forces a SIGKILL into > the task. > > you can see its output in the original report in this thread in fact, this > is what enabled me to figure out which atomic variable was involved and start > a discussion about this case (FYI, i've since turned both variables into the > 'unchecked' type). > > > As the process in question in *this* case is a drbd kernel thread, it > > does not much care about that KILL. It notices, clears it, and lives on. > > grsecurity handles kernel tasks too via gr_handle_kernel_exploit but for > the refcount overflow detection we specifically chose to ignore them for > two reasons. first, in the typical exploit scenario of these kinds of bugs > it's a userland process in whose context the refcount overflow triggers. Then I guess Marc is a very lucky guy... Otherwise you had killed the whole box just because it managed to sync the first TiB ;-) > second, since this is an early detection (i.e., before any damage could > have been done by an attack), the kernel state isn't corrupted yet and is > thus recoverable, so it's not urgent to halt the system (which is otherwise > necessary when unrecoverable state change occurs, think various forms of > memory corruption, etc). > > > But how would KILL'ing an innocent userland process improve the overall > > situation? Being a user land process, it cannot possibly be blamed for > > an in-kernel counter overflow, so why even kill it? > > notwithstanding the very few false positives that arise due to our 'secure > by default' choice in handling atomic_t accessors (i actually blame the > kernel's lack of a proper abstraction layer on top atomic_t ;), an exploit > is anything but an innocent userland process and the proper way to handle > it is to kill it and also ban the user account (all this is a configurable > choice in grsecurity). Carefully crafter exploits may be able to exploit PaX for a nice DoS, provoking it to kill someone else instead, no? I predicted earlier that this would not be a fruitful discussion. Because where you come from, a dead system is better than "suspicious behavior", and anyone that even only happens to be in the vicinity of "suspicious behavior" will get shot as a precautionary measure -- "collateral damage, should not have been there in the first place, really his own fault, what was he thinking" ;-) (For arbitrary values^W^W empirically sampled values of suspicious) And even though I sure can flex my mind, go those places, think that way, I rather not. Anyways, if it helps make the world a better place... At least it's all just bits and entropy :) Cheers, Lars ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-24 21:50 ` Lars Ellenberg @ 2014-09-24 23:25 ` PaX Team 2014-09-25 0:07 ` Lars Ellenberg 0 siblings, 1 reply; 16+ messages in thread From: PaX Team @ 2014-09-24 23:25 UTC (permalink / raw) To: Lars Ellenberg; +Cc: spender, drbd-dev On 24 Sep 2014 at 23:50, Lars Ellenberg wrote: > On Wed, Sep 24, 2014 at 08:07:11PM +0200, PaX Team wrote: > > > > perhaps it's a consequence of the reaction from the kernel on the overflow > > > > which is equivalent to a SIGKILL with all that it implies (files and network > > > > connections get closed, etc). > > > > > > That would be the result of the _ASM_EXTABLE()? > > > or what causes that "reaction"? > > > > no, the extable mechanism is only used to re-enter the kernel in a known > > way to be able to report back on the detected refcount overflow. the actual > > reaction is in pax_report_refcount_overflow > > Which is registered in the corresponding place in the exception table. > So yes. no, what is registered in the exception table is the address of the continuation address after the arch-specific insn that triggers on the overflow. then the arch specific trap handler will call pax_report_refcount_overflow which does the actual reaction. the reason i didn't just do a direct call to pax_report_refcount_overflow from within atomic ops is that it'd bloat the code a lot more and would also make it harder to report the register context, etc in an arch independent way. so no, once again the extable mechanism is not an inherently needed mechanism for reaction, it's an implementation choice only. > > grsecurity handles kernel tasks too via gr_handle_kernel_exploit but for > > the refcount overflow detection we specifically chose to ignore them for > > two reasons. first, in the typical exploit scenario of these kinds of bugs > > it's a userland process in whose context the refcount overflow triggers. > > Then I guess Marc is a very lucky guy... > Otherwise you had killed the whole box just because > it managed to sync the first TiB ;-) yes, that's the nature of both false positives and real life exploits as well. it's called taking a risk and the world out there in general believes that an owned box is a much worse case than a halted one. > > notwithstanding the very few false positives that arise due to our 'secure > > by default' choice in handling atomic_t accessors (i actually blame the > > kernel's lack of a proper abstraction layer on top atomic_t ;), an exploit > > is anything but an innocent userland process and the proper way to handle > > it is to kill it and also ban the user account (all this is a configurable > > choice in grsecurity). > > Carefully crafted exploits may be able to exploit PaX for a nice DoS, > provoking it to kill someone else instead, no? given a refcount overflow bug they can already do that (remember they cause a use-after-free bug which isn't hard to exploit for arbitrary code execution or just trashing kernel memory in practical cases), so what's the issue? same holds for any other kind of exploitation method that we catch, an attacker already starts with some kind of arbitrary code execution (and privilege escalation) capability on unprotected boxes, we improve (=downgrade) that in certain cases to DoS. nobody has any better ways to handle this. > I predicted earlier that this would not be a fruitful discussion. > > Because where you come from, a dead system is better than "suspicious > behavior", and anyone that even only happens to be in the vicinity of > "suspicious behavior" will get shot as a precautionary measure -- > "collateral damage, should not have been there in the first place, > really his own fault, what was he thinking" ;-) > > (For arbitrary values^W^W empirically sampled values of suspicious) you're conflating the general concept of reaction to exploits with very rare cases of false positives in certain prevention techniques. the latter occur for refcount overflow prevention only because the kernel hasn't provided an API to separate out refcount uses from other cases where overflow is benign. this is not my fault and we did our best to uncover these cases but manual and static analysis can only go so far. there're also exploit prevention methods that produce no false positives (e.g., preventing the kernel from executing code from userland). > And even though I sure can flex my mind, go those places, > think that way, I rather not. > > Anyways, if it helps make the world a better place... > At least it's all just bits and entropy :) every major operating system vendor and cpu manufacturer have adopted protection techniques that were pioneered by us, i think that speaks more than a lone doubting Thomas on the drbd list ;). cheers, PaX Team ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-24 23:25 ` PaX Team @ 2014-09-25 0:07 ` Lars Ellenberg 2014-09-27 0:45 ` PaX Team 0 siblings, 1 reply; 16+ messages in thread From: Lars Ellenberg @ 2014-09-25 0:07 UTC (permalink / raw) To: PaX Team; +Cc: spender, drbd-dev On Thu, Sep 25, 2014 at 01:25:04AM +0200, PaX Team wrote: > On 24 Sep 2014 at 23:50, Lars Ellenberg wrote: > > > On Wed, Sep 24, 2014 at 08:07:11PM +0200, PaX Team wrote: > > > > > perhaps it's a consequence of the reaction from the kernel on the overflow > > > > > which is equivalent to a SIGKILL with all that it implies (files and network > > > > > connections get closed, etc). > > > > > > > > That would be the result of the _ASM_EXTABLE()? > > > > or what causes that "reaction"? > > > > > > no, the extable mechanism is only used to re-enter the kernel in a known > > > way to be able to report back on the detected refcount overflow. the actual > > > reaction is in pax_report_refcount_overflow > > > > Which is registered in the corresponding place in the exception table. > > So yes. > > no, what is registered in the exception table is the address of the continuation ... > it's an implementation choice only. And that was what I wanted to know about. So yes ;-) > > I predicted earlier that this would not be a fruitful discussion. > you're conflating the general concept of reaction to exploits Absolutely, yes, I do. In this overflow case, I still don't see why it is supposed to help to shoot the process that happens to be "current" when the overflow occurs, respectively why it would be a valid assumption that this process would be "the exploit", provoking the overflow on purpose... But never mind, if you say that's so, fine. > every major operating system vendor and cpu manufacturer have adopted protection > techniques that were pioneered by us, i think that speaks more than a lone doubting > Thomas on the drbd list ;). I know. I'm not doubting. I get it. I just don't like it (precautionary shooting in general), is all. Though I dislike it even more that it (implementing protection techniques) is even necessary, if you have to assume they are out to get you. Which is more a fact than an assumption, in more scenarios than I'd like it to be. So you keep preventing existing bugs elsewhere from becoming exploitable, and I'll shut up :) Lars ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync 2014-09-25 0:07 ` Lars Ellenberg @ 2014-09-27 0:45 ` PaX Team 0 siblings, 0 replies; 16+ messages in thread From: PaX Team @ 2014-09-27 0:45 UTC (permalink / raw) To: Lars Ellenberg; +Cc: spender, drbd-dev On 25 Sep 2014 at 2:07, Lars Ellenberg wrote: > In this overflow case, I still don't see why it is supposed to help to > shoot the process that happens to be "current" when the overflow occurs, > respectively why it would be a valid assumption that this process would > be "the exploit", provoking the overflow on purpose... for an exploit to be useful it has to be both successful (be able to exploit the bug(s)) and also reliable (do so 100% of the time, or at least without any ill side-effects in case of both success and failure). for refcount overflow based exploits it means precise control over the refcount. since the exploit is a userland process, in practice only those refcount bugs are relevant that execute in syscalls. therefore 'current' is actually the exploit process in practice and more strategically, it runs under the uid that must have been owned a priori. so killing the triggering process does not only clean up an otherwise cpu-hungry process (taking 4G refcounts takes time) but also generates an event about a potential breach of security with actionable details (at least grsec has such capabilities). > Though I dislike it even more that it (implementing protection > techniques) is even necessary, if you have to assume they are out to get > you. Which is more a fact than an assumption, in more scenarios than > I'd like it to be. So you keep preventing existing bugs elsewhere from > becoming exploitable, and I'll shut up :) our protection mechanisms (will continue to) cover all kernel code, including drbd ;). speaking of that, you could do some hardening yourself in fact, e.g., constify data_cmd and asender_cmd. we do this automatically with a gcc plugin but it'd be a trivial source patch. cheers, PaX Team ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2014-09-29 18:31 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-09-19 9:49 [Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync Marc Schiffbauer 2014-09-19 14:48 ` Lars Ellenberg 2014-09-19 15:16 ` Marc Schiffbauer 2014-09-23 11:03 ` Lars Ellenberg 2014-09-23 17:08 ` Marc Schiffbauer 2014-09-24 10:04 ` Lars Ellenberg 2014-09-23 18:14 ` Marc Schiffbauer 2014-09-24 10:14 ` Lars Ellenberg 2014-09-24 12:50 ` Lars Ellenberg 2014-09-24 15:57 ` PaX Team 2014-09-24 16:31 ` Lars Ellenberg 2014-09-24 18:07 ` PaX Team 2014-09-24 21:50 ` Lars Ellenberg 2014-09-24 23:25 ` PaX Team 2014-09-25 0:07 ` Lars Ellenberg 2014-09-27 0:45 ` PaX Team
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.