* PROBLEM: Kernel 2.6.10 crashing repeatedly and hard
@ 2004-12-30 0:31 Georg C. F. Greve
From: Georg C. F. Greve @ 2004-12-30 0:31 UTC (permalink / raw)
To: linux-kernel; +Cc: dm-crypt
[-- Attachment #1.1: Type: text/plain, Size: 496 bytes --]
Hi all,
I've been moving things on my server to software RAID5 combined with LVM,
trying out the device mapper (dm-crypt) on top of that with an ext3
filesystem, and have seen repeated hard crashes. Each time the machine was
entirely dead.
Since the machine was quite stable for a couple of days with 2.6.10
before the move to the software RAID setup, my suspicion is that the
problem is related to LVM, DM, ext3, or some combination of the three.
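For reference, the stack was assembled roughly along these lines (assuming
MD underneath LVM; device names and sizes are placeholders, not the exact
commands I used):

  # md RAID5 across three members (hypothetical devices)
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
  # LVM on top of the array
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0
  lvcreate -L 100G -n data vg0
  # plain dm-crypt mapping via cryptsetup, ext3 on top of the mapping
  cryptsetup -c aes create cdata /dev/vg0/data
  mkfs.ext3 /dev/mapper/cdata
  mount /dev/mapper/cdata /mnt/data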
This is what I could preserve in output from the crashes:
[-- Attachment #1.2: crash1 --]
[-- Type: text/plain, Size: 1069 bytes --]
Call Trace:
[<c0141131>] cache_flusharray+0x41/0xb0
[<c0141398>] kmem_cache_free+0x38/0x40
[<c0158c4f>] free_buffer_head+0x1f/0x70
[<c0158a19>] try_to_free_buffers+0x59/0x90
[<c019e65d>] journal_try_to_free_buffers+0xbd/0x130
[<c018fe20>] ext3_releasepage+0x30/0x60
[<c0156d99>] try_to_release_page+0x39/0x50
[<c014349f>] shrink_list+0x35f/0x440
[<c0143707>] shrink_cache+0x187/0x430
[<c017cf97>] mb_cache_shrink_fn+0x167/0x170
[<c0142e62>] shrink_slab+0x82/0x1b0
[<c0143ff2>] shrink_zone+0xb2/0xe0
[<c014447e>] balance_pgdat+0x24e/0x2d0
[<c01445dc>] kswapd+0xdc/0x100
[<c0130c30>] autoremove_wake_function+0x0/0x40
[<c0102dc2>] ret_from_fork+0x6/0x14
[<c0130c30>] autoremove_wake_function+0x0/0x40
[<c0144500>] kswapd+0x0/0x100
[<c010129d>] kernel_thread_helper+0x5/0x18
Code: 7e 8d 46 38 89 04 24 8b 44 24 1c 8b 15 10 c0 53 c0 8b 0c b8 8d 81 00 00 00
40 c1 e8 0c c1 e0 05 8b 5c 02 1c 8b 53 04 8b 03 89 02 <89> 50 04 8b 43 0c c7 03
00 01 10 00 29 c1 c7 43 04 00 02 20 00
<6>note: kswapd0[196] exited with preempt_count 1
[-- Attachment #1.3: crash2 --]
[-- Type: text/plain, Size: 1140 bytes --]
[<c01565fe>] alloc_page_buffers+0x1e/0x90
[<c0156ea8>] create_empty_buffers+0x18/0x90
[<c0157613>] __block_prepare_write+0x373/0x3c0
[<c0157d70>] block_prepare_write+0x20/0x30
[<c018f0f0>] ext3_get_block+0x0/0x70
[<c018f5b8>] ext3_prepare_write+0x58/0x110
[<c018f0f0>] ext3_get_block+0x0/0x70
[<c013a59f>] generic_file_buffered_write+0x19f/0x600
[<c0130c30>] autoremove_wake_function+0x0/0x40
[<c013ac53>] __generic_file_aio_write_nolock+0x37/0x90
[<c013b0f0>] generic_file_aio_write_nolock+0x37/0x90
[<c013b0f0>] generic_file_aio_write+0x60/0xc0
[<c018d18a>] ext3_file_write+0x2a/0xa0
[<c01547db>] do_sync_write+0xab/0xe0
[<c01383b4>] wait_on_page_writeback_range+0x74/0x120
[<c0130c30>] autoremove_wake_function+0x0/0x40
[<c018d2b7>] ext3_sync_file+0xb7/0xc0
[<c015489c>] vfs_write+0x8c/0xd0
[<c015498d>] sys_write+0x3d/0x70
[<c0102eef>] syscall_call+0x7/0xb
Code: 14 42 25 ff ff 00 00 89 51 10 8b 3c 24 66 8b 04 47 66 89 41 14 8b 44 24 24
3b 50 58 73 06 4e 83 fe ff 75 b5 8b 51 04 8b 01 89 02 <89> 50 04 c7 01 00 01 10
00 c7 41 04 00 02 20 00 66 83 79 14 ff
<6>note: mythbackend[16084] exited with preempt_count 1
[-- Attachment #1.4: crash3 --]
[-- Type: text/plain, Size: 1165 bytes --]
EFLAGS: 00010002 (2.6.10)
EIP is at free_block+0x45/0xd0
eax: 46484849 ebx: df2b1000 ecx: df2b1050 edx: df2ab000
esi: c183cd80 edi: 00000001 ebp: 00000018 esp: c188fef8
ds: 007b es: 007b ss: 0068
Process events/0 (pid: 6, threadinfo:c188e000 task=c185ca20)
Stack: c183cdb8 c1858810 c1858800 00000018 c183cd80 c0141724 c183cd80 c1858810
00000018 c183ccb8 c183cd80 00000002 c183cce0 c01417c6 c183cd80 c1858800
00000000 c183ccb8 c183ce10 00000003 c170fc20 c183b000 c170fc24 00000000
Call Trace:
[<c0141724>] drain_array_locked+0x54/0x80
[<c01417c6>] cache_reap+0x75/0x1e0
[<c012cb07>] worker_thread+0x197/0x230
[<c0141750>] cache_reap+0x0/0x1e0
[<c011a590>] default_wake_function+0x0/0x20
[<c011a590>] default_wake_function+0x0/0x20
[<c012c970>] worker_thread+0x0/0x230
[<c0130827>] kthread+0xa7/0xb0
[<c0130790>] kthread+0x0/0xb0
[<c010129d>] kernel_thread_helper+0x5/0x18
Code: 7e 8d 46 38 89 04 24 8b 44 24 1c 8b 15 10 c0 53 c0 8b 0c b8 8d 81 00 00 00
40 c1 e8 0c c1 e0 05 8b 5c 02 1c 8b 53 04 8b 03 89 02 <89> 50 04 8b 43 0c c7 03
00 01 10 00 29 c1 c7 43 04 00 02 20 00
<6>note: events/0[6] exited with preempt_count 1
[-- Attachment #1.5: Type: text/plain, Size: 394 bytes --]
All of them have in common the notice of some process having
"exited with preempt_count 1"
and all of them happened within three hours -- this is the first time
that a mainline kernel has been so consistently unstable for me, in
fact.
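The traces above are only what I could copy off the console; a complete
oops could presumably be captured over the network with netconsole
(CONFIG_NETCONSOLE=m in the config below). A minimal sketch, with
placeholder addresses and MAC:

  # on the crashing box: send console messages to 192.168.1.2:6666 via eth0
  modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.2/00:11:22:33:44:55
  # on the receiving box: collect the UDP stream
  nc -u -l -p 6666 | tee oops.log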
The machine is a P4 Xeon 2.8GHz running Debian GNU/Linux (sarge);
attached are the lspci -vvv and lsusb -vvv output and the kernel
configuration file:
[-- Attachment #1.6: lspci.vvv --]
[-- Type: text/plain, Size: 13457 bytes --]
0000:00:00.0 Host bridge: Intel Corp. 82875P/E7210 Memory Controller Hub (rev 02)
Subsystem: ASUSTeK Computer Inc.: Unknown device 80f6
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
Latency: 0
Region 0: Memory at f4000000 (32-bit, prefetchable) [size=64M]
Capabilities: [e4] #09 [2106]
Capabilities: [a0] AGP version 3.0
Status: RQ=32 Iso- ArqSz=2 Cal=2 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3+ Rate=x4,x8
Command: RQ=1 ArqSz=0 Cal=2 SBA+ AGP+ GART64- 64bit- FW- Rate=x8
0000:00:01.0 PCI bridge: Intel Corp. 82875P Processor to AGP Controller (rev 02) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64
Bus: primary=00, secondary=01, subordinate=01, sec-latency=64
I/O behind bridge: 0000f000-00000fff
Memory behind bridge: fc900000-fe9fffff
Prefetchable memory behind bridge: dfe00000-efdfffff
BridgeCtl: Parity- SERR- NoISA- VGA+ MAbort- >Reset- FastB2B-
0000:00:1d.0 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02) (prog-if 00 [UHCI])
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 16
Region 4: I/O ports at eec0 [size=32]
0000:00:1d.1 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02) (prog-if 00 [UHCI])
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 19
Region 4: I/O ports at ef00 [size=32]
0000:00:1d.2 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #3 (rev 02) (prog-if 00 [UHCI])
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin C routed to IRQ 18
Region 4: I/O ports at ef20 [size=32]
0000:00:1d.3 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 (rev 02) (prog-if 00 [UHCI])
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 16
Region 4: I/O ports at ef40 [size=32]
0000:00:1d.7 USB Controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02) (prog-if 20 [EHCI])
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin D routed to IRQ 23
Region 0: Memory at febffc00 (32-bit, non-prefetchable) [size=1K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] #0a [20a0]
0000:00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev c2) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=64
I/O behind bridge: 0000d000-0000dfff
Memory behind bridge: fea00000-feafffff
Prefetchable memory behind bridge: efe00000-efefffff
BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
0000:00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
0000:00:1f.1 IDE interface: Intel Corp. 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02) (prog-if 8a [Master SecP PriP])
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 18
Region 0: I/O ports at <unassigned>
Region 1: I/O ports at <unassigned>
Region 2: I/O ports at <unassigned>
Region 3: I/O ports at <unassigned>
Region 4: I/O ports at fc00 [size=16]
Region 5: Memory at 40000000 (32-bit, non-prefetchable) [size=1K]
0000:00:1f.2 IDE interface: Intel Corp. 82801EB (ICH5) SATA Controller (rev 02) (prog-if 8f [Master SecP SecO PriP PriO])
Subsystem: ASUSTeK Computer Inc.: Unknown device 80a6
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 18
Region 0: I/O ports at efe0 [size=8]
Region 1: I/O ports at efac [size=4]
Region 2: I/O ports at efa0 [size=8]
Region 3: I/O ports at efa8 [size=4]
Region 4: I/O ports at ef90 [size=16]
0000:00:1f.3 SMBus: Intel Corp. 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Interrupt: pin B routed to IRQ 10
Region 4: I/O ports at 0400 [size=32]
0000:00:1f.5 Multimedia audio controller: Intel Corp. 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller (rev 02)
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 17
Region 0: I/O ports at e800 [size=256]
Region 1: I/O ports at ee80 [size=64]
Region 2: Memory at febff800 (32-bit, non-prefetchable) [size=512]
Region 3: Memory at febff400 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
0000:01:00.0 VGA compatible controller: nVidia Corporation NV34 [GeForce FX 5200] (rev a1) (prog-if 00 [VGA])
Subsystem: ASUSTeK Computer Inc.: Unknown device 80cf
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 248 (1250ns min, 250ns max)
Interrupt: pin A routed to IRQ 16
Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at e0000000 (32-bit, prefetchable) [size=128M]
Expansion ROM at fe9e0000 [disabled] [size=128K]
Capabilities: [60] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [44] AGP version 3.0
Status: RQ=32 Iso- ArqSz=0 Cal=3 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3+ Rate=x4,x8
Command: RQ=32 ArqSz=2 Cal=0 SBA+ AGP+ GART64- 64bit- FW- Rate=x8
0000:02:03.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80) (prog-if 10 [OHCI])
Subsystem: ASUSTeK Computer Inc.: Unknown device 808a
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (8000ns max), Cache Line Size: 0x04 (16 bytes)
Interrupt: pin A routed to IRQ 20
Region 0: Memory at feaff800 (32-bit, non-prefetchable) [size=2K]
Region 1: I/O ports at dc00 [size=128]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1- D2+ AuxCurrent=0mA PME(D0-,D1-,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
0000:02:05.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T [Marvell] (rev 12)
Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 0x04 (16 bytes)
Interrupt: pin A routed to IRQ 22
Region 0: Memory at feaf8000 (32-bit, non-prefetchable) [size=16K]
Region 1: I/O ports at d800 [size=256]
Capabilities: [48] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] Vital Product Data
0000:02:09.0 Multimedia video controller: Brooktree Corporation Bt878 Video Capture (rev 11)
Subsystem: Hauppauge computer works Inc. WinTV Series
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (4000ns min, 10000ns max)
Interrupt: pin A routed to IRQ 21
Region 0: Memory at efefe000 (32-bit, prefetchable) [size=4K]
Capabilities: [44] Vital Product Data
Capabilities: [4c] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
0000:02:09.1 Multimedia controller: Brooktree Corporation Bt878 Audio Capture (rev 11)
Subsystem: Hauppauge computer works Inc. WinTV Series
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (1000ns min, 63750ns max)
Interrupt: pin A routed to IRQ 11
Region 0: Memory at efeff000 (32-bit, prefetchable) [size=4K]
Capabilities: [44] Vital Product Data
Capabilities: [4c] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
0000:02:0a.0 Multimedia controller: Philips Semiconductors SAA7146 (rev 01)
Subsystem: Technotrend Systemtechnik GmbH Technotrend/Hauppauge DVB card rev2.1
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (3750ns min, 9500ns max)
Interrupt: pin A routed to IRQ 22
Region 0: Memory at feaff400 (32-bit, non-prefetchable) [size=512]
0000:02:0b.0 SCSI storage controller: Adaptec AHA-2940U/UW/D / AIC-7881U
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (2000ns min, 2000ns max), Cache Line Size: 0x04 (16 bytes)
Interrupt: pin A routed to IRQ 23
Region 0: I/O ports at d400 [disabled] [size=256]
Region 1: Memory at feafe000 (32-bit, non-prefetchable) [size=4K]
Expansion ROM at feae0000 [disabled] [size=64K]
0000:02:0c.0 Unknown mass storage controller: Promise Technology, Inc. PDC20518 SATAII 150 IDE Controller (rev 02)
Subsystem: Promise Technology, Inc. PDC20518 SATAII 150 IDE Controller
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 72 (1000ns min, 4500ns max), Cache Line Size: 0x01 (4 bytes)
Interrupt: pin A routed to IRQ 20
Region 0: I/O ports at d080 [size=128]
Region 2: I/O ports at de00 [size=256]
Region 3: Memory at feafd000 (32-bit, non-prefetchable) [size=4K]
Region 4: Memory at feac0000 (32-bit, non-prefetchable) [size=128K]
Expansion ROM at feab0000 [disabled] [size=32K]
Capabilities: [60] Power Management version 2
Flags: PMEClk- DSI+ D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
0000:02:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
Subsystem: Realtek Semiconductor Co., Ltd. RT8139
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (8000ns min, 16000ns max)
Interrupt: pin A routed to IRQ 21
Region 0: I/O ports at dd00 [size=256]
Region 1: Memory at feaff000 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
[-- Attachment #1.7: lsusb.vvv --]
[-- Type: text/plain, Size: 12785 bytes --]
Bus 005 Device 001: ID 0000:0000
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 2.00
bDeviceClass 9 Hub
bDeviceSubClass 0 Unused
bDeviceProtocol 1 Single TT
bMaxPacketSize0 8
idVendor 0x0000
idProduct 0x0000
bcdDevice 2.06
iManufacturer 3 Linux 2.6.10-lirc ehci_hcd
iProduct 2 Intel Corp. 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller
iSerial 1 0000:00:1d.7
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 25
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0xe0
Self Powered
Remote Wakeup
MaxPower 0mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 1
bInterfaceClass 9 Hub
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0002 1x 2 bytes
bInterval 12
Hub Descriptor:
bLength 11
bDescriptorType 41
nNbrPorts 8
wHubCharacteristic 0x0008
Ganged power switching
Per-port overcurrent protection
TT think time 8 FS bits
bPwrOn2PwrGood 10 * 2 milli seconds
bHubContrCurrent 0 milli Ampere
DeviceRemovable 0x00 0x01
PortPwrCtrlMask 0x00 0x00
Bus 004 Device 001: ID 0000:0000
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 1.10
bDeviceClass 9 Hub
bDeviceSubClass 0 Unused
bDeviceProtocol 0
bMaxPacketSize0 8
idVendor 0x0000
idProduct 0x0000
bcdDevice 2.06
iManufacturer 3 Linux 2.6.10-lirc uhci_hcd
iProduct 2 Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #4
iSerial 1 0000:00:1d.3
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 25
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0xc0
Self Powered
MaxPower 0mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 1
bInterfaceClass 9 Hub
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0002 1x 2 bytes
bInterval 255
Hub Descriptor:
bLength 9
bDescriptorType 41
nNbrPorts 2
wHubCharacteristic 0x000a
No power switching (usb 1.0)
Per-port overcurrent protection
bPwrOn2PwrGood 1 * 2 milli seconds
bHubContrCurrent 0 milli Ampere
DeviceRemovable 0x00
PortPwrCtrlMask 0x01
Bus 003 Device 001: ID 0000:0000
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 1.10
bDeviceClass 9 Hub
bDeviceSubClass 0 Unused
bDeviceProtocol 0
bMaxPacketSize0 8
idVendor 0x0000
idProduct 0x0000
bcdDevice 2.06
iManufacturer 3 Linux 2.6.10-lirc uhci_hcd
iProduct 2 Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #3
iSerial 1 0000:00:1d.2
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 25
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0xc0
Self Powered
MaxPower 0mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 1
bInterfaceClass 9 Hub
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0002 1x 2 bytes
bInterval 255
Hub Descriptor:
bLength 9
bDescriptorType 41
nNbrPorts 2
wHubCharacteristic 0x000a
No power switching (usb 1.0)
Per-port overcurrent protection
bPwrOn2PwrGood 1 * 2 milli seconds
bHubContrCurrent 0 milli Ampere
DeviceRemovable 0x00
PortPwrCtrlMask 0x01
Bus 002 Device 003: ID 04e6:5115 SCM Microsystems, Inc.
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 1.10
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 16
idVendor 0x04e6 SCM Microsystems, Inc.
idProduct 0x5115
bcdDevice 4.16
iManufacturer 1 SCM Microsystems Inc.
iProduct 2 SCR33x USB Smart Card Reader
iSerial 5 504057B8
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 93
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 3 CCID Class
bmAttributes 0xa0
Remote Wakeup
MaxPower 100mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 3
bInterfaceClass 11 Chip/SmartCard
bInterfaceSubClass 0
bInterfaceProtocol 0
iInterface 4 CCID Interface
ChipCard Interface Descriptor:
bLength 54
bDescriptorType 33
bcdCCID 1.00
nMaxSlotIndex 0
bVoltageSupport 1 5.0V
dwProtocols 3 T=0 T=1
dwDefaultClock 4000
dwMaxiumumClock 12000
bNumClockSupported 0
dwDataRate 9600 bps
dwMaxDataRate 115200 bps
bNumDataRatesSupp. 0
dwMaxIFSD 252
dwSyncProtocols 00000000
dwMechanical 00000000
dwFeatures 000100BA
Auto configuration based on ATR
Auto voltage selection
Auto clock change
Auto baud rate change
Auto PPS made by CCID
TPDU level exchange
dwMaxCCIDMsgLen 263
bClassGetResponse echo
bClassEnvelope echo
wlcdLayout none
bPINSupport 0
bMaxCCIDBusySlots 1
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x01 EP 1 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0040 1x 64 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x82 EP 2 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0040 1x 64 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x83 EP 3 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0010 1x 16 bytes
bInterval 16
Bus 002 Device 001: ID 0000:0000
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 1.10
bDeviceClass 9 Hub
bDeviceSubClass 0 Unused
bDeviceProtocol 0
bMaxPacketSize0 8
idVendor 0x0000
idProduct 0x0000
bcdDevice 2.06
iManufacturer 3 Linux 2.6.10-lirc uhci_hcd
iProduct 2 Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #2
iSerial 1 0000:00:1d.1
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 25
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0xc0
Self Powered
MaxPower 0mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 1
bInterfaceClass 9 Hub
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0002 1x 2 bytes
bInterval 255
Hub Descriptor:
bLength 9
bDescriptorType 41
nNbrPorts 2
wHubCharacteristic 0x000a
No power switching (usb 1.0)
Per-port overcurrent protection
bPwrOn2PwrGood 1 * 2 milli seconds
bHubContrCurrent 0 milli Ampere
DeviceRemovable 0x00
PortPwrCtrlMask 0x01
Bus 001 Device 001: ID 0000:0000
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 1.10
bDeviceClass 9 Hub
bDeviceSubClass 0 Unused
bDeviceProtocol 0
bMaxPacketSize0 8
idVendor 0x0000
idProduct 0x0000
bcdDevice 2.06
iManufacturer 3 Linux 2.6.10-lirc uhci_hcd
iProduct 2 Intel Corp. 82801EB/ER (ICH5/ICH5R) USB UHCI #1
iSerial 1 0000:00:1d.0
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 25
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0xc0
Self Powered
MaxPower 0mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 1
bInterfaceClass 9 Hub
bInterfaceSubClass 0 Unused
bInterfaceProtocol 0
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0002 1x 2 bytes
bInterval 255
Hub Descriptor:
bLength 9
bDescriptorType 41
nNbrPorts 2
wHubCharacteristic 0x000a
No power switching (usb 1.0)
Per-port overcurrent protection
bPwrOn2PwrGood 1 * 2 milli seconds
bHubContrCurrent 0 milli Ampere
DeviceRemovable 0x00
PortPwrCtrlMask 0x01
[-- Attachment #1.8: config-2.6.10 --]
[-- Type: text/plain, Size: 40065 bytes --]
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.10-lirc
# Wed Dec 29 17:33:55 2004
#
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_UID16=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_CLEAN_COMPILE=y
CONFIG_LOCK_KERNEL=y
#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_SYSCTL=y
# CONFIG_AUDIT is not set
CONFIG_LOG_BUF_SHIFT=15
CONFIG_HOTPLUG=y
CONFIG_KOBJECT_UEVENT=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_EMBEDDED is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_FUTEX=y
CONFIG_EPOLL=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SHMEM=y
CONFIG_CC_ALIGN_FUNCTIONS=0
CONFIG_CC_ALIGN_LABELS=0
CONFIG_CC_ALIGN_LOOPS=0
CONFIG_CC_ALIGN_JUMPS=0
# CONFIG_TINY_SHMEM is not set
#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_OBSOLETE_MODPARM=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
CONFIG_MPENTIUM4=y
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_SMP=y
CONFIG_NR_CPUS=8
CONFIG_SCHED_SMT=y
CONFIG_PREEMPT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_TSC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
CONFIG_MICROCODE=m
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_EFI is not set
CONFIG_IRQBALANCE=y
CONFIG_HAVE_DEC_LOCK=y
# CONFIG_REGPARM is not set
#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
# CONFIG_SOFTWARE_SUSPEND is not set
#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_BOOT=y
CONFIG_ACPI_INTERPRETER=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
CONFIG_ACPI_AC=m
CONFIG_ACPI_BATTERY=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_ASUS=m
CONFIG_ACPI_IBM=m
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_BUS=y
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_PCI=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
#
# APM (Advanced Power Management) BIOS Support
#
# CONFIG_APM is not set
#
# CPU Frequency scaling
#
# CONFIG_CPU_FREQ is not set
#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
# CONFIG_PCI_MSI is not set
CONFIG_PCI_LEGACY_PROC=y
CONFIG_PCI_NAMES=y
CONFIG_ISA=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set
#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set
#
# PC-card bridges
#
CONFIG_PCMCIA_PROBE=y
#
# PCI Hotplug Support
#
CONFIG_HOTPLUG_PCI=m
CONFIG_HOTPLUG_PCI_FAKE=m
CONFIG_HOTPLUG_PCI_COMPAQ=m
# CONFIG_HOTPLUG_PCI_COMPAQ_NVRAM is not set
CONFIG_HOTPLUG_PCI_IBM=m
CONFIG_HOTPLUG_PCI_ACPI=m
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
CONFIG_HOTPLUG_PCI_PCIE=m
# CONFIG_HOTPLUG_PCI_PCIE_POLL_EVENT_MODE is not set
CONFIG_HOTPLUG_PCI_SHPC=m
# CONFIG_HOTPLUG_PCI_SHPC_POLL_EVENT_MODE is not set
#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=y
#
# Device Drivers
#
#
# Generic Driver Options
#
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=m
#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set
#
# Parallel port support
#
CONFIG_PARPORT=y
CONFIG_PARPORT_PC=y
CONFIG_PARPORT_PC_CML1=m
# CONFIG_PARPORT_SERIAL is not set
CONFIG_PARPORT_PC_FIFO=y
# CONFIG_PARPORT_PC_SUPERIO is not set
# CONFIG_PARPORT_OTHER is not set
CONFIG_PARPORT_1284=y
#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set
#
# Protocols
#
# CONFIG_ISAPNP is not set
# CONFIG_PNPBIOS is not set
CONFIG_PNPACPI=y
#
# Block devices
#
CONFIG_BLK_DEV_FD=y
# CONFIG_BLK_DEV_XD is not set
CONFIG_PARIDE=m
CONFIG_PARIDE_PARPORT=y
#
# Parallel IDE high-level drivers
#
CONFIG_PARIDE_PD=m
CONFIG_PARIDE_PCD=m
CONFIG_PARIDE_PF=m
CONFIG_PARIDE_PT=m
CONFIG_PARIDE_PG=m
#
# Parallel IDE protocol modules
#
# CONFIG_PARIDE_ATEN is not set
# CONFIG_PARIDE_BPCK is not set
# CONFIG_PARIDE_BPCK6 is not set
# CONFIG_PARIDE_COMM is not set
# CONFIG_PARIDE_DSTR is not set
# CONFIG_PARIDE_FIT2 is not set
# CONFIG_PARIDE_FIT3 is not set
# CONFIG_PARIDE_EPAT is not set
# CONFIG_PARIDE_EPIA is not set
# CONFIG_PARIDE_FRIQ is not set
# CONFIG_PARIDE_FRPW is not set
# CONFIG_PARIDE_KBIC is not set
# CONFIG_PARIDE_KTTI is not set
# CONFIG_PARIDE_ON20 is not set
# CONFIG_PARIDE_ON26 is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=m
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_LBD is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
#
# ATA/ATAPI/MFM/RLL support
#
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y
#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=y
CONFIG_BLK_DEV_IDETAPE=m
CONFIG_BLK_DEV_IDEFLOPPY=m
CONFIG_BLK_DEV_IDESCSI=y
# CONFIG_IDE_TASK_IOCTL is not set
#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_BLK_DEV_GENERIC=y
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
CONFIG_IDEDMA_PCI_AUTO=y
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_PIIX=y
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_ARM is not set
# CONFIG_IDE_CHIPSETS is not set
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_IDEDMA_IVB=y
CONFIG_IDEDMA_AUTO=y
# CONFIG_BLK_DEV_HD is not set
#
# SCSI device support
#
CONFIG_SCSI=y
CONFIG_SCSI_PROC_FS=y
#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
# CONFIG_SCSI_MULTI_LUN is not set
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
#
# SCSI Transport Attributes
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
#
# SCSI low-level drivers
#
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
CONFIG_SCSI_SATA=y
# CONFIG_SCSI_SATA_AHCI is not set
# CONFIG_SCSI_SATA_SVW is not set
CONFIG_SCSI_ATA_PIIX=y
# CONFIG_SCSI_SATA_NV is not set
CONFIG_SCSI_SATA_PROMISE=y
# CONFIG_SCSI_SATA_SX4 is not set
# CONFIG_SCSI_SATA_SIL is not set
# CONFIG_SCSI_SATA_SIS is not set
# CONFIG_SCSI_SATA_ULI is not set
# CONFIG_SCSI_SATA_VIA is not set
# CONFIG_SCSI_SATA_VITESSE is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_EATA_PIO is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_PSI240I is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_ISP is not set
# CONFIG_SCSI_QLOGIC_FC is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA2XXX=y
# CONFIG_SCSI_QLA21XX is not set
# CONFIG_SCSI_QLA22XX is not set
# CONFIG_SCSI_QLA2300 is not set
# CONFIG_SCSI_QLA2322 is not set
# CONFIG_SCSI_QLA6312 is not set
# CONFIG_SCSI_QLA6322 is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
CONFIG_SCSI_DEBUG=m
#
# Old CD-ROM drivers (not SCSI, not IDE)
#
# CONFIG_CD_NO_IDESCSI is not set
#
# Multi-device support (RAID and LVM)
#
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=m
CONFIG_MD_RAID5=y
CONFIG_MD_RAID6=m
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=y
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_MIRROR=m
CONFIG_DM_ZERO=m
#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=m
#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
CONFIG_IEEE1394_OUI_DB=y
CONFIG_IEEE1394_EXTRA_CONFIG_ROMS=y
CONFIG_IEEE1394_CONFIG_ROM_IP1394=y
#
# Device Drivers
#
# CONFIG_IEEE1394_PCILYNX is not set
CONFIG_IEEE1394_OHCI1394=m
#
# Protocol Drivers
#
CONFIG_IEEE1394_VIDEO1394=m
CONFIG_IEEE1394_SBP2=m
CONFIG_IEEE1394_SBP2_PHYS_DMA=y
CONFIG_IEEE1394_ETH1394=m
CONFIG_IEEE1394_DV1394=m
CONFIG_IEEE1394_RAWIO=m
CONFIG_IEEE1394_CMP=m
CONFIG_IEEE1394_AMDTP=m
#
# I2O device support
#
CONFIG_I2O=m
# CONFIG_I2O_CONFIG is not set
CONFIG_I2O_BLOCK=m
CONFIG_I2O_SCSI=m
CONFIG_I2O_PROC=m
#
# Networking support
#
CONFIG_NET=y
#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_NETLINK_DEV=y
CONFIG_UNIX=y
CONFIG_NET_KEY=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
# CONFIG_IP_MULTIPLE_TABLES is not set
# CONFIG_IP_ROUTE_MULTIPATH is not set
# CONFIG_IP_ROUTE_VERBOSE is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
CONFIG_INET_IPCOMP=y
CONFIG_INET_TUNNEL=y
CONFIG_IP_TCPDIAG=y
CONFIG_IP_TCPDIAG_IPV6=y
#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
CONFIG_IPV6=y
CONFIG_IPV6_PRIVACY=y
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
CONFIG_INET6_TUNNEL=m
CONFIG_IPV6_TUNNEL=m
CONFIG_NETFILTER=y
CONFIG_NETFILTER_DEBUG=y
#
# IP: Netfilter Configuration
#
CONFIG_IP_NF_CONNTRACK=m
CONFIG_IP_NF_CT_ACCT=y
CONFIG_IP_NF_CONNTRACK_MARK=y
CONFIG_IP_NF_CT_PROTO_SCTP=m
CONFIG_IP_NF_FTP=m
CONFIG_IP_NF_IRC=m
CONFIG_IP_NF_TFTP=m
CONFIG_IP_NF_AMANDA=m
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_LIMIT=m
CONFIG_IP_NF_MATCH_IPRANGE=m
CONFIG_IP_NF_MATCH_MAC=m
CONFIG_IP_NF_MATCH_PKTTYPE=m
CONFIG_IP_NF_MATCH_MARK=m
CONFIG_IP_NF_MATCH_MULTIPORT=m
CONFIG_IP_NF_MATCH_TOS=m
CONFIG_IP_NF_MATCH_RECENT=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_DSCP=m
CONFIG_IP_NF_MATCH_AH_ESP=m
CONFIG_IP_NF_MATCH_LENGTH=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_MATCH_TCPMSS=m
CONFIG_IP_NF_MATCH_HELPER=m
CONFIG_IP_NF_MATCH_STATE=m
CONFIG_IP_NF_MATCH_CONNTRACK=m
CONFIG_IP_NF_MATCH_OWNER=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_REALM=m
CONFIG_IP_NF_MATCH_SCTP=m
CONFIG_IP_NF_MATCH_COMMENT=m
CONFIG_IP_NF_MATCH_CONNMARK=m
CONFIG_IP_NF_MATCH_HASHLIMIT=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_IP_NF_TARGET_TCPMSS=m
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_SAME=m
# CONFIG_IP_NF_NAT_LOCAL is not set
CONFIG_IP_NF_NAT_SNMP_BASIC=m
CONFIG_IP_NF_NAT_IRC=m
CONFIG_IP_NF_NAT_FTP=m
CONFIG_IP_NF_NAT_TFTP=m
CONFIG_IP_NF_NAT_AMANDA=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_TOS=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_DSCP=m
CONFIG_IP_NF_TARGET_MARK=m
CONFIG_IP_NF_TARGET_CLASSIFY=m
CONFIG_IP_NF_TARGET_CONNMARK=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
# CONFIG_IP_NF_RAW is not set
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m
# CONFIG_IP_NF_COMPAT_IPCHAINS is not set
# CONFIG_IP_NF_COMPAT_IPFWADM is not set
#
# IPv6: Netfilter Configuration
#
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_LIMIT=m
CONFIG_IP6_NF_MATCH_MAC=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_MULTIPORT=m
CONFIG_IP6_NF_MATCH_OWNER=m
CONFIG_IP6_NF_MATCH_MARK=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_AHESP=m
CONFIG_IP6_NF_MATCH_LENGTH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_MARK=m
# CONFIG_IP6_NF_RAW is not set
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_NET_DIVERT is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set
CONFIG_NET_CLS_ROUTE=y
#
# Network testing
#
CONFIG_NET_PKTGEN=m
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_RX is not set
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_BONDING=m
CONFIG_EQUALIZER=m
CONFIG_TUN=m
CONFIG_ETHERTAP=m
# CONFIG_NET_SB1000 is not set
#
# ARCnet devices
#
# CONFIG_ARCNET is not set
#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_LANCE is not set
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_VENDOR_RACAL is not set
#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
# CONFIG_NET_ISA is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_CS89x0 is not set
# CONFIG_DGRS is not set
# CONFIG_EEPRO100 is not set
# CONFIG_E100 is not set
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
CONFIG_8139CP=m
CONFIG_8139TOO=m
# CONFIG_8139TOO_PIO is not set
CONFIG_8139TOO_TUNE_TWISTER=y
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
CONFIG_VIA_RHINE=m
# CONFIG_VIA_RHINE_MMIO is not set
# CONFIG_NET_POCKET is not set
#
# Ethernet (1000 Mbit)
#
CONFIG_ACENIC=m
# CONFIG_ACENIC_OMIT_TIGON_I is not set
CONFIG_DL2K=m
CONFIG_E1000=m
# CONFIG_E1000_NAPI is not set
CONFIG_NS83820=m
CONFIG_HAMACHI=m
CONFIG_YELLOWFIN=m
CONFIG_R8169=m
# CONFIG_R8169_NAPI is not set
CONFIG_SK98LIN=y
# CONFIG_VIA_VELOCITY is not set
CONFIG_TIGON3=m
#
# Ethernet (10000 Mbit)
#
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set
#
# Token Ring devices
#
# CONFIG_TR is not set
#
# Wireless LAN (non-hamradio)
#
# CONFIG_NET_RADIO is not set
#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
CONFIG_PLIP=m
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPPOE=m
CONFIG_SLIP=m
# CONFIG_SLIP_COMPRESSED is not set
# CONFIG_SLIP_SMART is not set
# CONFIG_SLIP_MODE_SLIP6 is not set
# CONFIG_NET_FC is not set
CONFIG_SHAPER=m
CONFIG_NETCONSOLE=m
#
# ISDN subsystem
#
# CONFIG_ISDN is not set
#
# Telephony Support
#
# CONFIG_PHONE is not set
#
# Input device support
#
CONFIG_INPUT=y
#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_TSDEV=m
CONFIG_INPUT_TSDEV_SCREEN_X=240
CONFIG_INPUT_TSDEV_SCREEN_Y=320
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set
#
# Input I/O drivers
#
CONFIG_GAMEPORT=m
CONFIG_SOUND_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
CONFIG_GAMEPORT_L4=m
CONFIG_GAMEPORT_EMU10K1=m
CONFIG_GAMEPORT_VORTEX=m
CONFIG_GAMEPORT_FM801=m
CONFIG_GAMEPORT_CS461x=m
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=m
CONFIG_SERIO_CT82C710=m
CONFIG_SERIO_PARKBD=m
CONFIG_SERIO_PCIPS2=m
# CONFIG_SERIO_RAW is not set
#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_SERIAL=m
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=m
CONFIG_INPUT_UINPUT=m
#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_SERIAL_NONSTANDARD is not set
#
# Serial drivers
#
CONFIG_SERIAL_8250=m
# CONFIG_SERIAL_8250_ACPI is not set
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
# CONFIG_SERIAL_8250_MANY_PORTS is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
# CONFIG_SERIAL_8250_MULTIPORT is not set
# CONFIG_SERIAL_8250_RSA is not set
#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=m
#
# Linux InfraRed Controller
#
CONFIG_LIRC_SUPPORT=m
CONFIG_LIRC_MAX_DEV=2
# CONFIG_LIRC_I2C is not set
# CONFIG_LIRC_GPIO is not set
# CONFIG_LIRC_BT829 is not set
# CONFIG_LIRC_IT87 is not set
# CONFIG_LIRC_ATIUSB is not set
# CONFIG_LIRC_MCEUSB is not set
CONFIG_LIRC_SERIAL=m
CONFIG_LIRC_HOMEBREW=y
# CONFIG_LIRC_SERIAL_ANIMAX is not set
# CONFIG_LIRC_SERIAL_IRDEO is not set
# CONFIG_LIRC_SERIAL_IRDEO_REMOTE is not set
# CONFIG_LIRC_SERIAL_TRANSMITTER is not set
# CONFIG_LIRC_SERIAL_IGOR is not set
CONFIG_LIRC_SERIAL_COM1=y
# CONFIG_LIRC_SERIAL_COM2 is not set
# CONFIG_LIRC_SERIAL_COM3 is not set
# CONFIG_LIRC_SERIAL_COM4 is not set
# CONFIG_LIRC_SERIAL_OTHER is not set
CONFIG_LIRC_PORT_SERIAL=0x3f8
CONFIG_LIRC_IRQ_SERIAL=0x4
# CONFIG_LIRC_SIR is not set
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_PRINTER=y
# CONFIG_LP_CONSOLE is not set
# CONFIG_PPDEV is not set
# CONFIG_TIPAR is not set
#
# IPMI
#
CONFIG_IPMI_HANDLER=y
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=y
# CONFIG_IPMI_SI is not set
CONFIG_IPMI_WATCHDOG=m
# CONFIG_IPMI_POWEROFF is not set
#
# Watchdog Cards
#
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
#
# Watchdog Device Drivers
#
CONFIG_SOFT_WATCHDOG=m
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_SC520_WDT is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I8XX_TCO is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_SCx200_WDT is not set
# CONFIG_60XX_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_MACHZ_WDT is not set
#
# ISA-based Watchdog Cards
#
# CONFIG_PCWATCHDOG is not set
# CONFIG_MIXCOMWD is not set
# CONFIG_WDT is not set
#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set
#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_HW_RANDOM=y
CONFIG_NVRAM=m
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
#
# Ftape, the floppy tape device driver
#
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
# CONFIG_AGP_INTEL is not set
# CONFIG_AGP_INTEL_MCH is not set
CONFIG_AGP_NVIDIA=m
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_MWAVE is not set
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
CONFIG_HANGCHECK_TIMER=m
#
# I2C support
#
CONFIG_I2C=m
CONFIG_I2C_CHARDEV=m
#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCF=m
CONFIG_I2C_ALGOPCA=m
#
# I2C Hardware Bus support
#
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
CONFIG_I2C_I810=m
CONFIG_I2C_ISA=m
CONFIG_I2C_NFORCE2=m
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
CONFIG_I2C_PIIX4=m
CONFIG_I2C_PROSAVAGE=m
CONFIG_I2C_SAVAGE4=m
CONFIG_SCx200_ACB=m
CONFIG_I2C_SIS5595=m
CONFIG_I2C_SIS630=m
CONFIG_I2C_SIS96X=m
CONFIG_I2C_STUB=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m
CONFIG_I2C_VOODOO3=m
CONFIG_I2C_PCA_ISA=m
#
# Hardware Sensors Chip support
#
CONFIG_I2C_SENSOR=m
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ASB100=m
CONFIG_SENSORS_DS1621=m
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_SMSC47M1=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_W83781D=m
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83627HF=m
#
# Other I2C Chip support
#
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_RTC8564=m
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set
#
# Misc devices
#
# CONFIG_IBM_ASM is not set
#
# Multimedia devices
#
CONFIG_VIDEO_DEV=y
#
# Video For Linux
#
#
# Video Adapters
#
CONFIG_VIDEO_BT848=m
CONFIG_VIDEO_PMS=m
CONFIG_VIDEO_BWQCAM=m
CONFIG_VIDEO_CQCAM=m
CONFIG_VIDEO_W9966=m
CONFIG_VIDEO_CPIA=m
CONFIG_VIDEO_CPIA_PP=m
CONFIG_VIDEO_CPIA_USB=m
CONFIG_VIDEO_SAA5246A=m
CONFIG_VIDEO_SAA5249=m
CONFIG_TUNER_3036=m
CONFIG_VIDEO_STRADIS=m
CONFIG_VIDEO_ZORAN=m
CONFIG_VIDEO_ZORAN_BUZ=m
CONFIG_VIDEO_ZORAN_DC10=m
CONFIG_VIDEO_ZORAN_DC30=m
CONFIG_VIDEO_ZORAN_LML33=m
CONFIG_VIDEO_ZORAN_LML33R10=m
CONFIG_VIDEO_SAA7134=m
CONFIG_VIDEO_MXB=m
CONFIG_VIDEO_DPC=m
CONFIG_VIDEO_HEXIUM_ORION=m
CONFIG_VIDEO_HEXIUM_GEMINI=m
CONFIG_VIDEO_CX88=m
CONFIG_VIDEO_OVCAMCHIP=m
#
# Radio Adapters
#
CONFIG_RADIO_CADET=m
CONFIG_RADIO_RTRACK=m
CONFIG_RADIO_RTRACK2=m
CONFIG_RADIO_AZTECH=m
CONFIG_RADIO_GEMTEK=m
CONFIG_RADIO_GEMTEK_PCI=m
CONFIG_RADIO_MAXIRADIO=m
CONFIG_RADIO_MAESTRO=m
CONFIG_RADIO_SF16FMI=m
# CONFIG_RADIO_SF16FMR2 is not set
CONFIG_RADIO_TERRATEC=m
CONFIG_RADIO_TRUST=m
CONFIG_RADIO_TYPHOON=m
# CONFIG_RADIO_TYPHOON_PROC_FS is not set
CONFIG_RADIO_ZOLTRIX=m
#
# Digital Video Broadcasting Devices
#
CONFIG_DVB=y
CONFIG_DVB_CORE=y
#
# Supported SAA7146 based PCI Adapters
#
CONFIG_DVB_AV7110=m
CONFIG_DVB_AV7110_OSD=y
CONFIG_DVB_BUDGET=m
CONFIG_DVB_BUDGET_CI=m
CONFIG_DVB_BUDGET_AV=m
CONFIG_DVB_BUDGET_PATCH=m
#
# Supported USB Adapters
#
CONFIG_DVB_TTUSB_BUDGET=m
CONFIG_DVB_TTUSB_DEC=m
CONFIG_DVB_DIBUSB=m
CONFIG_DVB_DIBUSB_MISDESIGNED_DEVICES=y
CONFIG_DVB_DIBCOM_DEBUG=y
CONFIG_DVB_CINERGYT2=m
CONFIG_DVB_CINERGYT2_TUNING=y
CONFIG_DVB_CINERGYT2_STREAM_URB_COUNT=32
CONFIG_DVB_CINERGYT2_STREAM_BUF_SIZE=512
CONFIG_DVB_CINERGYT2_QUERY_INTERVAL=250
# CONFIG_DVB_CINERGYT2_ENABLE_RC_INPUT_DEVICE is not set
#
# Supported FlexCopII (B2C2) Adapters
#
CONFIG_DVB_B2C2_SKYSTAR=m
CONFIG_DVB_B2C2_USB=m
#
# Supported BT878 Adapters
#
CONFIG_DVB_BT8XX=m
#
# Supported DVB Frontends
#
#
# Customise DVB Frontends
#
#
# DVB-S (satellite) frontends
#
CONFIG_DVB_STV0299=m
CONFIG_DVB_CX24110=m
CONFIG_DVB_TDA8083=m
CONFIG_DVB_TDA80XX=m
CONFIG_DVB_MT312=m
CONFIG_DVB_VES1X93=m
#
# DVB-T (terrestrial) frontends
#
CONFIG_DVB_SP8870=m
CONFIG_DVB_SP887X=m
CONFIG_DVB_CX22700=m
CONFIG_DVB_CX22702=m
CONFIG_DVB_L64781=m
CONFIG_DVB_TDA1004X=m
CONFIG_DVB_NXT6000=m
CONFIG_DVB_MT352=m
CONFIG_DVB_DIB3000MB=m
CONFIG_DVB_DIB3000MC=m
#
# DVB-C (cable) frontends
#
CONFIG_DVB_ATMEL_AT76C651=m
CONFIG_DVB_VES1820=m
CONFIG_DVB_TDA10021=m
CONFIG_DVB_STV0297=m
CONFIG_VIDEO_SAA7146=m
CONFIG_VIDEO_SAA7146_VV=m
CONFIG_VIDEO_VIDEOBUF=m
CONFIG_VIDEO_TUNER=m
CONFIG_VIDEO_BUF=m
CONFIG_VIDEO_BTCX=m
CONFIG_VIDEO_IR=m
#
# Graphics support
#
CONFIG_FB=y
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_VESA=y
CONFIG_VIDEO_SELECT=y
# CONFIG_FB_HGA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I810 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON_OLD is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_VIRTUAL is not set
#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
#
# Logo configuration
#
CONFIG_LOGO=y
CONFIG_LOGO_LINUX_MONO=y
CONFIG_LOGO_LINUX_VGA16=y
CONFIG_LOGO_LINUX_CLUT224=y
#
# Sound
#
CONFIG_SOUND=y
#
# Advanced Linux Sound Architecture
#
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_RTCTIMER=m
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
#
# Generic devices
#
CONFIG_SND_MPU401_UART=m
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
CONFIG_SND_MTPAV=m
CONFIG_SND_SERIAL_U16550=m
CONFIG_SND_MPU401=m
#
# ISA devices
#
# CONFIG_SND_AD1848 is not set
# CONFIG_SND_CS4231 is not set
# CONFIG_SND_CS4232 is not set
# CONFIG_SND_CS4236 is not set
# CONFIG_SND_ES1688 is not set
# CONFIG_SND_ES18XX is not set
# CONFIG_SND_GUSCLASSIC is not set
# CONFIG_SND_GUSEXTREME is not set
# CONFIG_SND_GUSMAX is not set
# CONFIG_SND_INTERWAVE is not set
# CONFIG_SND_INTERWAVE_STB is not set
# CONFIG_SND_OPTI92X_AD1848 is not set
# CONFIG_SND_OPTI92X_CS4231 is not set
# CONFIG_SND_OPTI93X is not set
# CONFIG_SND_SB8 is not set
# CONFIG_SND_SB16 is not set
# CONFIG_SND_SBAWE is not set
# CONFIG_SND_WAVEFRONT is not set
# CONFIG_SND_CMI8330 is not set
# CONFIG_SND_OPL3SA2 is not set
# CONFIG_SND_SGALAXY is not set
# CONFIG_SND_SSCAPE is not set
#
# PCI devices
#
CONFIG_SND_AC97_CODEC=m
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_YMFPCI is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_FM801 is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=m
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VX222 is not set
#
# USB devices
#
CONFIG_SND_USB_AUDIO=m
CONFIG_SND_USB_USX2Y=m
#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
#
# USB support
#
CONFIG_USB=y
CONFIG_USB_DEBUG=y
#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_SPLIT_ISO=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_OHCI_HCD=m
CONFIG_USB_UHCI_HCD=m
CONFIG_USB_SL811_HCD=m
#
# USB Device Class drivers
#
CONFIG_USB_AUDIO=m
CONFIG_USB_BLUETOOTH_TTY=m
CONFIG_USB_MIDI=m
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
CONFIG_USB_STORAGE_DEBUG=y
CONFIG_USB_STORAGE_RW_DETECT=y
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_DPCM=y
CONFIG_USB_STORAGE_HP8200e=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_USB_HIDINPUT=y
# CONFIG_HID_FF is not set
CONFIG_USB_HIDDEV=y
CONFIG_USB_AIPTEK=m
CONFIG_USB_WACOM=m
CONFIG_USB_KBTAB=m
CONFIG_USB_POWERMATE=m
CONFIG_USB_MTOUCH=m
CONFIG_USB_EGALAX=m
CONFIG_USB_XPAD=m
CONFIG_USB_ATI_REMOTE=m
#
# USB Imaging devices
#
CONFIG_USB_MDC800=m
CONFIG_USB_MICROTEK=m
CONFIG_USB_HPUSBSCSI=m
#
# USB Multimedia devices
#
CONFIG_USB_DABUSB=m
CONFIG_USB_VICAM=m
CONFIG_USB_DSBR=m
CONFIG_USB_IBMCAM=m
CONFIG_USB_KONICAWC=m
CONFIG_USB_OV511=m
CONFIG_USB_SE401=m
CONFIG_USB_SN9C102=m
CONFIG_USB_STV680=m
CONFIG_USB_W9968CF=m
#
# USB Network Adapters
#
CONFIG_USB_CATC=m
CONFIG_USB_KAWETH=m
CONFIG_USB_PEGASUS=m
CONFIG_USB_RTL8150=m
CONFIG_USB_USBNET=m
#
# USB Host-to-Host Cables
#
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_GENESYS=y
CONFIG_USB_NET1080=y
CONFIG_USB_PL2301=y
CONFIG_USB_KC2190=y
#
# Intelligent USB Devices/Gadgets
#
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_ZAURUS=y
CONFIG_USB_CDCETHER=y
#
# USB Network Adapters
#
CONFIG_USB_AX8817X=y
#
# USB port drivers
#
CONFIG_USB_USS720=m
#
# USB Serial Converter support
#
CONFIG_USB_SERIAL=m
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_BELKIN=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
# CONFIG_USB_SERIAL_IPW is not set
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KEYSPAN_MPR=y
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
CONFIG_USB_SERIAL_KLSI=m
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_PL2303=m
CONFIG_USB_SERIAL_SAFE=m
CONFIG_USB_SERIAL_SAFE_PADDED=y
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_OMNINET=m
CONFIG_USB_EZUSB=y
#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=m
CONFIG_USB_EMI26=m
CONFIG_USB_TIGL=m
CONFIG_USB_AUERSWALD=m
CONFIG_USB_RIO500=m
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_LED=m
CONFIG_USB_CYTHERM=m
CONFIG_USB_PHIDGETKIT=m
CONFIG_USB_PHIDGETSERVO=m
CONFIG_USB_TEST=m
#
# USB ATM/DSL drivers
#
#
# USB Gadget Support
#
CONFIG_USB_GADGET=m
# CONFIG_USB_GADGET_DEBUG_FILES is not set
CONFIG_USB_GADGET_NET2280=y
CONFIG_USB_NET2280=m
# CONFIG_USB_GADGET_PXA2XX is not set
# CONFIG_USB_GADGET_GOKU is not set
# CONFIG_USB_GADGET_SA1100 is not set
# CONFIG_USB_GADGET_LH7A40X is not set
# CONFIG_USB_GADGET_DUMMY_HCD is not set
# CONFIG_USB_GADGET_OMAP is not set
CONFIG_USB_GADGET_DUALSPEED=y
# CONFIG_USB_ZERO is not set
CONFIG_USB_ETH=m
CONFIG_USB_ETH_RNDIS=y
CONFIG_USB_GADGETFS=m
# CONFIG_USB_FILE_STORAGE is not set
# CONFIG_USB_G_SERIAL is not set
#
# MMC/SD Card support
#
# CONFIG_MMC is not set
#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
# CONFIG_EXT3_FS_POSIX_ACL is not set
# CONFIG_EXT3_FS_SECURITY is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=m
CONFIG_REISERFS_CHECK=y
CONFIG_REISERFS_PROC_INFO=y
# CONFIG_REISERFS_FS_XATTR is not set
CONFIG_JFS_FS=m
CONFIG_JFS_POSIX_ACL=y
CONFIG_JFS_DEBUG=y
CONFIG_JFS_STATISTICS=y
CONFIG_FS_POSIX_ACL=y
CONFIG_XFS_FS=m
CONFIG_XFS_RT=y
CONFIG_XFS_QUOTA=y
# CONFIG_XFS_SECURITY is not set
CONFIG_XFS_POSIX_ACL=y
CONFIG_MINIX_FS=m
CONFIG_ROMFS_FS=m
CONFIG_QUOTA=y
CONFIG_QFMT_V1=m
CONFIG_QFMT_V2=m
CONFIG_QUOTACTL=y
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_ZISOFS_FS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y
#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=m
CONFIG_NTFS_DEBUG=y
CONFIG_NTFS_RW=y
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_SYSFS=y
# CONFIG_DEVFS_FS is not set
# CONFIG_DEVPTS_FS_XATTR is not set
CONFIG_TMPFS=y
# CONFIG_TMPFS_XATTR is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
#
# Miscellaneous filesystems
#
CONFIG_ADFS_FS=m
CONFIG_ADFS_FS_RW=y
CONFIG_AFFS_FS=m
CONFIG_HFS_FS=m
# CONFIG_HFSPLUS_FS is not set
CONFIG_BEFS_FS=m
CONFIG_BEFS_DEBUG=y
CONFIG_BFS_FS=m
CONFIG_EFS_FS=m
CONFIG_CRAMFS=m
CONFIG_VXFS_FS=m
CONFIG_HPFS_FS=m
CONFIG_QNX4FS_FS=m
CONFIG_QNX4FS_RW=y
CONFIG_SYSV_FS=m
CONFIG_UFS_FS=m
CONFIG_UFS_FS_WRITE=y
#
# Network File Systems
#
CONFIG_NFS_FS=y
# CONFIG_NFS_V3 is not set
# CONFIG_NFS_V4 is not set
# CONFIG_NFS_DIRECTIO is not set
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_SUNRPC=y
# CONFIG_RPCSEC_GSS_KRB5 is not set
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-15"
CONFIG_NLS_CODEPAGE_437=m
CONFIG_NLS_CODEPAGE_737=m
CONFIG_NLS_CODEPAGE_775=m
CONFIG_NLS_CODEPAGE_850=y
CONFIG_NLS_CODEPAGE_852=y
CONFIG_NLS_CODEPAGE_855=m
CONFIG_NLS_CODEPAGE_857=m
CONFIG_NLS_CODEPAGE_860=m
CONFIG_NLS_CODEPAGE_861=m
CONFIG_NLS_CODEPAGE_862=m
CONFIG_NLS_CODEPAGE_863=m
CONFIG_NLS_CODEPAGE_864=m
CONFIG_NLS_CODEPAGE_865=m
CONFIG_NLS_CODEPAGE_866=m
CONFIG_NLS_CODEPAGE_869=m
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
CONFIG_NLS_CODEPAGE_949=m
CONFIG_NLS_CODEPAGE_874=m
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
CONFIG_NLS_ISO8859_2=m
CONFIG_NLS_ISO8859_3=m
CONFIG_NLS_ISO8859_4=m
CONFIG_NLS_ISO8859_5=m
CONFIG_NLS_ISO8859_6=m
CONFIG_NLS_ISO8859_7=m
CONFIG_NLS_ISO8859_9=m
CONFIG_NLS_ISO8859_13=m
CONFIG_NLS_ISO8859_14=m
CONFIG_NLS_ISO8859_15=y
CONFIG_NLS_KOI8_R=m
CONFIG_NLS_KOI8_U=m
CONFIG_NLS_UTF8=m
#
# Profiling support
#
# CONFIG_PROFILING is not set
#
# Kernel hacking
#
# CONFIG_DEBUG_KERNEL is not set
# CONFIG_FRAME_POINTER is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_4KSTACKS is not set
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set
#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_DES=y
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_AES_586=y
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_TEST=m
#
# Library routines
#
CONFIG_CRC_CCITT=m
CONFIG_CRC32=y
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_PC=y
[-- Attachment #1.9: Type: text/plain, Size: 49 bytes --]
And for completeness, here is the lsmod output:
[-- Attachment #1.10: lsmod --]
[-- Type: text/plain, Size: 2175 bytes --]
Module Size Used by
w83627hf 26016 0
lm75 6932 0
eeprom 6680 0
i2c_sensor 3968 3 w83627hf,lm75,eeprom
i2c_isa 2816 0
8139cp 17152 0
eth1394 18696 0
dvb_ttpci 78864 3
saa7146_vv 43520 1 dvb_ttpci
saa7146 15656 2 dvb_ttpci,saa7146_vv
ves1820 6148 1 dvb_ttpci
stv0299 9860 1 dvb_ttpci
tda8083 6020 1 dvb_ttpci
stv0297 7936 1 dvb_ttpci
sp8870 7436 1 dvb_ttpci
ves1x93 6788 1 dvb_ttpci
ttpci_eeprom 3328 1 dvb_ttpci
ohci1394 30724 0
ieee1394 300344 2 eth1394,ohci1394
snd_intel8x0 27808 1
snd_ac97_codec 66528 1 snd_intel8x0
snd_pcm_oss 47268 0
snd_mixer_oss 17024 2 snd_pcm_oss
snd_pcm 80004 3 snd_intel8x0,snd_ac97_codec,snd_pcm_oss
snd_timer 21380 1 snd_pcm
snd 46052 6 snd_intel8x0,snd_ac97_codec,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer
snd_page_alloc 8324 2 snd_intel8x0,snd_pcm
i2c_i801 8460 0
ehci_hcd 41604 0
uhci_hcd 30348 0
shpchp 90372 0
pci_hotplug 11524 1 shpchp
8250_pnp 9088 0
8250 20708 1 8250_pnp
serial_core 18816 1 8250
pcspkr 4300 0
tsdev 6592 0
tuner 20772 0
tvaudio 20256 0
msp3400 24104 0
bttv 142288 0
video_buf 17540 2 saa7146_vv,bttv
firmware_class 8448 3 dvb_ttpci,sp8870,bttv
i2c_algo_bit 9608 1 bttv
btcx_risc 4872 1 bttv
i2c_core 18816 19 w83627hf,lm75,eeprom,i2c_sensor,i2c_isa,dvb_ttpci,ves1820,stv0299,tda8083,stv0297,sp8870,ves1x93,ttpci_eeprom,i2c_i801,tuner,tvaudio,msp3400,bttv,i2c_algo_bit
lirc_serial 12800 1
lirc_dev 12040 2 lirc_serial
8139too 22784 0
[-- Attachment #1.11: Type: text/plain, Size: 38 bytes --]
and the output of cat /proc/scsi/scsi
[-- Attachment #1.12: proc-scsi --]
[-- Type: text/plain, Size: 1282 bytes --]
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: TOSHIBA Model: CD-ROM XM-3801TA Rev: 3386
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: IBM Model: DCAS-34330 Rev: S60B
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 05 Lun: 00
Vendor: HP Model: C5110A Rev: 3701
Type: Processor ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
Vendor: _NEC Model: DVD_RW ND-1300A Rev: 1.05
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: Maxtor 6Y120M0 Rev: YAR5
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: Maxtor 6Y120M0 Rev: YAR5
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi5 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: ST3200822AS Rev: 3.01
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi7 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: ST3200822AS Rev: 3.01
Type: Direct-Access ANSI SCSI revision: 05
[-- Attachment #1.13: Type: text/plain, Size: 387 bytes --]
All help very much appreciated.
If you need more info, please let me know.
Regards,
Georg
--
Georg C. F. Greve <greve@fsfeurope.org>
Free Software Foundation Europe (http://fsfeurope.org)
GNU Business Network (http://mailman.gnubiz.org)
Brave GNU World (http://brave-gnu-world.org)
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard
2004-12-30 0:31 PROBLEM: Kernel 2.6.10 crashing repeatedly and hard Georg C. F. Greve
@ 2004-12-30 16:23 ` Georg C. F. Greve
2004-12-30 17:39 ` Peter T. Breuer
0 siblings, 1 reply; 172+ messages in thread
From: Georg C. F. Greve @ 2004-12-30 16:23 UTC (permalink / raw)
To: linux-kernel; +Cc: dm-crypt
[-- Attachment #1: Type: text/plain, Size: 940 bytes --]
[ update ]
Okay, tried to find out what is causing the kernel to crash and so I
replaced the dm-crypt part by cryptoloop: same effect. Then I tried
ext3 on top of LVM2 RAID5 with no encryption and it still crashes.
Not sure what is causing the problem exactly, but it does not seem
that dm-crypt is to blame anymore.
The message I saw on the remote console when it crashed with pure ext3
on raid5 was:
Assertion failure in journal_start() at fs/jbd/transaction.c:271: "handle->h_transaction->t_journal == journal"
Hope this helps -- filed the bug as #3968 on bugzilla.kernel.org, more
info at http://bugzilla.kernel.org/show_bug.cgi?id=3968
Help appreciated, let me know if you have an idea.
Regards,
Georg
--
Georg C. F. Greve <greve@gnu.org>
Free Software Foundation Europe (http://fsfeurope.org)
Brave GNU World (http://brave-gnu-world.org)
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 172+ messages in thread
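For anyone trying to reproduce the failing stack Georg describes (ext3 over dm-crypt over LVM2 over software RAID5), it can be assembled roughly as follows. All device names, sizes and cipher settings here are examples, not values from his report, and early cryptsetup releases vary somewhat in syntax:

  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/hda3 /dev/hdc3 /dev/hde3
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0
  lvcreate -L 100G -n data vg0
  cryptsetup -c aes -s 256 create cdata /dev/vg0/data   # plain dm-crypt mapping over the LV
  mke2fs -j /dev/mapper/cdata                           # ext3 on top
  mount /dev/mapper/cdata /mnt/data

Dropping the cryptsetup step and running mke2fs -j directly on the logical volume gives the unencrypted variant he says still crashes.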
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard
2004-12-30 16:23 ` Georg C. F. Greve
@ 2004-12-30 17:39 ` Peter T. Breuer
2004-12-30 17:53 ` Sandro Dentella
` (2 more replies)
0 siblings, 3 replies; 172+ messages in thread
From: Peter T. Breuer @ 2004-12-30 17:39 UTC (permalink / raw)
To: linux-raid; +Cc: linux-kernel, dm-crypt
In gmane.linux.raid Georg C. F. Greve <greve@fsfeurope.org> wrote:
> The message I saw on the remote console when it crashed with pure ext3
> on raid5 was:
>
> Assertion failure in journal_start() at fs/jbd/transaction.c:271: "handle->h_transaction->t_journal == journal"
>
Yes, well, don't put the journal on the raid partition. Put it
elsewhere (anyway, journalling and raid do not mix, as write ordering
is not - deliberately - preserved in raid, as far as I can tell).
Peter
^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard 2004-12-30 17:39 ` Peter T. Breuer @ 2004-12-30 17:53 ` Sandro Dentella 2004-12-30 18:31 ` Peter T. Breuer 2004-12-30 19:50 ` Michael Tokarev 2005-01-07 6:21 ` PROBLEM: Kernel 2.6.10 crashing repeatedly and hard Clemens Schwaighofer 2 siblings, 1 reply; 172+ messages in thread From: Sandro Dentella @ 2004-12-30 17:53 UTC (permalink / raw) To: linux-raid, linux-kernel > Yes, well, don't put the journal on the raid partition. Put it > elsewhere (anyway, journalling and raid do not mix, as write ordering > is not - deliberately - preserved in raid, as far as I can tell). ???, do you mean it? which filesystem would you use for a 2TB RAID5 array? I always used reiserfs for raid1/raid5 arrays... sandro *:-) -- Sandro Dentella *:-) e-mail: sandro@e-den.it http://www.tksql.org TkSQL Home page - My GPL work ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard 2004-12-30 17:53 ` Sandro Dentella @ 2004-12-30 18:31 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2004-12-30 18:31 UTC (permalink / raw) To: linux-raid; +Cc: linux-kernel In gmane.linux.raid Sandro Dentella <sandro@e-den.it> wrote: > > Yes, well, don't put the journal on the raid partition. Put it > > elsewhere (anyway, journalling and raid do not mix, as write ordering > > is not - deliberately - preserved in raid, as far as I can tell). > > ???, do you mean it? which filesystem would you use for a 2TB RAID5 array? I Whatever one you are using now. Just turn off journalling, or at least move the journal somewhere else (and safe!). Or perhaps use metadata-only journalling (but reiser does that by default, does it not?). That should keep you happy. > always used reiserfs for raid1/raid5 arrays... Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard 2004-12-30 17:39 ` Peter T. Breuer 2004-12-30 17:53 ` Sandro Dentella @ 2004-12-30 19:50 ` Michael Tokarev [not found] ` <41D45C1F.5030307-XAri/EZa3C4vJsYlp49lxw@public.gmane.org> 2004-12-30 21:39 ` Peter T. Breuer 2005-01-07 6:21 ` PROBLEM: Kernel 2.6.10 crashing repeatedly and hard Clemens Schwaighofer 2 siblings, 2 replies; 172+ messages in thread From: Michael Tokarev @ 2004-12-30 19:50 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid, linux-kernel, dm-crypt Peter T. Breuer wrote: > In gmane.linux.raid Georg C. F. Greve <greve@fsfeurope.org> wrote: > > Yes, well, don't put the journal on the raid partition. Put it > elsewhere (anyway, journalling and raid do not mix, as write ordering > is not - deliberately - preserved in raid, as far as I can tell). This is a sort of a nonsense, really. Both claims, it seems. I can't say for sure whenever write ordering is preserved by raid -- it should, and if it isn't, it's a bug and should be fixed. Nothing else is wrong with placing journal into raid (the same as the filesystem in question). Suggesting to remove journal just isn't fair: the journal is here for a reason. And, finally, the kernel should not crash. If something like this is unsupported, it should refuse to do so, instead of crashing randomly. /mjt ^ permalink raw reply [flat|nested] 172+ messages in thread
[parent not found: <41D45C1F.5030307-XAri/EZa3C4vJsYlp49lxw@public.gmane.org>]
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard [not found] ` <41D45C1F.5030307-XAri/EZa3C4vJsYlp49lxw@public.gmane.org> @ 2004-12-30 20:54 ` berk walker 2005-01-01 13:39 ` Helge Hafting 1 sibling, 0 replies; 172+ messages in thread From: berk walker @ 2004-12-30 20:54 UTC (permalink / raw) To: Michael Tokarev Cc: Peter T. Breuer, linux-raid-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, dm-crypt-4q3lyFh4P1g Michael Tokarev wrote: > Peter T. Breuer wrote: > >> In gmane.linux.raid Georg C. F. Greve <greve-BSDwwRMYa8fNLxjTenLetw@public.gmane.org> wrote: >> >> Yes, well, don't put the journal on the raid partition. Put it >> elsewhere (anyway, journalling and raid do not mix, as write ordering >> is not - deliberately - preserved in raid, as far as I can tell). > > > This is a sort of a nonsense, really. Both claims, it seems. > I can't say for sure whenever write ordering is preserved by > raid -- it should, and if it isn't, it's a bug and should be > fixed. Nothing else is wrong with placing journal into raid > (the same as the filesystem in question). Suggesting to remove > journal just isn't fair: the journal is here for a reason. > And, finally, the kernel should not crash. If something like > this is unsupported, it should refuse to do so, instead of > crashing randomly. > > /mjt > - > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > I might have missed some of this thread.... but have you tried this on a completely different box? I have seen, and am fighting some problems such as yours, and having nothing to do with raid. If you haven't, then try it. You might get different results. Hardware can sometimes be a dog to chase down, problemwise. b- ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard [not found] ` <41D45C1F.5030307-XAri/EZa3C4vJsYlp49lxw@public.gmane.org> 2004-12-30 20:54 ` berk walker @ 2005-01-01 13:39 ` Helge Hafting 1 sibling, 0 replies; 172+ messages in thread From: Helge Hafting @ 2005-01-01 13:39 UTC (permalink / raw) To: Michael Tokarev Cc: Peter T. Breuer, linux-raid-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, dm-crypt-4q3lyFh4P1g On Thu, Dec 30, 2004 at 10:50:55PM +0300, Michael Tokarev wrote: > Peter T. Breuer wrote: > >In gmane.linux.raid Georg C. F. Greve <greve-BSDwwRMYa8fNLxjTenLetw@public.gmane.org> wrote: > > > >Yes, well, don't put the journal on the raid partition. Put it > >elsewhere (anyway, journalling and raid do not mix, as write ordering > >is not - deliberately - preserved in raid, as far as I can tell). > > This is a sort of a nonsense, really. Both claims, it seems. > I can't say for sure whenever write ordering is preserved by > raid -- it should, and if it isn't, it's a bug and should be > fixed. Nothing else is wrong with placing journal into raid > (the same as the filesystem in question). Suggesting to remove > journal just isn't fair: the journal is here for a reason. > And, finally, the kernel should not crash. If something like > this is unsupported, it should refuse to do so, instead of > crashing randomly. Write ordering trouble shouldn't crash the kernel, the way I understand it. Your journalled fs could be lost/inconsistent if the machine crashes for other reasons, due to bad write ordering. But the ordering trouble shouldn't cause a crash, and all should be fine as soon as all the writes complete without other incidents. Helge Hafting ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard 2004-12-30 19:50 ` Michael Tokarev [not found] ` <41D45C1F.5030307-XAri/EZa3C4vJsYlp49lxw@public.gmane.org> @ 2004-12-30 21:39 ` Peter T. Breuer 2005-01-02 19:42 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Andy Smith 1 sibling, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2004-12-30 21:39 UTC (permalink / raw) To: linux-raid; +Cc: linux-kernel, dm-crypt In gmane.linux.raid Michael Tokarev <mjt@tls.msk.ru> wrote: > Peter T. Breuer wrote: > > In gmane.linux.raid Georg C. F. Greve <greve@fsfeurope.org> wrote: > > > > Yes, well, don't put the journal on the raid partition. Put it > > elsewhere (anyway, journalling and raid do not mix, as write ordering > > is not - deliberately - preserved in raid, as far as I can tell). > > This is a sort of a nonsense, really. Both claims, it seems. It's perfectly correct, as far as I know! > I can't say for sure whenever write ordering is preserved by > raid -- There is nothing that attempts expliciitly to maintain the ordering in RAID (talking about mirroring here). Mirror requests are submitted _asynchronously_ to the block subsystem for each device in the mirror, for each incoming request. The kernel doesn't even have any way of tracking in what order requests are emitted (it would need a counter field in each request and there is not one), let alone in what order they are emitted per device, under the device it is aiming at. And then of course there is no way at all of telling the underlying devices in what order to treat the requests either - and what about request aggregation? Requests are normally aggregated by the kernel before being sent to devices - ok, I think I recall that RAID turns that off on itself by using its own make_request function, but it doesn't control request aggregation in the sub-devices. And I don't know what happens if you throw the extra resync thread into the mix, but there certainly IS a RAID kernel thread that does nothing else than retry failed requests (and do resyncs?) - which of course will be out of order if ever they are successfully completed by the thread. If we move on to RAID5, then the situation is simply even more complicated because we no longer have to think about when solid, physical, mirrored data is written, but when "virtual" redundant data is written (and read). I'm not even sure what in the kernel in general can possibly guarantee that the sequence write-read-read-write-read can remain ordered that way when an unplug event interrupts the sequence. > it should, and if it isn't, it's a bug and should be > fixed. Nothing else is wrong with placing journal into raid It's been that way forever. > (the same as the filesystem in question). What's wrong is that the journal will be mirrored (if it's a mirror). That means that (1) its data will be written twice, which is a big deal since ALL the i/o goes through the journal first, and (2) the journal is likely to be inconsistent (since it is so active) if you get one of those creeping invisible RAID corruptions that can crop up inevitably in RAID normal use. > Suggesting to remove > journal just isn't fair: the journal is here for a reason. Well, I'd remove it: presumably his aim is to reduce fsck times after a crash. But consider - if he has had a crash, it is likely that his data is corrupted, so he WANTS to check. All that a journal does is guarantee consistency of a FS, not correctness. Personally, I prefer to see the incorrectness. 
If you don't want to check the filesystem you can always just choose to not run fsck! And in this case the journal is a significant extra risk factor, because it is ON the falied medium, and on the part that is most active, moreover! All you have to do to make things safer is take the journal OFF the raid array. You immediately remove the potential for corruption IN the journal (I believe that's what he has seen anyway - damage to the disk under the journal), which is where we have deduced by the above argument that the major source of likely corruptions must lie. There's also no good sense in data-journalling, but I don't think reiserfs does that anyway (it didn't use to, I know - ext3 was the first to do data journalling, although even that's a misnomer, since you try writing a 4GB file as an atomic operation ...). Journals do no magic. You have to consider if they introduce more benefits than dangers. > And, finally, the kernel should not crash. Well, I'm afraid that like everyone else it is dependent on hardware and authors, both of which are fallible! > If something like > this is unsupported, it should refuse to do so, instead of > crashing randomly. ??? Morality is so comforting :-). Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
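As a practical footnote to Peter's remark about simply choosing not to run fsck: that choice is made in /etc/fstab and with tune2fs, not by removing the journal. A sketch with example device names and paths:

  # a sixth fstab field of 0 tells the boot scripts never to fsck this filesystem
  /dev/md0   /data   ext3   defaults   0 0
  # disable the periodic mount-count and time-based forced checks
  tune2fs -c 0 -i 0 /dev/md0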
* ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2004-12-30 21:39 ` Peter T. Breuer @ 2005-01-02 19:42 ` Andy Smith 2005-01-02 20:18 ` Peter T. Breuer 2005-01-05 9:56 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Andy Smith 0 siblings, 2 replies; 172+ messages in thread From: Andy Smith @ 2005-01-02 19:42 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 843 bytes --] On Thu, Dec 30, 2004 at 10:39:42PM +0100, Peter T. Breuer wrote: > In gmane.linux.raid Michael Tokarev <mjt@tls.msk.ru> wrote: > > Peter T. Breuer wrote: > > > In gmane.linux.raid Georg C. F. Greve <greve@fsfeurope.org> wrote: > > > > > > Yes, well, don't put the journal on the raid partition. Put it > > > elsewhere (anyway, journalling and raid do not mix, as write ordering > > > is not - deliberately - preserved in raid, as far as I can tell). > > > > This is a sort of a nonsense, really. Both claims, it seems. > > It's perfectly correct, as far as I know! Not really wishing to get into the middle of a flame war, but I didn't really see how this could be true so I asked for more info on ext3-users. I got the following response: https://listman.redhat.com/archives/ext3-users/2005-January/msg00003.html [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-02 19:42 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Andy Smith @ 2005-01-02 20:18 ` Peter T. Breuer 2005-01-03 0:30 ` Andy Smith 2005-01-05 9:56 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Andy Smith 1 sibling, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-02 20:18 UTC (permalink / raw) To: linux-raid Andy Smith <andy@strugglers.net> wrote: > [-- text/plain, encoding quoted-printable, charset: us-ascii, 22 lines --] > > On Thu, Dec 30, 2004 at 10:39:42PM +0100, Peter T. Breuer wrote: > > In gmane.linux.raid Michael Tokarev <mjt@tls.msk.ru> wrote: > > > Peter T. Breuer wrote: > > > > In gmane.linux.raid Georg C. F. Greve <greve@fsfeurope.org> wrote: > > > > > > > > Yes, well, don't put the journal on the raid partition. Put it > > > > elsewhere (anyway, journalling and raid do not mix, as write ordering > > > > is not - deliberately - preserved in raid, as far as I can tell). > > > > > > This is a sort of a nonsense, really. Both claims, it seems. > > > > It's perfectly correct, as far as I know! > > Not really wishing to get into the middle of a flame war, but I > didn't really see how this could be true so I asked for more info on > ext3-users. > > I got the following response: > > https://listman.redhat.com/archives/ext3-users/2005-January/msg00003.html Interesting - I'll post it (there is no flame war): > * From: "Stephen C. Tweedie" <sct redhat com> > * To: Andy Smith <andy lug org uk> > * Cc: Stephen Tweedie <sct redhat com>, ext3 users list <ext3-users > * redhat com> > * Subject: Re: ext3 journal on software raid > * Date: Sat, 01 Jan 2005 22:19:23 +0000 (snip) > Disks and IO subsystems in general don't preserve IO ordering. This is true. > ext3 is > designed not to care. That is surprising - write-order preservation is precisely the condition that reiserfs requires for correct journal behaviour, and Hans Reiser tld be so himself (sometime, some reiserfs mailing list, at the time, etc). It would be surprising if Stephen managed to do without it, but his condition is definitely weaker. He merely requires to be _told_ (synchronously) when each i/o has ended, in the order that it ends, I think is what he says below. I'm not sure that raid can guarrantee precisely that either. There might be minor disorderings on a SMP system with preemption, if for example, one request is handled on one cpu and another on another, and the acks are handled crosswise. There might be a small temporal displacement. I haven't thought about it. What I can say is that it makes no _attempt_ to respect that condition. Whether it does or not I cannot exactly say. > As long as the raid device tells the truth about > when the data is actually committed to disk (all of the mirror volumes > are uptodate) for a given IO, ext3 should be quite happy. Uuff .. as I said, it is not quite clear to me that this (very weak) condition is absolutely respected. Umm ... no, endio for the whole request is sent back AFTER the mirror i/os have completed, but exactly WHEN after is indeterminate on a preemptive (SMP) system. The mirrors might have been factually updated for two requests in temporal order A B, but might report endio in order B A. However, I think that he probably is calling A then B in a single thread, which means that B won't even be generated until A is acked. 
OK - I think Stephen is probably saying that the ack must be sent back AFTER the status of the writes on the mirror disks is known. Yes, that is guarranteed (unless you apply an async raid patch ...). > > What's wrong is that the journal will be mirrored (if it's a mirror). > > That means that (1) its data will be written twice, which is a big deal > > since ALL the i/o goes through the journal first > > Not true; by default, only metadata goes through the journal, not data. He is saying that data is not journalled by default on ext3. I don't see that as a comment about raid, and inasmuch as it means anything it means that his comment "not true" is about as close to a rather strange (political?) untruth as you can get in CS, since all the journal's data WILL be written twice - it's up to you how much that is. Whether you pass the data through the journal or not, all the data you choose to pass will be written twice, be it zero, some, or all. > > > and (2) the journal > > is likely to be inconsistent (since it is so active) if you get one of > > those creeping invisible RAID corruptions that can crop up inevitably > > in RAID normal use. > > Umm, if soft raid is expected to have silent invisible corruptions in > normal use, It is, just as is all types of RAID. This is a very strange thing for Stephen to say - I cannot believe that he is as naive as he makes himself out to be about RAID here and I don't know why he should say that (presuming that he really knows better). > then you shouldn't be using it, period. That's got zero to > do with journaling. It implies that one should not be doing journalling on top of it. (The logic for why RAID corrupts silently is that errors accumulate at n times the normal rate per sector, but none of them are detected by RAID (no crc), and when a disk drops out then you get a good chance of picking up a corrupted copy instead of a good copy, because nobody has checked the copy meanwhiles to see if it matches the original). Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
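The default Stephen refers to can be inspected and overridden from user space; ext3 exposes the journalling mode as a mount option. The device and mount point below are examples:

  # show the journal fields of the superblock (journal inode, size, backup)
  dumpe2fs -h /dev/md0 | grep -i journal
  # pick the mode explicitly: data=ordered is the default (metadata-only journalling,
  # with data blocks flushed before the commit record), data=writeback is weaker,
  # and data=journal passes file data through the journal as well
  mount -o data=journal /dev/md0 /data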
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-02 20:18 ` Peter T. Breuer @ 2005-01-03 0:30 ` Andy Smith 2005-01-03 6:41 ` Neil Brown 2005-01-03 8:03 ` Peter T. Breuer 0 siblings, 2 replies; 172+ messages in thread From: Andy Smith @ 2005-01-03 0:30 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 2117 bytes --] On Sun, Jan 02, 2005 at 09:18:13PM +0100, Peter T. Breuer wrote: > Stephen Tweedie wrote: > > Umm, if soft raid is expected to have silent invisible corruptions in > > normal use, > > It is, just as is all types of RAID. This is a very strange thing for > Stephen to say - I cannot believe that he is as naive as he makes > himself out to be about RAID here and I don't know why he should say > that (presuming that he really knows better). > > > then you shouldn't be using it, period. That's got zero to > > do with journaling. > > It implies that one should not be doing journalling on top of it. > > (The logic for why RAID corrupts silently is that errors accumulate at > n times the normal rate per sector, but none of them are detected by > RAID (no crc), and when a disk drops out then you get a good chance of > picking up a corrupted copy instead of a good copy, because nobody > has checked the copy meanwhiles to see if it matches the original). I have no idea which of you to believe now. :( I currently only have one system using software raid, and several of my employer's machines using hardware raid, all of which have various raid-1, -5 and -10 setups and all use only ext3. Let's focus on the personal machine of mine for now since it uses Linux software RAID and therefore on-topic here. It has /boot on a small RAID-1, and the rest of the system is on RAID-5 with an additional RAID-0 just for temporary things. There is nowhere that is not software RAID to put the journals, so would you be recommending that I turn off journalling and basically use it as ext2? What I do know is that none of what you say is in the software raid howto, and if you are right, it really should be. Neither is it in any ext3 documentation and there is no warning on any distribution installer I have ever used (those that understand RAID and LVM and are happy to set that up at install time with ext3). Also everyone that I have spoken to about this knows nothing about it, so what you are saying, if correct, would seem to have far-reaching implications. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 0:30 ` Andy Smith @ 2005-01-03 6:41 ` Neil Brown 2005-01-03 8:37 ` Peter T. Breuer 2005-01-03 8:03 ` Peter T. Breuer 1 sibling, 1 reply; 172+ messages in thread From: Neil Brown @ 2005-01-03 6:41 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid On Monday January 3, andy@strugglers.net wrote: > > I have no idea which of you to believe now. :( Well, how about I wade in..... (almost*) No block storage device will guarantee that write ordering is maintained. Neither will read requests necessarily be ordered. Any SCSI, IDE, or similar disc drive in Linux (or any other non-toy OS) will have requests managed by an "elevator algorithm" which coalesces adjacent blocks and tries to re-order requests to make optimal use of the device. A RAID controller, whether software, firmware, or hardware, will also re-order requests to make best use of the devices. Any filesystem that assumes that requests will not be re-ordered is broken, as the assumption is wrong. I would be *very* surprised if Reiserfs makes this assumption. Until relatively recently, the only assumption that could be made is that a write request will be handled sometime between when it is made, and when the request completes (i.e. the end_io callback is called). If several requests are concurrent they could commit in any order. With only this guarantee, the simplest approach for a journalling filesystem is to write the content of a journal entry, wait for the writes to complete, and then write a single block "header" which describes and hence commits that journal entry. The journal entry is not "safe" until this second write completes. This is equally applicable for IDE drives, SCSI drives, software RAID1, software RAID5, hardware RAID etc. More recently (2.6 only) Linux has had support for "write barriers". The idea here is that you submit a number of write requests, then a "barrier", then some more write requests. (The "barrier" might be a flag on the last request of a list, I'm not sure of that detail). The meaning is that no write request submitted after the barrier will be attempted until all requests submitted before the barrier are complete. Some drives support this concept natively so Linux simply does not re-order requests across a barrier, and sends the barrier at the appropriate time. Drives can do their own re-ordering but will not reorder across a barrier (if they support the barrier concept). If Linux needs to write a barrier to a device that doesn't support barriers (as the md/raid currently doesn't) it will (should) submit all requests before the barrier, flush them out, wait for them to complete, then allow other requests to be forwarded. In short, md/raid provides the same guarantees as normal drives, and any filesystem that expects more is broken. Definitely put your journal on RAID with at least as much redundancy as your main filesystem (I put my filesystem on raid5 and my journal on raid1). NeilBrown * I happen to know that the "umem" NVRAM driver will never re-order requests, as there is no value in re-ordering requests to RAM. But it is the exception, not the rule. ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 6:41 ` Neil Brown @ 2005-01-03 8:37 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-03 8:37 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > Well, how about I wade in..... Sure! :-) > A RAID controller, whether software, firmware, or hardware, will also > re-order requests to make best use of the devices. Possibly. I have written block device drivers that maintain write order, however (or at least do so if you ask them to, with the right switch), because ... > Any filesystem that assumes that requests will not be re-ordered is > broken, as the assumption is wrong. > I would be *very* surprised if Reiserfs makes this assumption. .. because that is EXACTLY what Hans Reiser has said to me. I don't think I've kept the mail, but I remember it. a quick google for reiserfs + write ordering shows up some suggestive quotes: > We cannot use the buffer.c dirty list anyway because bdflush can write > those buffers to disk at any time. Transactions have to control the > write ordering ... (hey, that was Hans quoting Stephen). From the Linux High Availability website (http://linuxha.trick.ca/DataRedundancyByDrbd): Since later WRITE requests might depend on successful finished previous ones, this is needed to assure strict write ordering on both nodes. ... Well, I'm not going to search now. Onecan simply ask HR and find out what the current status is vis a vis reiserfs. To be certain what I am talking about, I'll define write ordering as: Writes are not reordered and reads may not be reordered beyond the writes that bound them either side. > Until relatively recently, the only assumption that could be made is > that a write request will be handled sometime between when it is made, > and when the request completes (i.e. the end_io callback is called). This _appears_ to be what Stephen is saying he needs, from which I deduce that he probably has a single-threaded implementation in ext3, because: > If several requests are concurrent they could commit in any order. Yes. > With only this guarantee, the simplest approach for a journalling > filesystem is to write the content of a journal entry, wait for the > writes to complete, and then write a single block "header" which > describes and hence commits that journal entry. The journal entry is > not "safe" until this second write completes. > > This is equally applicable for IDE drives, SCSI drives, software > RAID1, software RAID5, hardware RAID etc. I would agree. No matter what the hardware people say, or what claims are made for equipment, I don't see how there can be any way of knowing the order in which writes are committed internally. I only hope that reads are not reordered beyond writes :-). (this would cause you to read the wrong data: you go W1 R1 W2 and yet read in R1 what you wrote in W2!) > More recently (2.6 only) Linux has had support for "write barriers". > The idea here is that you submit a number of write requests, then a Well, I seem to recall that at some point "request specials" were to act as request barriers, but I don't know if that is still the case. When a driver received one it had to flush all outstanding requests before acking the special. Perhaps Linus gave up on people implementing that and put the support in the kernel core, so as to enforce it? It would be possible. Or maybe he dropped it. > "barrier", then some more write requests. 
(The "barrier" might be a > flag on the last request of a list, I'm not sure of that detail). The > meaning is that no write request submitted after the barrier will be > attempted until all requests submitted before the barrier are > complete. Some drives support this concept natively so Linux simply > does not re-order requests across a barrier, and sends the barrier at > the appropriate time. Drives can do their own re-ordering but will > not reorder across a barrier (if they support the barrier concept). Yes. > If Linux needs to write a barrier to a device that doesn't support > barriers (as the md/raid currently doesn't) it will (should) submit > all requests before the barrier, flush them out, wait for them to > complete, then allow other requests to be forwarded. But I at least don't know if "Linux" does that :-). By "Linux" you either mean some part of the block subsystem, or fs's acting on their own. > In short, md/raid provides the same guarantees as normal drives, and > any filesystem that expects more is broken. Normal drives do not reorder writes. Their drivers also probably make no attempt to do so, nor not to do so, but in the nature of things (single input, single output) it is unlikely that they do. Software RAID on the other hand is fundamentally parallel so the intrinsic liklihood that something somewhere gets reordered is much higher, and I believe you agree with me that no attempt is made to either check on or prevent it. > Definitely put your journal on RAID with at least as much redundancy > as your main filesystem (I put my filesystem on raid5 and my journal > on raid1). :-) But I don't think you ought to put the journal on raid - what good does it do you to do so? (apart from testing out raid :). After all, the journal integrity is not itself guaranteed by a journal, and it is a point of failure for the whole system, and it is a point where you have doubled i/o density over and above the normal journal rate, which is already extremely high if you do data journalling, since ALL the data on the system flows through that point first. So you will stress the disk there as well as making all your data vulnerable to anything that happens there. What extra benefit do you get from putting it there that is not balanced by greater risks? I'm curious! Surely raid is about "spreading your eggs out through several baskets"? Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 0:30 ` Andy Smith 2005-01-03 6:41 ` Neil Brown @ 2005-01-03 8:03 ` Peter T. Breuer 2005-01-03 8:58 ` Guy ` (2 more replies) 1 sibling, 3 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-03 8:03 UTC (permalink / raw) To: linux-raid Andy Smith <andy@strugglers.net> wrote: > [-- text/plain, encoding quoted-printable, charset: us-ascii, 45 lines --] > > On Sun, Jan 02, 2005 at 09:18:13PM +0100, Peter T. Breuer wrote: > > Stephen Tweedie wrote: > > > Umm, if soft raid is expected to have silent invisible corruptions in > > > normal use, > > > > It is, just as is all types of RAID. This is a very strange thing for > > Stephen to say - I cannot believe that he is as naive as he makes > > himself out to be about RAID here and I don't know why he should say > > that (presuming that he really knows better). > > > > > then you shouldn't be using it, period. That's got zero to > > > do with journaling. > > > > It implies that one should not be doing journalling on top of it. > > > > (The logic for why RAID corrupts silently is that errors accumulate at > > n times the normal rate per sector, but none of them are detected by > > RAID (no crc), and when a disk drops out then you get a good chance of > > picking up a corrupted copy instead of a good copy, because nobody > > has checked the copy meanwhiles to see if it matches the original). > > I have no idea which of you to believe now. :( Both of us. We have not disagreed fundamentally. Read closely! Stephen says "IF (my caps) soft raid is expected to have ...". Well, it is, just like any RAID. Similarly he didn't disagree that journal data is written twice, if you read closely, he merely pointed out that the DEFAULT (my caps) setting in ext3 is not to write data (as opposed to metadata) into the journal at all. So he avoided issues of substance there (and/but gave a strange spin to them). What he did claim that is factually interesting and new is that ext3 works if acks from the media are merely received after the fact. That's a far weaker requirement than for reiserfs, for example. It seems to me to imply that the implementation is single-threaded and highly synchronous. > I currently only have one system using software raid, and several of > my employer's machines using hardware raid, all of which have > various raid-1, -5 and -10 setups and all use only ext3. All fine - as I said, the only thing I'd do is make sure that the journal is not kept on the raid partition(s), and possibly turn off data journalling in favour of metadata journalling only. > Let's focus on the personal machine of mine for now since it uses > Linux software RAID and therefore on-topic here. It has /boot on a > small RAID-1, This is always a VERY bad idea. /boot and /root want to be on as simple and uncomplicated a system as possible. Moreover, they never change, so what is the point of having a real time mirror for them? It would be sufficient to copy them every day (which is what I do) at file system level to another partition, if you want a spare copy for emergencies. > and the rest of the system is on RAID-5 with an > additional RAID-0 just for temporary things. That's fine. > There is nowhere that is not software RAID to put the journals, so Well, you can make somewhere. You only require an 8MB (one cylinder) partition. > would you be recommending that I turn off journalling and basically > use it as ext2? 
No, I'd be recommending that you make an 8MB partition for a journal. This is also handy in case you "wear through" the disk under the journal because of the high i/o there. Well, it's happened to me on two disks, but doubtless people will cntest that! IF it happens, all you have to do is use a cylinder somewhere else. > What I do know is that none of what you say is in the software raid > howto, But nothing said is other than obvious, and is a matter of probabilities and risk management, so I don't see why it should be in a howto! That's your business, not the howto's. > and if you are right, it really should be. Neither is it in I don't think it should be. It should be somewhere in ext3 docs (there was a time when ext3 wouldn't work on raid1 at all, butthat got fixed somehow), but then documenting how your FS works on some particular media is not really part of the documentation scope for the FS! > any ext3 documentation and there is no warning on any distribution > installer I have ever used (those that understand RAID and LVM and > are happy to set that up at install time with ext3). Also everyone > that I have spoken to about this knows nothing about it, so what you Everyone knows about it, because none of us is saying anything that is not obvious. Yes, data is written through the journal twice. EVERYTHING is written through the journal twice if the journal is on RAID1, because everything on RAID1 is written twice. That is obvious, no? And then you get i/o storms through the journal in any case on journalled raid, whenever you do data journalling. It is just doubled if you do that on a raid system. And there is a risk of silent corruption on all raid systems - that is well known. DIfferent raid systems do different thigs to compensate, such as periodically recalculating the parity on everything. But when you have redundant data and a corruption occurs, which of the two datasets do you believe? You have to choose one of them! You guess wrong half the time, if you guess ("you" is a raid system). Hence "silent corruption". > are saying, if correct, would seem to have far-reaching > implications. I don't think so! Why? RAID protects you against certain sorts of risk. It also exposes you to other sorts of risk. Where is the far-reaching implication in that? It is for you to balance the risks and benefits. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
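If one follows Peter's suggestion instead, an existing ext3 filesystem on an md array can be switched to an external journal on a small plain partition with e2fsprogs alone. The device names below (/dev/md0 for the filesystem, /dev/hda9 for the spare cylinder) are hypothetical; the filesystem must be unmounted, and the journal device must use the filesystem's block size:

  mke2fs -b 4096 -O journal_dev /dev/hda9    # turn the small partition into a journal device
  umount /dev/md0
  tune2fs -O ^has_journal /dev/md0           # detach the internal journal
  tune2fs -j -J device=/dev/hda9 /dev/md0    # attach the external one
  mount /dev/md0 /data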
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
2005-01-03 8:03 ` Peter T. Breuer
@ 2005-01-03 8:58 ` Guy
0 siblings, 0 replies; 172+ messages in thread
From: Guy @ 2005-01-03 8:58 UTC (permalink / raw)
To: 'Peter T. Breuer', linux-raid
"Well, you can make somewhere. You only require an 8MB (one cylinder)
partition."
So, it is ok for your system to fail when this disk fails? I don't want
system failures when a disk fails, so mirror (or RAID5) everything
required to keep your system running.
"And there is a risk of silent corruption on all raid systems - that is
well known."
I question this.... I bet a non-mirror disk has similar risk as a RAID1.
But with a RAID1, you know when a difference occurs, if you want.
Guy
^ permalink raw reply [flat|nested] 172+ messages in thread
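Guy's "if you want" can be done by hand, if crudely: with the array stopped or mounted read-only, the data areas of two RAID1 members can be compared directly. Device names and the amount compared are examples; the md superblock in the last 64-128 KiB of each member always differs, so the comparison has to stop short of it:

  # compare the first 4000 MiB of both mirror halves (adjust count to the member size)
  cmp <(dd if=/dev/hda2 bs=1M count=4000 2>/dev/null) \
      <(dd if=/dev/hdc2 bs=1M count=4000 2>/dev/null) \
    && echo "mirror halves identical over the range checked"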
* Partiy error detection - was Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
  2005-01-03 8:03 ` Peter T. Breuer
  2005-01-03 8:58 ` Guy
@ 2005-01-03 10:18 ` Brad Campbell
  2005-01-03 12:11 ` Michael Tokarev
  2 siblings, 0 replies; 172+ messages in thread
From: Brad Campbell @ 2005-01-03 10:18 UTC (permalink / raw)
To: Peter T. Breuer; +Cc: linux-raid

Peter T. Breuer wrote:
> And there is a risk of silent corruption on all raid systems - that is
> well known. Different raid systems do different things to compensate,
> such as periodically recalculating the parity on everything. But when
> you have redundant data and a corruption occurs, which of the two
> datasets do you believe? You have to choose one of them! You guess
> wrong half the time, if you guess ("you" is a raid system). Hence
> "silent corruption".

Just on this point. With RAID-6 I guess you can get a majority-rules type of ruling on which data block is the dud. Unless you come up with 3 different results, in which case something is really not right in lego land.

I wonder about a userspace app that you can run to check all the parity blocks. On RAID-5 it should be able to tell you that you have a naff stripe, but on RAID-6, in theory, if it's only a single drive that is the problem you should be able to correct the block.

Regards,
Brad
^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 8:03 ` Peter T. Breuer 2005-01-03 8:58 ` Guy 2005-01-03 10:18 ` Partiy error detection - was " Brad Campbell @ 2005-01-03 12:11 ` Michael Tokarev 2005-01-03 14:23 ` Peter T. Breuer 2 siblings, 1 reply; 172+ messages in thread From: Michael Tokarev @ 2005-01-03 12:11 UTC (permalink / raw) To: linux-raid Peter T. Breuer wrote: [] >>Let's focus on the personal machine of mine for now since it uses >>Linux software RAID and therefore on-topic here. It has /boot on a >>small RAID-1, > > This is always a VERY bad idea. /boot and /root want to be on as simple > and uncomplicated a system as possible. Moreover, they never change, so > what is the point of having a real time mirror for them? It would be > sufficient to copy them every day (which is what I do) at file system > level to another partition, if you want a spare copy for emergencies. Raid1 (mirror) is the most "trivial" raid level out there, especially having in mind that the underlying devices -- all of them -- contains (or should, in theory -- modulo the "50% chance of any difference being unnoticied" etc) exact copy of the filesystem. Also, root (and /boot -- i for one have both /boot in root in a single small filesystem) do change -- not that often but often enouth so that "newaliases problem" (when you "forgot" to backup it after a change) happens from time to time. After several years of expirience with alot of systems (and alot of various disk failure scenarios too: when you have many systems, you have good chances to see a failure ;), I now use very simple and (so far) reliable approach, which I explained here on this list before. You have several (we use 2, 3 or 4) disks which are the same (or almost: eg some 36Gb disks are really 35Gb or 37Gb; in case they're differ, "extra" space on large disk isn't used); root and /boot are on small raid1 partition which is mirrored on *every* disk; swap is on raid1; the rest (/usr, /home, /var etc) are on raid5 arrays (maybe also raid0 for some "scratch" space). This way, you have "equal" drives, and *any* drive, including boot one, may fail at any time and the system will continue working as if all where working, including reboot (except of a (very rare in fact) failure scenario when your boot disk has failed MBR or other sectors required to boot, but "the rest" of that disk is working, in which case you'll need physical presence to bring the machine up). All the drives are "symmetrical", usage patterns for all drives are the same, and due to usage of raid arrays, load is spread among them quite nicely. You're free to reorder the drives in any way you want, to replace any of them (maybe rearranging the rest if you're replacing the boot drive) and so on. Yes, root fs does not changes often, and yes it is small enouth (I use 1Gb, or 512Mb, or even 256Mb for root fs - not a big deal to allocate that space on every of 2 or 3 or 4 or 5 disks). So it isn't quite relevant how fast the filesystem will be on writes, and hence it's ok to place it on raid1 composed from 5 components. The stuff just works, it is very simple to administer/support, and does all the "backups" automatically. In case of some problem (yes I dislike any additional layers for critical system components as any layer may fail to start during boot etc), you can easily bring the system up by booting off the underlying root-raid partiton to repair the system -- all the utilities are here. 
More, you can boot from one disk (without raid) and try to repair root fs on another drive (if things are really screwed up), and when you're done, bring the raid up on that repaired partition and add other drives to the array. To summarize: having /boot and root on raid1 is a very *good* idea. ;) It saved our data alot of times in the past few years already. If you're worried about "silent data corruption" due to different data being read from different components of the raid array.. Well, first of all, we never saw that yet (we have quite good "testcase") (and no, I'm not saying it's impossible ofcourse). On rarely-changed filesystem, with real drives which does no silent remapping of an undeadable blocks to new place with the data on them becoming all-0s, without drives with uncontrollable write caching (quite common for IDE drives) and things like that, and with real memory (ECC I mean), where you *know* what you're writing to each disk (yes, there's also another possible cause of a problem: software errors aka bugs ;), that case with different data on different drives becomes quite.. rare. In order to be really sure, one can mount -o remount,ro / and just compare all components of the root raid, periodically. When there's more than 2 components on that array, it should be easy to determine which drive is "lying" in case of any difference. I do similar procedure on my systems during boot. >>There is nowhere that is not software RAID to put the journals, so > > Well, you can make somewhere. You only require an 8MB (one cylinder) > partition. Note scsi disks in linux only supports up to 14 partitions, which isn't sometimes sufficient even without additional partitions for journal. When you have large amount of disks (so having that "fully-symmetrical" layout as I described above becomes impractical), you can use one set of drives for data and another set of drives for journal for that data. When you only have 4 (or less) drives... And yes I'm aware of mdp devices (partitions inside the raid arrays).. but that's just another layer "which may fail": if raid5 array won't start, I at least can reconstruct filesystem image by reading chunks of data from appropriate places from all drives and try to recover that image; with any additional structure inside the array (and the lack of "loopP" aka partitioned loop devices) it becomes more and more tricky to recover any data (from this point of view, raid1 is the niciest raid level ;) Again: instead of using a partition for the journal, use (another?) raid array. This way, the system will work if the drive wich contains the journal fails. Note above about swap: in all my systems, swap is also on raid (raid1 in this case). At the first look, that can be a nonsense: having swap on raid. But we had enouth cases when due to a failed drive swap becomes corrupt (unreadable really), and the system goes havoc, *damaging* other data which was unaffected by the disk failure! With swap on raid1, the system continues working if any drive fails, which is good. (Older kernels, esp. 2.2.* series, had several probs with swap on raid, but that has been fixed now; there where other bugs fixed too (incl. bugs in ext3fs) so there should be no such damage to other data due to unreadable swap.. hopefully. But I can't trust my systems anymore after seeing (2 times in 4 years) what can happen with the data...) [] And I also want to "re-reply" to the first your message in this thread, where I was saying that "it's a nonsense that raid does not preserve write ordering". 
Of course I meant not write ordering but working write barriers (as Neil pointed out, the md subsystem does not implement write barriers directly but the concept is "emulated" by the linux block subsystem). Write barriers should be sufficient to implement journalling safely. /mjt ^ permalink raw reply [flat|nested] 172+ messages in thread
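A minimal sketch of the component-comparison check Michael describes above, assuming a root raid1 /dev/md1 built from sda1/sdb1/sdc1 with 0.90 superblocks (which sit at the end of each component, so only the filesystem area is compared); all names here are hypothetical examples:

  mount -o remount,ro /
  BS=$(dumpe2fs -h /dev/md1 2>/dev/null | awk '/^Block size:/ {print $3}')
  BLOCKS=$(dumpe2fs -h /dev/md1 2>/dev/null | awk '/^Block count:/ {print $3}')
  for dev in /dev/sda1 /dev/sdb1 /dev/sdc1; do
      printf '%s  ' "$dev"
      # checksum only the filesystem area; the md superblock at the end of
      # each component legitimately differs (per-device fields)
      dd if=$dev bs=$BS count=$BLOCKS 2>/dev/null | md5sum
  done
  mount -o remount,rw /

With three components, two matching checksums against one odd one out points at the "lying" drive.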
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 12:11 ` Michael Tokarev @ 2005-01-03 14:23 ` Peter T. Breuer 2005-01-03 18:30 ` maarten ` (2 more replies) 0 siblings, 3 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-03 14:23 UTC (permalink / raw) To: linux-raid Michael Tokarev <mjt@tls.msk.ru> wrote: > Peter T. Breuer wrote: > > This is always a VERY bad idea. /boot and /root want to be on as simple > > and uncomplicated a system as possible. Moreover, they never change, so > > what is the point of having a real time mirror for them? It would be > > sufficient to copy them every day (which is what I do) at file system > > level to another partition, if you want a spare copy for emergencies. > > Raid1 (mirror) is the most "trivial" raid level out there, especially Hi! > having in mind that the underlying devices -- all of them -- contains > (or should, in theory -- modulo the "50% chance of any difference > being unnoticied" etc) exact copy of the filesystem. Also, root (and > /boot -- i for one have both /boot in root in a single small filesystem) > do change -- not that often but often enouth so that "newaliases problem" > (when you "forgot" to backup it after a change) happens from time to time. Well, my experience is that anything "unusual" is bad: sysadmins change over the years; the guy who services the system may not be the one that built it; the "rescue" cd or floppy he has may not have MD support built into the kernel (and he probably will need a rescue cd just to get support for a raid card, if the machine has hardware raid as well as or instead of software raid). Therefore, I have learned not to build a system that is more complicated than the most simple human being that may administer it. This always works - if it breaks AND they cannot fix it, then THEY get the blame. So I "prefer" to not have a raided boot partition, but instead to rsync the root partition every day to a spare on a differet disk, or/and at the other end of the same disk. This also saves the system from sysadmin gaffes - I don't WANT an instantaneous copy of every error made by the humans. This is not to say that I do not like your ideas, expressed here. I do. I even agree with them. It is just that when they mess up the root partition, I can point to the bootloader entry that says "boot from spare root partition". And let's not get into what they can do to the labelling on the partition types - FD? Must be a mistake! > After several years of expirience with alot of systems (and alot of various > disk failure scenarios too: when you have many systems, you have good > chances to see a failure ;), I now use very simple and (so far) reliable > approach, which I explained here on this list before. You have several > (we use 2, 3 or 4) disks which are the same (or almost: eg some 36Gb Well, whenever I buy anything, I buy two. I buy two _controller_ cards, and tape the extra one inside the case. But of course I buy two machines, so that is four cards ... . And I betcha softraid sb has changed format ver the years. I am still running P100s! > disks are really 35Gb or 37Gb; in case they're differ, "extra" space > on large disk isn't used); root and /boot are on small raid1 partition > which is mirrored on *every* disk; swap is on raid1; the rest (/usr, I like this - except of course that I rsync them, not raid them. I don't mind if I have to reboot a server. 
Nobody will notice the tcp outage and the other one of the pair will failover for it, albeit in readonly mode, for the maximum of the few minutes required. Your swap idea is crazy, but crazy enough to be useful. YES, there used to be a swap bug which corrupted swap every so often (in 2.0? 2.2?) and meant one had to swapoff and swapon again, having first cleared all processes by an init 1 and back. Obviously that bug would bite whatever you had as media, but it still is a nice idea to have raided memory :-). > /home, /var etc) are on raid5 arrays (maybe also raid0 for some "scratch" I don't put /var on raid if I can help it. But there is nothing particularly bad about it. It is just that /var is the most active place and therefore the most likely to suffer damage of some kind, somehow. And damaged raided partitions are really not nice. Raid does not protect you against hardware corruption - on the contrary, it makes it more difficult to spot and doubles the probabilities of it happening. > space). This way, you have "equal" drives, and *any* drive, including > boot one, may fail at any time and the system will continue working > as if all where working, including reboot (except of a (very rare in > fact) failure scenario when your boot disk has failed MBR or other > sectors required to boot, but "the rest" of that disk is working, > in which case you'll need physical presence to bring the machine up). That's actually not so. Over new year I accidently booted my home server (222 days uptime!) and discovered its boot sector had evaporated. Well, maybe I moved the kernels .. anyway, it has no floppy and the nearest boot cd was an hour's journey away in the cold, on new year. Uh uh. It took me about 8 hrs, but I booted it via PXE DHCP TFTP wake-on-lan and the wireless network, from my laptop, without leaving the warm. Next time I may even know how to do it beforehand :). > All the drives are "symmetrical", usage patterns for all drives are > the same, and due to usage of raid arrays, load is spread among them > quite nicely. You're free to reorder the drives in any way you want, > to replace any of them (maybe rearranging the rest if you're > replacing the boot drive) and so on. You can do this hot? How? Oh, you must mean at reboot. > Yes, root fs does not changes often, and yes it is small enouth > (I use 1Gb, or 512Mb, or even 256Mb for root fs - not a big deal Mine are always under 256MB, but I give 512MB. > to allocate that space on every of 2 or 3 or 4 or 5 disks). So > it isn't quite relevant how fast the filesystem will be on writes, > and hence it's ok to place it on raid1 composed from 5 components. That is, uh, paranoid. > The stuff just works, it is very simple to administer/support, > and does all the "backups" automatically. Except that it doesn't - backups are not raid images. Backups are snapshots. Maybe you mean that. > In case of some problem > (yes I dislike any additional layers for critical system components > as any layer may fail to start during boot etc), you can easily > bring the system up by booting off the underlying root-raid partiton > to repair the system -- all the utilities are here. More, you can Well, you could, and I could, but I doubt if the standard tech could. > boot from one disk (without raid) and try to repair root fs on > another drive (if things are really screwed up), and when you're > done, bring the raid up on that repaired partition and add other > drives to the array. But why bother? 
If you didn't have raid there on root you wouldn't need to repair it. Nothing is quite as horrible as having a fubarred root partition. That's why I also always have two! But I don't see that having the copy made by raid rather than rsync wins you anything in the situaton where you have to reboot - rather, it puts off that moment to a moment of your choosing, which may be good, but is not an unqualified bonus, given the cons. > To summarize: having /boot and root on raid1 is a very *good* idea. ;) > It saved our data alot of times in the past few years already. No - it saved you from taking the system down at that moment in time. You could always have rebooted it from a spare root partition whether you had raid there or not. > If you're worried about "silent data corruption" due to different > data being read from different components of the raid array.. Well, > first of all, we never saw that yet (we have quite good "testcase") It's hard to see, and youhave to crash and come back up quite a lot to make it probable. A funky scsi cable would help you see it! > (and no, I'm not saying it's impossible ofcourse). On rarely-changed > filesystem, with real drives which does no silent remapping of an > undeadable blocks to new place with the data on them becoming all-0s, Yes, I agree. On rarely changing systems raid is a benefit, because it enables you to carry on in case the unthinkable happens and one disk vaporizes (while letting the rest of the system carry on, with much luck). On rapidly changing systems like /var I start to get a little uneasy. On /home I am quite happy with it. I wouldn't have it any other way there.. > without drives with uncontrollable write caching (quite common for > IDE drives) and things like that, and with real memory (ECC I mean), > where you *know* what you're writing to each disk (yes, there's also > another possible cause of a problem: software errors aka bugs ;), Indeed, and very frequent they are too. > that case with different data on different drives becomes quite.. > rare. In order to be really sure, one can mount -o remount,ro / > and just compare all components of the root raid, periodically. > When there's more than 2 components on that array, it should be > easy to determine which drive is "lying" in case of any difference. > I do similar procedure on my systems during boot. Well, voting is one possible procedure. I don't know if softraid does that anywhere, or attempts repairs. Neil? > >>There is nowhere that is not software RAID to put the journals, so > > > > Well, you can make somewhere. You only require an 8MB (one cylinder) > > partition. > > Note scsi disks in linux only supports up to 14 partitions, which You can use lvm (device mapper). Admittedly I was thinking of IDE. If you like I can patch scsi for 63 partitions? > isn't sometimes sufficient even without additional partitions for > journal. When you have large amount of disks (so having that > "fully-symmetrical" layout as I described above becomes impractical), > you can use one set of drives for data and another set of drives > for journal for that data. When you only have 4 (or less) drives... > > And yes I'm aware of mdp devices (partitions inside the raid > arrays).. but that's just another layer "which may fail": if > raid5 array won't start, I at least can reconstruct filesystem > image by reading chunks of data from appropriate places from > all drives and try to recover that image; with any additional Now that is just perverse. 
> structure inside the array (and the lack of "loopP" aka partitioned > loop devices) it becomes more and more tricky to recover any > data (from this point of view, raid1 is the niciest raid level ;) Agree. > Again: instead of using a partition for the journal, use (another?) > raid array. This way, the system will work if the drive wich > contains the journal fails. But the journal will also contain corruptions if the whole system crashes, and is rebooted. You just spent several paragraphs (?) arguing so. Do you really want those rolled forward to complete? I would rather they were rolled back! I.e. that the journal were not there - I am in favour of a zero size journal, in other words, which only acts to guarantee atomicity of FS ops (FS code on its own may do that), but which does not contain data. > Note above about swap: in all my > systems, swap is also on raid (raid1 in this case). At the first > look, that can be a nonsense: having swap on raid. But we had > enouth cases when due to a failed drive swap becomes corrupt > (unreadable really), and the system goes havoc, *damaging* > other data which was unaffected by the disk failure! With Yes, this used to be quite common when swap had that size bug. > swap on raid1, the system continues working if any drive > fails, which is good. (Older kernels, esp. 2.2.* series, > had several probs with swap on raid, but that has been fixed > now; there where other bugs fixed too (incl. bugs in ext3fs) > so there should be no such damage to other data due to > unreadable swap.. hopefully. But I can't trust my systems > anymore after seeing (2 times in 4 years) what can happen with > the data...) > > [] > > And I also want to "re-reply" to the first your message in this > thread, where I was saying that "it's a nonsense that raid does > not preserve write ordering". Ofcourse I mean not write ordering > but working write barriers (as Neil pointed out, md subsystem does > not implement write barriers directly but the concept is "emulated" > by linux block subsystem). Write barriers should be sufficient to > implement journalling safely. I am not confident that Neil did say so. I have not reexamined his post, but I got the impression that he hummed and hawed over that. I do not recall that he said that raid implements write barriers - perhaps he did. Anyway, I do not recall any code to handle "special" requests, which USED to be the kernel's barrier mechanism. Has that mechanism changed (it could have!)? What is the write barrier mechanism in the 2.6 series (and what was it in 2.4? I don't recall one at all)? I seem to recall that Neil said instead that raid acks writes only after they have been carried out on all components, which Stephen said was sufficient for ext3. OTOH we do not know if it is sufficient for reiserfs, xfs, jfs, etc. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
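A minimal sketch of the daily rsync-to-a-spare-root approach Peter describes above; the spare partition, mount point and cron entry are hypothetical:

  # e.g. run from cron:  30 4 * * *  /usr/local/sbin/copy-root
  #!/bin/sh
  mount /dev/hdc1 /mnt/spare-root || exit 1
  # -a preserves ownership/permissions/times, -H keeps hard links,
  # -x stays on the root filesystem so /proc, /home etc. are not dragged in
  rsync -aHx --delete / /mnt/spare-root/
  umount /mnt/spare-root

A second bootloader entry pointing at the spare partition then provides the "boot from spare root partition" fallback.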
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 14:23 ` Peter T. Breuer @ 2005-01-03 18:30 ` maarten 2005-01-03 21:36 ` Michael Tokarev 2005-01-05 5:50 ` Debian Sarge mdadm raid 10 assembling at boot problem Roger Ellison 2 siblings, 0 replies; 172+ messages in thread From: maarten @ 2005-01-03 18:30 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 15:23, Peter T. Breuer wrote: > Michael Tokarev <mjt@tls.msk.ru> wrote: > > Peter T. Breuer wrote: > Therefore, I have learned not to build a system that is more complicated > than the most simple human being that may administer it. This always > works - if it breaks AND they cannot fix it, then THEY get the blame. > > So I "prefer" to not have a raided boot partition, but instead to rsync > the root partition every day to a spare on a differet disk, or/and at the > other end of the same disk. This also saves the system from sysadmin > gaffes - I don't WANT an instantaneous copy of every error made by the > humans. There certainly is something to be said for that... However, I do expect an admin to know about the raid system that's being used, else they would have no business being near that server in the first place. > > disks are really 35Gb or 37Gb; in case they're differ, "extra" space > > on large disk isn't used); root and /boot are on small raid1 partition > > which is mirrored on *every* disk; swap is on raid1; the rest (/usr, > > I like this - except of course that I rsync them, not raid them. I > don't mind if I have to reboot a server. Nobody will notice the tcp > outage and the other one of the pair will failover for it, albeit in > readonly mode, for the maximum of the few minutes required. I tend to agree, but it varies widely with the circumstances. I've had servers in unattended colo facilties, and your approach will not work too well there. > That's actually not so. Over new year I accidently booted my home > server (222 days uptime!) and discovered its boot sector had evaporated. We've all been there... :-( > Well, maybe I moved the kernels .. anyway, it has no floppy and the > nearest boot cd was an hour's journey away in the cold, on new year. Uh > uh. It took me about 8 hrs, but I booted it via PXE DHCP TFTP > wake-on-lan and the wireless network, from my laptop, without leaving > the warm. Congrats, but I do hope you did that for your home server...! Cause I'd have severe moral and practical difficulties selling that to a paying customer: "So instead of billing me a cab fare and two hours, you spent eight hours to fix this. And you seriously expect me to pay for those extra hours ?" > > to allocate that space on every of 2 or 3 or 4 or 5 disks). So > > it isn't quite relevant how fast the filesystem will be on writes, > > and hence it's ok to place it on raid1 composed from 5 components. > > That is, uh, paranoid. We also did use three-way raid-1 mirrors as a rule. (but I am indeed somewhat paranoid ;-) > > In case of some problem > > (yes I dislike any additional layers for critical system components > > as any layer may fail to start during boot etc), you can easily > > bring the system up by booting off the underlying root-raid partiton > > to repair the system -- all the utilities are here. More, you can > > Well, you could, and I could, but I doubt if the standard tech could. I've said it before and I'll say it again: An admin has to be competent. If not, there is little you can do. 
You can't have fresh MCSE people fix linux problems, and you cannot have a carpenter selling stock on wall street. A "standard tech" as you say, has a skill level that enables him to swap a drive of a hotswap server if so directed, but anything beyond that is unrealistic, and he will need adequate help (be it remote by telephone, or whatever means). Or very extensive onsite step by step documentation. > But why bother? If you didn't have raid there on root you wouldn't > need to repair it. Nothing is quite as horrible as having a > fubarred root partition. That's why I also always have two! But I > don't see that having the copy made by raid rather than rsync wins > you anything in the situaton where you have to reboot - rather, it > puts off that moment to a moment of your choosing, which may be good, > but is not an unqualified bonus, given the cons. Both approaches have their merits. In one case the danger lies in not having updated the rsync mirror recently enough, in the other a rogue change will affect all your mirrors. Without further info on the specific circumstances no choice can be made, it really depends on too many factors. > > And yes I'm aware of mdp devices (partitions inside the raid > > arrays).. but that's just another layer "which may fail": if > > raid5 array won't start, I at least can reconstruct filesystem > > image by reading chunks of data from appropriate places from > > all drives and try to recover that image; with any additional > > Now that is just perverse. Not neccessarily. I've had to rely on using dd_rescue to get data back at some point is time. In such scenarios, any additional layer can quickly complicate things beyond reasonable recourse. As you noted yourself, keeping a backup stategy can be hard work. ;-| > > Note above about swap: in all my > > systems, swap is also on raid (raid1 in this case). At the first > > look, that can be a nonsense: having swap on raid. But we had > > enouth cases when due to a failed drive swap becomes corrupt > > (unreadable really), and the system goes havoc, *damaging* > > other data which was unaffected by the disk failure! With > > Yes, this used to be quite common when swap had that size bug. When you have swap on a failed disk, often the safer way is to stop the machine by using the reset button instead of attempting a shutdown. The shutdown would probably fail halfway through anyway... Maarten ^ permalink raw reply [flat|nested] 172+ messages in thread
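The dd_rescue approach mentioned above, as a minimal sketch (device and target paths are hypothetical, and option spellings differ between dd_rescue and GNU ddrescue): image the sick component first, then do all recovery attempts against the copy:

  # keep going over read errors, log progress, then work from the image
  dd_rescue -v -l /var/log/sdb1-rescue.log /dev/sdb1 /mnt/scratch/sdb1.img
  losetup /dev/loop0 /mnt/scratch/sdb1.img
  fsck -n /dev/loop0     # read-only check of the rescued image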
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 14:23 ` Peter T. Breuer 2005-01-03 18:30 ` maarten @ 2005-01-03 21:36 ` Michael Tokarev 2005-01-05 5:50 ` Debian Sarge mdadm raid 10 assembling at boot problem Roger Ellison 2 siblings, 0 replies; 172+ messages in thread From: Michael Tokarev @ 2005-01-03 21:36 UTC (permalink / raw) To: linux-raid Peter T. Breuer wrote: > Michael Tokarev <mjt@tls.msk.ru> wrote: > >>Peter T. Breuer wrote: >> >>>This is always a VERY bad idea. /boot and /root want to be on as simple >>>and uncomplicated a system as possible.... [] > Well, my experience is that anything "unusual" is bad: sysadmins change > over the years; the guy who services the system may not be the one that > built it; the "rescue" cd or floppy he has may not have MD support > built into the kernel (and he probably will need a rescue cd just to get > support for a raid card, if the machine has hardware raid as well as or > instead of software raid). It is trivial to learn (or teach) that you can boot with root=/dev/sda1 instead of root=/dev/md1. All our technicians knows that. Indeed, most will not be able to recover the system in most cases anyway IF root raid will not start "by its own", but that's a different matter entirely. Sometimes it may be a trivial case (like we had just a few months ago: I asked our guy to go to the remote office to replace a drive and gave him a replacement, which contained raid component in the first partition.. and "stupid boot code" (Mine!.. sort of ;) descided to bring THAT array instead of the real raid array on original disks because the event counter was greather. So we ended up booting with root=/dev/sda1 ro init=/bin/sh and just zeroing the partition in the new drive.. which I simple forgot to do in the first place. All by phone, it took about 5 minutes to complete the whole task and bring the server up after the reboot that he performed when went there.. total downtime was about 15 minutes. After which, I logged into the system remotely, verified data integrity of existing arrays and added the new disk to all the arrays -- while the system was in production (just a bit slow) and our guy was having some tea (or whatever... ;) before going back. Having several root partitions is a good thing. I rarely need to tweak the boot process (that to say: it rarely fails, except of several stupid cases like the above; And when it fails, it is even possible to bring the system up while on phone with any non- technicial "monkey" from their remote office who don't even know latin letters (we're in russia so english isn't our native language; when describing what to do I sometimes tell to press "cyrillic" keys on the keyboard to produce latin characters... ;) And yes, any bug that "crashes" this damn root system will mess with the whole thing, with all mirrors.. which is a fatal error, so to say. Well.. if you can't trust the processor for example to correctly multiple 2*2 and want to use two (or more) processors just to be sure one isn't lying.. such systems do exists too, but hey, they aren't cheap at all... ;) > Therefore, I have learned not to build a system that is more complicated > than the most simple human being that may administer it. This always > works - if it breaks AND they cannot fix it, then THEY get the blame. Perhaps our conditions are a bit different.. who kows. Raid1 IS simple -- both in implementation and in usage. 
Our guys are trained very hard to ensure they will NOT try to mess things up unless they're absolutely sure what they're doing. And I don't care who to blame (me or my technicians or whatever): we all will lose money in case of serious problems (we're managing networks/servers for $customers, they're paying us for the system to work with at most one business day downtime in case of any problem, with all their data intact -- we're speaking of remote locations here)... etc.. ;) If you don't trust your guys to do right (or to ask and *understand* when they don't know), or if your guys are doing mistakes all over -- perhaps it's a time to hire them off? ;) > And let's not get into what they can do to the labelling on the > partition types - FD? Must be a mistake! BTW, letting the kernel to start arrays is somewhat.. wrong (i mean the auto-detection and that "FD" type). Again, I learned it the hard way ;) The best is to have initrd and pass GUUID of your raid array into it (this is what I *don't* do currently ;) [] > Well, whenever I buy anything, I buy two. I buy two _controller_ cards, > and tape the extra one inside the case. But of course I buy two > machines, so that is four cards ... . Oh well... that all depends on the amount of $money. We have several "double-machines" too here, but only several. The rest of systems are single, with a single (usually onboard) scsi controller. > And I betcha softraid sb has changed format ver the years. I am still > running P100s! It isn't difficult to re-create the arrays, even remotely (if you're accurate enouth ofcourse). Not that it is needed too often either... ;) >>disks are really 35Gb or 37Gb; in case they're differ, "extra" space >>on large disk isn't used); root and /boot are on small raid1 partition >>which is mirrored on *every* disk; swap is on raid1; the rest (/usr, > > I like this - except of course that I rsync them, not raid them. I > don't mind if I have to reboot a server. Nobody will notice the tcp > outage and the other one of the pair will failover for it, albeit in > readonly mode, for the maximum of the few minutes required. Depends alot of your usage patterns, or tasks running, or other conditions (esp. physical presense). I for one can't afford rebooting many of them at all, for a very simple reason: many of them are on a very bad dialups, and some are several 100s KMs (or miles for that matter) away... ;) And also most of the boxes are running oracle with quite complex business rules, they're calculating some reports which sometimes takes quite some time to complete, esp at the end of the year. For this very reason -- they're all quite far away from me, out of reach -- I tend to ensure they WILL boot and will be able to dial out and bring that damn tunnel so I can log in and repair whatever is needed... ;) > Your swap idea is crazy, but crazy enough to be useful. YES, there used > to be a swap bug which corrupted swap every so often (in 2.0? 2.2?) and > meant one had to swapoff and swapon again, having first cleared all > processes by an init 1 and back. Obviously that bug would bite whatever > you had as media, but it still is a nice idea to have raided memory > :-). It sounds crazy at the first look, I wrote just that. But it helps to ensure the system is running as if nothing happened. 
I just receive email telling me node X has one array degraded, I log in there remotely, diagnose the problem and do whatever is needed (remapping the bad block, or arranging to send a guy there with a replacement drive when there will be such a chance.. whatever). The system continues working just fine for all that time (UNLESS another drive fails too ofcourse -- all the raid5 arrays will be dead and "urgent help" will be needed in that case -- at *that* time only it will be possible to reboot as many times as needed). The key point with my setup is that the system will continue working in case of any single drive failure. If I need more protection, I'll use raid6 or raid10 (or raid1 on several drives, whatever) so the system will continue working in case of multiple drive failures. But it will still be running, giving me a time/chance to diagnose the prob and find good schedule for our guys to come and fix things if needed -- maybe "right now", maybe "next week" or even "next month", depending on the exact problem. >>/home, /var etc) are on raid5 arrays (maybe also raid0 for some "scratch" > > I don't put /var on raid if I can help it. But there is nothing > particularly bad about it. It is just that /var is the most active > place and therefore the most likely to suffer damage of some kind, somehow. > And damaged raided partitions are really not nice. Raid does not > protect you against hardware corruption - on the contrary, it makes it > more difficult to spot and doubles the probabilities of it happening. Heh.. In our case, the most active (and largest) filesystem is /oracle ;) And yes I know using raid5 for a database isn't quite a good idea.. but that's entirely different topic (and nowadays, when raid5 checksumming is very cheap in terms of cpu, maybe it isn't that bad anymore ;) Yes raid5 is complex beast compared to raid1. Yes, raid5 may not be appropriate for some workloads. And still -- yes, this all is about a compromise in money one have and what he can afford to lose and for what duration, and how much (money again, but in this our case that'll be our money, not $client money) one can spend to fix the problem IF it will need to be fixed the "hard way" (so far we have two cases which required restoration the hard way, one was because $client used cheap hardware instead of our recommendations and non-ECC memory failed -- in that case, according to contract, it was $client who paid for the restore; and second was due to software error (oracle bug, now fixed), but we had "remote backup" of the database (it's a large distributed system and the data is replicated among several nodes), so I exported the "backup" and just re-initialized their database. And yes, we simulated various recovery scenarios in our own office on a toy data, to be sure we will be able to recover the thing the "hard way" if that'll be needed). >>space). This way, you have "equal" drives, and *any* drive, including >>boot one, may fail at any time and the system will continue working >>as if all where working, including reboot (except of a (very rare in >>fact) failure scenario when your boot disk has failed MBR or other >>sectors required to boot, but "the rest" of that disk is working, >>in which case you'll need physical presence to bring the machine up). > > That's actually not so. Over new year I accidently booted my home > server (222 days uptime!) and discovered its boot sector had evaporated. Uh-oh. So just replace the first and second (or third, 4th) disks and boot from that... 
;) Yes that can happen -- after all, lilo may have a bug, or a bios, or mbr code... But that's again about whenever you can afford to "trust the processor", above. Yes again, there are humans which tend to make mistakes (I made alot of mistakes in my life, oh, alot of them!, once I even formatted the "wrong disk" and lost half a year of our work.. and started doing some backups finally ;). I don't think there's anything that can protect against human mistakes -- I mean humans who manage the system, not who use it. > Well, maybe I moved the kernels .. anyway, it has no floppy and the > nearest boot cd was an hour's journey away in the cold, on new year. Uh > uh. It took me about 8 hrs, but I booted it via PXE DHCP TFTP > wake-on-lan and the wireless network, from my laptop, without leaving > the warm. Heh. Well, I always have bootable cd or a floppy, just in case. Not to say I really used that at least once (but I do know it contains all tools needed for boot and recovery). Yes, shit happens (tm) too... ;) >>All the drives are "symmetrical", usage patterns for all drives are >>the same, and due to usage of raid arrays, load is spread among them >>quite nicely. You're free to reorder the drives in any way you want, >>to replace any of them (maybe rearranging the rest if you're >>replacing the boot drive) and so on. > > You can do this hot? How? Oh, you must mean at reboot. Yes -- here I was speaking about the "worst case", when boot fails for whatever reason. I never needed the "boot floppy" just because of this: I can make any drive bootable just by changing the SCSI IDs, and the system will not notice anything changed (it will in fact: somewhere in dmesg you'll find "device xx was yy before" message from md, that's basically all). 99% of the systems we manage don't have hot-swap drives, so in case a drive have to be replaced, reboot is needed anyway. The nice thing is that I don't care which drive is being replaced (except of the boot one -- in that case our guys knows they have to set up another - any - drive as bootable), and when system boots, I just do for f in 1 2 3 5 6 7 8 9; do mdadm --add /dev/md$f /dev/sdX$f done (note the device numbering too: mdN is built of sd[abc..]N) and be done with that (not really *that* simple, I prefer to verify integrity of other drives before adding the new one, but that's just details). >>Yes, root fs does not changes often, and yes it is small enouth >>(I use 1Gb, or 512Mb, or even 256Mb for root fs - not a big deal > > Mine are always under 256MB, but I give 512MB. > >>to allocate that space on every of 2 or 3 or 4 or 5 disks). So >>it isn't quite relevant how fast the filesystem will be on writes, >>and hence it's ok to place it on raid1 composed from 5 components. > > That is, uh, paranoid. The point isn't about paranoia, it's about simplicitly. Or symmetry, which leads to that same simplicity again. They're all the same and can be used interchangeable, period. For larger amount of disks, such a layout may be not as practical, but it should work find with up to, say, 6 disks (but I'm somewhat afraid to use raid5 with 6 components, as the chance to have two failed drives, which is "fatal" for raid5, increases). >>The stuff just works, it is very simple to administer/support, >>and does all the "backups" automatically. > > Except that it doesn't - backups are not raid images. Backups are > snapshots. Maybe you mean that. "Live" backups ;)... with all the human errors on them too. Raid1 can't manage snapshots. 
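The failure mail and the replacement procedure Michael sketches above can be put together roughly as follows; the mail address, delay and device names are hypothetical examples (the replacement disk shows up as /dev/sdd, arrays md1..md9 are built from partition N of each disk):

  # mail on any degraded array (run from an init script):
  mdadm --monitor --scan --mail=admin@example.com --delay=300 --daemonise

  # after swapping the dead disk and rebooting:
  cat /proc/mdstat               # confirm which arrays are running degraded
  mdadm --detail /dev/md1        # shows the missing or faulty component
  mdadm --examine /dev/sda1      # sanity-check a surviving component's superblock first
  for f in 1 2 3 5 6 7 8 9; do
      mdadm --add /dev/md$f /dev/sdd$f
  done
  watch cat /proc/mdstat         # follow the resync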
>>In case of some problem >>(yes I dislike any additional layers for critical system components >>as any layer may fail to start during boot etc), you can easily >>bring the system up by booting off the underlying root-raid partiton >>to repair the system -- all the utilities are here. More, you can [] >>boot from one disk (without raid) and try to repair root fs on >>another drive (if things are really screwed up), and when you're >>done, bring the raid up on that repaired partition and add other >>drives to the array. > > But why bother? If you didn't have raid there on root you wouldn't > need to repair it. See above for a (silly) example -- wrong replacement disk ;) And indeed that's silly example. I once had another "test case" to deal with, when my root raid was composed of 3 components and each had different event counter, for whatever reason (I don't remember the details already) -- raid1 was refusing to start. It was with 2.2.something kernel -- things changed since that time alot, but the "testcase" still was here (it happened on our test machine in office). > Nothing is quite as horrible as having a > fubarred root partition. That's why I also always have two! But I > don't see that having the copy made by raid rather than rsync wins > you anything in the situaton where you have to reboot - rather, it > puts off that moment to a moment of your choosing, which may be good, > but is not an unqualified bonus, given the cons. It helps keeping the machine running *and* bootable even after losing the boot drive (for some reason our drives fails completely most of the time, instead of developing bad sectors; so next reboot (our remote offices have somewhat unstable power and gets rebooted from time to time) will be from the 2nd drive...). It saves you from the "newaliases" problem ("forgot to rsync" maybe silly, but even after a small change, remembering/repeating it after the recovery again (if the crash happened before rsync but after the change) -- I'm lazy ;) Yes this technique places more "load" on the administrator, because every his mistake gets mirrored automatically and immediately... There was a funny case with another box I installed for myself. It's in NY, USA (I'm in Russia, remember?), and there's noone at the colo facility who knows linux well enouth, and the box in question has no serial console (esp bios support). After installing the system (there was some quick-n-dirty linux "preinstalled" by the colo guys -- thanks god they created two partitions (2nd one was swap), -- I loaded another distro to the swap partition, rebooted and re-partitioned the rest moving /var etc into real place... Fun by itself, but that's not the point. After successeful install, "in a hurry", I did some equivalent of... rm -rf /* ! Just a typo, but WHAT typo! ;) I was doing all that from home over a dialup. I hit Ctrl-C when it wiped out /boot, /bin (mknod! chmod! cat!), /etc, /dev, and started removing /home (which was quite large). Damn fast system!.. ;) I had only one ssh window opened... With the help from uudecode (wich is in /usr/bin) and alot of cut-n-pasteing, I was able to create basic utilities on that system -- took them from asmutils site. Restored /dev somehow, cut-n-pasted small wget, and slowly reinstalled the whole system again (it's debian). 
Took me 3 damn hours of a very hard work to reinstall and configure and to ensure everything is ok to reboot -- I had no other chance to correct any boot mistakes, and, having in mind our bad phone lines, the whole procedure was looking almost impossible. I asked a friend of mine to log in before the reboot and to check if everything looks ok and it will actually boot. But it just worked. After that excersise, I was sleeping for more than 12 hours in a row, because I was really tired. That to say: ugh-oh, damn humans, there's nothing here to protect the poor machines from them, they always will find their ways to screw things up... ;) Yes, having non-raid rsynced backup helps in that case, and yes, such a case is damn rare... Dunno whichever is "right". After all, nothing stops to have BOTH mirrored AND backed-up root filesystem... ;) >>To summarize: having /boot and root on raid1 is a very *good* idea. ;) >>It saved our data alot of times in the past few years already. > > No - it saved you from taking the system down at that moment in time. > You could always have rebooted it from a spare root partition whether > you had raid there or not. Quite a problem when the system is away from you... ;) >>If you're worried about "silent data corruption" due to different >>data being read from different components of the raid array.. Well, >>first of all, we never saw that yet (we have quite good "testcase") > > It's hard to see, and youhave to crash and come back up quite a lot to > make it probable. A funky scsi cable would help you see it! We did alot of testing by our own too. Sure that's not cover every possible case. Yet all of the 200+ systems we manage are working just fine since 1999, with no single "bad" failure so far (I already mentioned 2 cases which aren't really count for obvious reasons). [] >>without drives with uncontrollable write caching (quite common for >>IDE drives) and things like that, and with real memory (ECC I mean), >>where you *know* what you're writing to each disk (yes, there's also >>another possible cause of a problem: software errors aka bugs ;), > > Indeed, and very frequent they are too. > >>that case with different data on different drives becomes quite.. >>rare. In order to be really sure, one can mount -o remount,ro / >>and just compare all components of the root raid, periodically. >>When there's more than 2 components on that array, it should be >>easy to determine which drive is "lying" in case of any difference. >>I do similar procedure on my systems during boot. > > Well, voting is one possible procedure. I don't know if softraid does > that anywhere, or attempts repairs. > > Neil? It does not do that.. yet. >>>>There is nowhere that is not software RAID to put the journals, so >>> >>>Well, you can make somewhere. You only require an 8MB (one cylinder) >>>partition. >> >>Note scsi disks in linux only supports up to 14 partitions, which > > You can use lvm (device mapper). Admittedly I was thinking of IDE. The same problem as with partitionable raid arrays, and with your statement about simplicity: lvm layout may be quite complex and quite difficult to repair *if* something goes really wrong. And.. oh, no IDE please, thanks alot!.. :) > If you like I can patch scsi for 63 partitions? I did that once myself - patched 2.2.something to have 63 partitions on scsi disks. But had alot of problems with other software after that, because some software assumes device 8,16 is sdb, and because I always have to remember to boot the "right" kernel. 
Nowadays, for me at least, things aren't that bad anymore. I was using (trying to anyway) raw devices with oracle instead of using the filesystem (oracle works better that way because of no double- caching in oracle and in the filesystem). Now there's such a thing as O_DIRECT which works just as good. Still, we use 8 partitions on most systems, and having single "journal partition" means we will need 16 partitions which is more than linux allows. >>[] I at least can reconstruct filesystem >>image by reading chunks of data from appropriate places from >>all drives and try to recover that image; with any additional > > Now that is just perverse. *If* things really goes wrong. I managed to restore fubared raid5 once this way, several years ago. Not that the approach is "practical" or "easy", but if you must and the bad has already happened... ;) [] >>Again: instead of using a partition for the journal, use (another?) >>raid array. This way, the system will work if the drive wich >>contains the journal fails. > > But the journal will also contain corruptions if the whole system > crashes, and is rebooted. You just spent several paragraphs (?) arguing > so. Do you really want those rolled forward to complete? I would > rather they were rolled back! I.e. that the journal were not there - > I am in favour of a zero size journal, in other words, which only acts > to guarantee atomicity of FS ops (FS code on its own may do that), but > which does not contain data. That's another case again. Trust your cpu? Trust the kernel? If the system can go havoc for some random reason and throw your (overwise prefectly valid) data away, there's nothing to protect it.. except of good backup, and, ofcourse, fixing the damn bug. And it's really irrelevant in this case whenever we have journal at all or not. If there IS a journal (my main reason to use it is that sometimes ext2fs can't repair on reboot without prompting (which is about to unacceptable to me because the system is remote and I need it to boot and "phone home" for repair), while with ext3 we had no case (yet) when it was unable to boot without human intervention), again, in my "usage case", it should be safe against disk failures, ie, the system should continue working the the drive where the journal is gets lost or develops bad sectors. For the same reason... ;) >>[] >> >>And I also want to "re-reply" to the first your message in this >>thread, where I was saying that "it's a nonsense that raid does >>not preserve write ordering". Ofcourse I mean not write ordering >>but working write barriers (as Neil pointed out, md subsystem does >>not implement write barriers directly but the concept is "emulated" >>by linux block subsystem). Write barriers should be sufficient to >>implement journalling safely. > > I am not confident that Neil did say so. I have not reexamined his > post, but I got the impression that he hummed and hawed over that. > I do not recall that he said that raid implements write barriers - > perhaps he did. Anyway, I do not recall any code to handle "special" > requests, which USED to be the kernel's barrier mechanism. Has that > mechanism changed (it could have!)? "Too bad" I haven't looked at the code *at all* (almost, really) ;) I saw numerous discussions here and there, but it's difficult to understand *that* amount of code with all the "edge cases". 
I just "believe" there's some way to know the data has been written and that it indeed has been written; and I know this is sufficient to build a "safe" (in some sense) filesystem (or whatever) based on this; and what ext3 IS that "safe" filesystem. Just believe, that's all... ;) BTW, thanks for a good discussion. Seriously. It's very rare one can see this level of expirience as you demonstrate. /mjt ^ permalink raw reply [flat|nested] 172+ messages in thread
* Debian Sarge mdadm raid 10 assembling at boot problem 2005-01-03 14:23 ` Peter T. Breuer 2005-01-03 18:30 ` maarten 2005-01-03 21:36 ` Michael Tokarev @ 2005-01-05 5:50 ` Roger Ellison 2005-01-05 13:41 ` Michael Tokarev 2 siblings, 1 reply; 172+ messages in thread From: Roger Ellison @ 2005-01-05 5:50 UTC (permalink / raw) To: linux-raid I've been having an enjoyable time tinkering with software raid with Sarge and the RC2 installer. The system boots fine with Raid 1 for /boot and Raid 5 for /. I decided to experiment with Raid 10 for /opt since there's nothing there to destroy :). Using mdadm to create a Raid 0 array from two Raid 1 arrays was simple enough, but getting the Raid 10 array activated at boot isn't working well. I used update-rc.d to add the symlinks to mdadm-raid using the defaults, but the Raid 10 array isn't assembled at boot time. After getting kicked to a root shell, if I check /proc/mdstat only md1 (/) is started. After running mdadm-raid start, md0 (/boot), md2, and md3 start. If I run mdadm-raid start again md4 (/opt) starts. Fsck'ing the newly assembled arrays before successfully issuing 'mount -a' shows no filesystem errors. I'm at a loss and haven't found any similar issue mentions on this list or the debian-users list. Here's mdadm.conf: DEVICE partitions DEVICE /dev/md* ARRAY /dev/md4 level=raid0 num-devices=2 UUID=bf3456d3:2af15cc9:18d816bf:d630c183 devices=/dev/md2,/dev/md3 ARRAY /dev/md3 level=raid1 num-devices=2 UUID=a51da14e:41eb27ad:b6eefb94:21fcdc95 devices=/dev/sdb5,/dev/sde5 ARRAY /dev/md2 level=raid1 num-devices=2 UUID=ac25a75b:3437d397:c00f83a3:71ea45de devices=/dev/sda5,/dev/sdc5 ARRAY /dev/md1 level=raid5 num-devices=4 spares=1 UUID=efec4ae2:1e74d648:85582946:feb98f0c devices=/dev/sda3,/dev/sdb3,/dev/sdc3,/dev/sde3,/dev/sdd3 ARRAY /dev/md0 level=raid1 num-devices=4 spares=1 UUID=04209b62:6e46b584:06ec149f:97128bfb devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sde1,/dev/sdd1 Roger ^ permalink raw reply [flat|nested] 172+ messages in thread
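One way to see the same two-pass behaviour outside the boot scripts is to assemble by hand against the same config, e.g. from a rescue shell with the arrays stopped; a hedged sketch:

  mdadm --assemble --scan --config=/etc/mdadm/mdadm.conf
  cat /proc/mdstat                              # md4 likely still missing at this point
  mdadm --assemble /dev/md4 /dev/md2 /dev/md3   # a second pass succeeds once md2/md3 exist
  mdadm --detail /dev/md4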
* Re: Debian Sarge mdadm raid 10 assembling at boot problem 2005-01-05 5:50 ` Debian Sarge mdadm raid 10 assembling at boot problem Roger Ellison @ 2005-01-05 13:41 ` Michael Tokarev 2005-01-05 13:57 ` [help] [I2O] Adaptec 2400A on FC3 Angelo Piraino 2005-01-05 19:15 ` Debian Sarge mdadm raid 10 assembling at boot problem Roger Ellison 0 siblings, 2 replies; 172+ messages in thread From: Michael Tokarev @ 2005-01-05 13:41 UTC (permalink / raw) To: Roger Ellison; +Cc: linux-raid [Please don't start new thread by replying to another message.] Roger Ellison wrote: > I've been having an enjoyable time tinkering with software raid with > Sarge and the RC2 installer. The system boots fine with Raid 1 for > /boot and Raid 5 for /. I decided to experiment with Raid 10 for /opt > since there's nothing there to destroy :). Using mdadm to create a Raid > 0 array from two Raid 1 arrays was simple enough, but getting the Raid Note there's special raid10 module in recent kernels (and supported by recent mdadm). I don't think it's very stable yet, but.. JFYI ;) > 10 array activated at boot isn't working well. I used update-rc.d to > add the symlinks to mdadm-raid using the defaults, but the Raid 10 array Shouldn't the links be made automatically when installing mdadm? Also, make sure you're using recent debian package of mdadm -- earlier versions was.. with issues concerning assembling the arrays (a problem specific to debian mdadm-raid scripts). Make sure AUTOSTART is set to true in /etc/default/mdadm (I'm not sure if that's really `AUTOSTART' and `true', you can look at the file and/or use dpkg-reconfigure mdadm to set that). > isn't assembled at boot time. After getting kicked to a root shell, if > I check /proc/mdstat only md1 (/) is started. After running mdadm-raid > start, md0 (/boot), md2, and md3 start. If I run mdadm-raid start again > md4 (/opt) starts. Fsck'ing the newly assembled arrays before > successfully issuing 'mount -a' shows no filesystem errors. I'm at a > loss and haven't found any similar issue mentions on this list or the > debian-users list. Here's mdadm.conf: You have two problems. First of all, mdadm-raid should be started at very early in the boot process, and mdadm package post-install scripts ensures this (you added mdadm-raid links at default order which is 20; but it should run before filesystem mounts etc, nearly 01 or something). Ditto for stop scripts -- at the very end, after umounting the filesystems. Take a look at /var/lib/dpkg/info/mdadm.postinst -- it sets up the links properly. When you correct this, your system will go further in boot process... ;) And second problem is the order of lines in mdadm.conf. [edited a bit] > DEVICE partitions > DEVICE /dev/md* > ARRAY /dev/md4 level=raid0 num-devices=2 devices=/dev/md2,/dev/md3 > ARRAY /dev/md3 level=raid1 num-devices=2 devices=/dev/sdb5,/dev/sde5 > ARRAY /dev/md2 level=raid1 num-devices=2 devices=/dev/sda5,/dev/sdc5 Re-order the lines so that md4 will be listed AFTER all it's components. Mdadm tries to assemble them in turn as they're listed in mdadm.conf. But at the time when it tries to start md4, it's components aren't here yet so it fails. HTH. /mjt ^ permalink raw reply [flat|nested] 172+ messages in thread
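Putting both of Michael's fixes together: list md4 after its components in the config, and recreate the init links with an early start priority. The priorities shown are the stock Sarge ones; check /var/lib/dpkg/info/mdadm.postinst for the exact values your version uses:

  # /etc/mdadm/mdadm.conf, stacked array listed last:
  DEVICE partitions
  DEVICE /dev/md*
  ARRAY /dev/md3 level=raid1 num-devices=2 UUID=a51da14e:41eb27ad:b6eefb94:21fcdc95 devices=/dev/sdb5,/dev/sde5
  ARRAY /dev/md2 level=raid1 num-devices=2 UUID=ac25a75b:3437d397:c00f83a3:71ea45de devices=/dev/sda5,/dev/sdc5
  ARRAY /dev/md1 level=raid5 num-devices=4 spares=1 UUID=efec4ae2:1e74d648:85582946:feb98f0c devices=/dev/sda3,/dev/sdb3,/dev/sdc3,/dev/sde3,/dev/sdd3
  ARRAY /dev/md0 level=raid1 num-devices=4 spares=1 UUID=04209b62:6e46b584:06ec149f:97128bfb devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sde1,/dev/sdd1
  ARRAY /dev/md4 level=raid0 num-devices=2 UUID=bf3456d3:2af15cc9:18d816bf:d630c183 devices=/dev/md2,/dev/md3

  # re-create the init links with an early start priority:
  update-rc.d -f mdadm-raid remove
  update-rc.d mdadm-raid start 25 S . start 50 0 6 .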
* [help] [I2O] Adaptec 2400A on FC3
  2005-01-05 13:41 ` Michael Tokarev
@ 2005-01-05 13:57 ` Angelo Piraino
  2005-01-05 19:15 ` Debian Sarge mdadm raid 10 assembling at boot problem Roger Ellison
  1 sibling, 0 replies; 172+ messages in thread
From: Angelo Piraino @ 2005-01-05 13:57 UTC (permalink / raw)
To: linux-raid
I'm having some trouble with an Adaptec 2400A ATA RAID controller on FC3.
Everything worked fine during installation and the operating system
recognized the RAID card automatically. I also installed raidutils-0.0.4
from http://i2o.shadowconnect.com, but when I try to use the raidutil
command this is the result:
> [root@omissis ~]# raidutil -L
> osdIOrequest : File /dev/dpti17 Could Not Be Opened
> Engine connect failed: COMPATIBILITY number
I need to manage the RAID card, at least to display the status of the
RAID array. Can anybody help me?
Thanks
Angelo Piraino
Linux FC3 on PIII 450MHz - chipset Intel P2BF - 256Mb Ram - Ata Raid Adaptec 2400A
^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: Debian Sarge mdadm raid 10 assembling at boot problem 2005-01-05 13:41 ` Michael Tokarev 2005-01-05 13:57 ` [help] [I2O] Adaptec 2400A on FC3 Angelo Piraino @ 2005-01-05 19:15 ` Roger Ellison 1 sibling, 0 replies; 172+ messages in thread From: Roger Ellison @ 2005-01-05 19:15 UTC (permalink / raw) To: linux-raid On Wed, 2005-01-05 at 08:41, Michael Tokarev wrote: > [Please don't start new thread by replying to another message.] > > Roger Ellison wrote: > > I've been having an enjoyable time tinkering with software raid with > > Sarge and the RC2 installer. The system boots fine with Raid 1 for > > /boot and Raid 5 for /. I decided to experiment with Raid 10 for /opt > > since there's nothing there to destroy :). Using mdadm to create a Raid > > 0 array from two Raid 1 arrays was simple enough, but getting the Raid > > Note there's special raid10 module in recent kernels (and supported by > recent mdadm). I don't think it's very stable yet, but.. JFYI ;) I didn't see Raid 10 mentioned in the mdadm man page. Do you have the command line syntax handy? I don't have access to the machine now, but I believe it's mdadm version 1.7 (from August '04). > > > 10 array activated at boot isn't working well. I used update-rc.d to > > add the symlinks to mdadm-raid using the defaults, but the Raid 10 array > > Shouldn't the links be made automatically when installing mdadm? > Also, make sure you're using recent debian package of mdadm -- > earlier versions was.. with issues concerning assembling the > arrays (a problem specific to debian mdadm-raid scripts). > Make sure AUTOSTART is set to true in /etc/default/mdadm > (I'm not sure if that's really `AUTOSTART' and `true', > you can look at the file and/or use dpkg-reconfigure mdadm > to set that). > The symlinks on Sarge installation were: rc0.d --> S50mdadm-raid rc6.d --> S50mdadm-raid rcS.d --> S25mdadm-raid I removed them before running 'update-rc.d'. > > isn't assembled at boot time. After getting kicked to a root shell, if > > I check /proc/mdstat only md1 (/) is started. After running mdadm-raid > > start, md0 (/boot), md2, and md3 start. If I run mdadm-raid start again > > md4 (/opt) starts. Fsck'ing the newly assembled arrays before > > successfully issuing 'mount -a' shows no filesystem errors. I'm at a > > loss and haven't found any similar issue mentions on this list or the > > debian-users list. Here's mdadm.conf: > > You have two problems. > First of all, mdadm-raid should be started at very early in the > boot process, and mdadm package post-install scripts ensures this > (you added mdadm-raid links at default order which is 20; but > it should run before filesystem mounts etc, nearly 01 or something). > Ditto for stop scripts -- at the very end, after umounting the > filesystems. Take a look at /var/lib/dpkg/info/mdadm.postinst -- > it sets up the links properly. When you correct this, your system > will go further in boot process... ;) > > And second problem is the order of lines in mdadm.conf. > > [edited a bit] > > DEVICE partitions > > DEVICE /dev/md* > > ARRAY /dev/md4 level=raid0 num-devices=2 devices=/dev/md2,/dev/md3 > > ARRAY /dev/md3 level=raid1 num-devices=2 devices=/dev/sdb5,/dev/sde5 > > ARRAY /dev/md2 level=raid1 num-devices=2 devices=/dev/sda5,/dev/sdc5 > > Re-order the lines so that md4 will be listed AFTER all it's > components. Mdadm tries to assemble them in turn as they're > listed in mdadm.conf. But at the time when it tries to start > md4, it's components aren't here yet so it fails. 
> Reordering the ARRAY entries has no effect. 'mdadm-raid start' has to be run twice, same as before. Changing the start symlinks from '20' to '13' (just after syslogd) doesn't help. (sigh). > HTH. > > /mjt > - Roger ^ permalink raw reply [flat|nested] 172+ messages in thread
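For anyone hitting the same problem, re-creating the links in the stock Sarge positions (rather than the update-rc.d defaults) would look roughly like the sketch below. The sequence numbers are taken from the original layout Roger lists above; as Michael notes, the package's postinst in /var/lib/dpkg/info/mdadm.postinst is the authoritative reference, so treat this only as an approximation:

  # drop the links added at the default order
  update-rc.d -f mdadm-raid remove
  # start early in rcS, stop late in the halt/reboot runlevels
  update-rc.d mdadm-raid start 25 S . stop 50 0 6 .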
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-02 19:42 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Andy Smith 2005-01-02 20:18 ` Peter T. Breuer @ 2005-01-05 9:56 ` Andy Smith 2005-01-05 10:44 ` Alvin Oga 1 sibling, 1 reply; 172+ messages in thread From: Andy Smith @ 2005-01-05 9:56 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 2767 bytes --] As a small request please could those who are posting opinions on the following: - What level of schooling any person's mathematical ability resembles - Whether Peter's computing environment is sane or not - IQ level and admin skills of various persons - Issues of p2p, rootkits, unruly students, etc. etc. Plus various other personal attacks and ad hominem: Please consider if what you are writing is relevant to this list (linux-raid) or the subject of this thread (whether it is wise to put the journal for an ext3 filesystem internal to the filesystem when it is on a RAID-1 mirror). Obviously I am not a moderator or even anyone of any influence, but the majority of text I am now seeing in this thread is not useful to read and (since it is archived) may actually be giving a bad impression of its poster for all time. From what I can understand of the thread so far, Peter is saying the following: RAID mirrors are susceptible to increasing undetectable inconsistencies because, as we all know, filesystems sustain corruption over time. On a filesystem that runs from one disk, corruption serious enough to affect the stability of the file system will do so and so will be detected. As more disks are added to the mirror, the probability of that corruption never being seen naturally goes up. Peter personally does not put the journal inside the mirror because if he ever came to need to use the journal and found that it was corrupted, it could risk his whole filesystem. Peter prefers to put the journal on a separate device that is not mirrored. I am not trying to put words into your mouth Peter, just trying to summarise what your points are. If I haven't represented your views correctly then by all means correct me but please try to do so succinctly and informatively. Now, others are saying in response to this, things like: Spontaneous corruption is rare compared to outright or catastrophic device failure, and although it is more likely to go unnoticed with RAID mirrors, while it IS unnoticed, this presumably correct data is also being rewritten back to the filesystem. Mirrors help protect against the more common complete device failure and so a journal should surely be on a mirror since if you lose the journal then the machine needs to go down anyway. It is unavailability of the server we're trying to avoid; consistency of the data can be protected with regular backups and possibly measured with other methods like md5sum. Discuss? ;) [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
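As a concrete example of the last point, a consistency baseline can be kept with nothing more than md5sum; a minimal sketch (the /data path and checksum file location are only placeholders):

  # record checksums once, while the data is known good
  find /data -type f -print0 | xargs -0 md5sum > /root/data.md5
  # later, report anything that no longer matches
  md5sum -c /root/data.md5 | grep -v ': OK$'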
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:56 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Andy Smith @ 2005-01-05 10:44 ` Alvin Oga 2005-01-05 10:56 ` Brad Campbell 0 siblings, 1 reply; 172+ messages in thread From: Alvin Oga @ 2005-01-05 10:44 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid hi ya andy good summary ... thanx.. one more item :-) On Wed, 5 Jan 2005, Andy Smith wrote: .. > From what I can understand of the thread so far, Peter is saying the > following: > > RAID mirrors are susceptible to increasing undetectable > inconsistencies because, as we all know, filesystems sustain > corruption over time. > > On a filesystem that runs from one disk, corruption serious > enough to affect the stability of the file system will do so > and so will be detected. As more disks are added to the > mirror, the probability of that corruption never being seen > naturally goes up. > > Peter personally does not put the journal inside the mirror > because if he ever came to need to use the journal and found > that it was corrupted, it could risk his whole filesystem. > Peter prefers to put the journal on a separate device that > is not mirrored. > > I am not trying to put words into your mouth Peter, just trying to > summarise what your points are. If I haven't represented your views > correctly then by all means correct me but please try to do so > succinctly and informatively. > > Now, others are saying in response to this, things like: > > Spontaneous corruption is rare compared to outright or > catastrophic device failure, and although it is more > likely to go unnoticed with RAID mirrors, while it IS > unnoticed, this presumably correct data is also being rewritten > back to the filesystem. > > Mirrors help protect against the more common complete device > failure and so a journal should surely be on a mirror since > if you lose the journal then the machine needs to go down > anyway. It is unavailability of the server we're trying to > avoid; consistency of the data can be protected with regular > backups and possibly measured with other methods like > md5sum. some other issues ... how one can detect failures, errors would be completely up to the tools they use ... various tools does specific functions and cannot tell you anything about any other causes of the problems for swap ... i personally don't see any reason to mirror swap partitions ... - once the system dies, ( power off ), all temp data is useless unless one continues from a coredump ( from the same state as when it went down initially ) if a disk did fail, die, error, hiccup, then whatever cause the problem can also affect the data and the metadata and the parity and the "mirror" - which set of "bytes" on the disk "raid" trust to restore from is up to the code and its predefined set of assumptions of various failure modes - partially written data is very bad thing to have unless you know "exactly why and how if failed/eerrored, there is no sane way to bet the house on which data is more correct than the other - i'm excluding bad memory, bad cpu, bad power supply from the lsit of possible problems and yes, bad (generic) memory has corrupted my systems once in 10 yrs... have fun alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
  2005-01-05 10:44 ` Alvin Oga
@ 2005-01-05 10:56 ` Brad Campbell
  2005-01-05 11:39 ` Alvin Oga
  ` (2 more replies)
  0 siblings, 3 replies; 172+ messages in thread
From: Brad Campbell @ 2005-01-05 10:56 UTC (permalink / raw)
To: Alvin Oga; +Cc: Andy Smith, linux-raid
Alvin Oga wrote:
>
> for swap ... i personally don't see any reason to mirror
> swap partitions ...
>   - once the system dies, ( power off ), all temp
>   data is useless unless one continues from a coredump
>   ( from the same state as when it went down initially )
I beg to differ on this one. Having spent several weeks tracking down
random processes dying on a machine, which turned out to be caused by a
bad sector in the swap partition, I have had great results by running
swap on a RAID-1. If you develop a bad sector in a non-mirrored swap,
bad things happen indeterminately and can be a royal PITA to chase down.
It's just a little extra peace of mind.
Regards,
Brad
^ permalink raw reply [flat|nested] 172+ messages in thread
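For completeness, putting swap on a RAID-1 as Brad describes needs nothing special; a minimal sketch, assuming two spare partitions and a free md number (all device names here are placeholders):

  mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sda6 /dev/sdb6
  mkswap /dev/md5
  swapon /dev/md5
  # and in /etc/fstab
  /dev/md5   none   swap   sw   0   0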
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 10:56 ` Brad Campbell @ 2005-01-05 11:39 ` Alvin Oga 2005-01-05 12:02 ` Brad Campbell 2005-01-05 14:12 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Erik Mouw 2005-01-05 15:17 ` Guy 2 siblings, 1 reply; 172+ messages in thread From: Alvin Oga @ 2005-01-05 11:39 UTC (permalink / raw) To: Brad Campbell; +Cc: Andy Smith, linux-raid hi ya brad On Wed, 5 Jan 2005, Brad Campbell wrote: > Alvin Oga wrote: > > > > for swap ... i personally don't see any reason to mirror > > swap partitions ... > > - once the system dies, ( power off ), all temp > > data is useless unless one continues from a coredump > > ( from the same state as when it went down initially ) > > I beg to differ on this one. Having spend several weeks tracking down random processes dying on a > machine that turned out to be a bad sector in the swap partition, I have had great results by > running swap on a RAID-1. If you develop a bad sector in a non-mirrored swap, bad things happen > indeterminately and can be a royal PITA to chase down. It's just a little extra piece of mind. okay .... if the parts of disks is bad that is used for swap, mirroring might help ... but, i wonder, how/why the system used that portion of swap in the first place - even for raid, if sector-10 in swap is bad, why would raid keep trying to write there instead of to sector-1000 ( the systems should be writing data, and read it back to verify ( what it wrote to disk, not the disk cache, is the same as what ( it just read back, before it says, "data is written" and i spent days trackign down a bad memory ... that killed the raid ... - various backups the only reasonable way to make sure [supposedly good] data is not lost where backups does NOT overwrite previous[ly good] backups c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 11:39 ` Alvin Oga @ 2005-01-05 12:02 ` Brad Campbell 2005-01-05 13:23 ` Alvin Oga 0 siblings, 1 reply; 172+ messages in thread From: Brad Campbell @ 2005-01-05 12:02 UTC (permalink / raw) To: Alvin Oga; +Cc: Andy Smith, linux-raid Alvin Oga wrote: > > hi ya brad > > On Wed, 5 Jan 2005, Brad Campbell wrote: > > >>Alvin Oga wrote: >> >>> for swap ... i personally don't see any reason to mirror >>> swap partitions ... >>> - once the system dies, ( power off ), all temp >>> data is useless unless one continues from a coredump >>> ( from the same state as when it went down initially ) >> >>I beg to differ on this one. Having spend several weeks tracking down random processes dying on a >>machine that turned out to be a bad sector in the swap partition, I have had great results by >>running swap on a RAID-1. If you develop a bad sector in a non-mirrored swap, bad things happen >>indeterminately and can be a royal PITA to chase down. It's just a little extra piece of mind. > > > okay .... if the parts of disks is bad that is used for swap, > mirroring might help ... > > but, i wonder, how/why the system used that portion of swap in the first > place > - even for raid, if sector-10 in swap is bad, why would raid keep > trying to write there instead of to sector-1000 Picture lpd gets swapped out on a friday night. Over the weekend it is not used and the drive develops a bad sector in the middle of the file. Monday morning I want to print and the system tries to page lpd back in again. *boom*. I have not looked at the system swap algorithms, but I doubt they include automatic bad block management and read after write verification. I'm making big assumptions here, but I'm assuming they rely on a bad block table created by mkswap and an otherwise clean, functioning swap area. Regards, Brad ^ permalink raw reply [flat|nested] 172+ messages in thread
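For what it's worth, mkswap can scan for bad blocks when the swap area is created, which covers the initial state but not blocks that develop later; a minimal sketch (the partition name is a placeholder):

  # check the partition for bad blocks before writing the swap signature
  mkswap -c /dev/sda6
  swapon /dev/sda6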
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 12:02 ` Brad Campbell @ 2005-01-05 13:23 ` Alvin Oga 2005-01-05 13:33 ` Brad Campbell 2005-01-05 13:36 ` Swap should be mirrored or not? (was Re: ext3 journal on software raid) Andy Smith 0 siblings, 2 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-05 13:23 UTC (permalink / raw) To: Brad Campbell; +Cc: linux-raid On Wed, 5 Jan 2005, Brad Campbell wrote: > Picture lpd gets swapped out on a friday night. Over the weekend it is not used and the drive > develops a bad sector in the middle of the file. Monday morning I want to print and the system tries > to page lpd back in again. *boom*. admin issue ... user related services like printers should NOT be on critical servers and take down what everybody will notice due to unrelated printer problems swap ... swap partitions is by default checked for bad blocks during formatting as swap ... - it is highly unlikely that you'd get a bad sector in swap space - any normal bad things happening to user area of the disks will also happen to swap space - but its unlikely that swap will have bad blocks, while its more likely that users did nto do a badblock check during formatting across the 100GB or 300GB disks ... and even more time twiddling your thumbs on raid'd disks c ya alvin > I have not looked at the system swap algorithms, but I doubt they include automatic bad block > management and read after write verification. I'm making big assumptions here, but I'm assuming they > rely on a bad block table created by mkswap and an otherwise clean, functioning swap area. > > Regards, > Brad > ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 13:23 ` Alvin Oga @ 2005-01-05 13:33 ` Brad Campbell 2005-01-05 14:44 ` parts -- " Alvin Oga 2005-01-05 13:36 ` Swap should be mirrored or not? (was Re: ext3 journal on software raid) Andy Smith 1 sibling, 1 reply; 172+ messages in thread From: Brad Campbell @ 2005-01-05 13:33 UTC (permalink / raw) To: Alvin Oga; +Cc: linux-raid Alvin Oga wrote: > > On Wed, 5 Jan 2005, Brad Campbell wrote: > > >>Picture lpd gets swapped out on a friday night. Over the weekend it is not used and the drive >>develops a bad sector in the middle of the file. Monday morning I want to print and the system tries >>to page lpd back in again. *boom*. > > > admin issue ... user related services like printers should NOT be on > critical servers and take down what everybody will notice due to unrelated > printer problems Ok, perhaps bad example. Pick any service that may get swapped out. Hell.. any bad block in any swap space that develops while that block is in use is going to cause a problem. > swap ... swap partitions is by default checked for bad blocks during > formatting as swap ... > - it is highly unlikely that you'd get a bad sector in swap space Yeah? I have 3 hard disks sitting on my desk that say you are wrong.. > > - any normal bad things happening to user area of the disks will also > happen to swap space I agree. > - but its unlikely that swap will have bad blocks, while its > more likely that users did nto do a badblock check during > formatting across the 100GB or 300GB disks ... and even more > time twiddling your thumbs on raid'd disks I strongly disagree. I have a 250GB drive I just swapped out that was "growing" bad sectors at the rate of 3 per day that did a clean badblocks 5 months ago when it was installed. I refuse to argue this any more. If you want to tell the world that there is no benefit to swap on raid, go ahead. Brad ^ permalink raw reply [flat|nested] 172+ messages in thread
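The badblocks runs mentioned here would look roughly like this (the device name is a placeholder; the read-write test must not be run on a mounted filesystem):

  # read-only scan, safe on a live disk
  badblocks -sv /dev/hdc
  # non-destructive read-write test, more thorough
  badblocks -svn /dev/hdc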
* parts -- Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 13:33 ` Brad Campbell @ 2005-01-05 14:44 ` Alvin Oga 2005-01-19 4:46 ` Clemens Schwaighofer 0 siblings, 1 reply; 172+ messages in thread From: Alvin Oga @ 2005-01-05 14:44 UTC (permalink / raw) To: Brad Campbell; +Cc: linux-raid On Wed, 5 Jan 2005, Brad Campbell wrote: > Ok, perhaps bad example. Pick any service that may get swapped out. Hell.. any bad block in any swap > space that develops while that block is in use is going to cause a problem. yup ... if you have an unstable system .. swapping things in and out of memory and disks will cause you problems > > swap ... swap partitions is by default checked for bad blocks during > > formatting as swap ... > > - it is highly unlikely that you'd get a bad sector in swap space > > Yeah? I have 3 hard disks sitting on my desk that say you are wrong.. that was my whole initial point a few days ago -- get better hardware -- get your hardware from another source -- i do NOT have hardware problems ... - i do NOT buy parts from mthe cheapest place - i do NOT buy parts from people that i had bad parts - i do get an occasional bad part, like the ibm deathstars are infamous - i do get an occasional 1% infant mortality rate ... which is normal - once a server has been up for say 30-60 days ... it stays up for years .... - given the same identical hard disks from different vendors/stores, you will get different failure rates - i use maxtor/quantum, ibm, western digital, seagates, fujitsu ( no one is better than the other disks, other than ( the stupid ibm deathstar problem - how you install it - how you cool it makes all the difference in the world - think about it ... == we all use the same motherboards == we all use the same disks == we all use the same memory == we all use the same linux == whats different ?? ( where you bought yours from and more importantly, ( how you cool things down - i do NOT have problems you guyz are having .... and i've personally use thousnds of disks === get better hardware .. or more to the point ... buy from a computer parts only tier-1 vendor ... not from the millions of me-too online "we are the cheapest mom-n-pop we sell camera's and clothes too webstores" > I have a 250GB drive I just swapped out that was "growing" bad sectors at the rate of 3 per day that > did a clean badblocks 5 months ago when it was installed. you're buying bad hardware from bad vendors c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: parts -- Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 14:44 ` parts -- " Alvin Oga @ 2005-01-19 4:46 ` Clemens Schwaighofer 2005-01-19 5:05 ` Alvin Oga 0 siblings, 1 reply; 172+ messages in thread From: Clemens Schwaighofer @ 2005-01-19 4:46 UTC (permalink / raw) To: Alvin Oga; +Cc: Brad Campbell, linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 01/05/2005 11:44 PM, Alvin Oga wrote: >>I have a 250GB drive I just swapped out that was "growing" bad sectors at the rate of 3 per day that >>did a clean badblocks 5 months ago when it was installed. > > > you're buying bad hardware from bad vendors seriously, thats just wrong. Ever heard of IBM "death"start HDs? Or other stuff? As long as you use IDE hardware you are always close to fail, IDE hardware is done to put much data on cheap disks. If you use IDE stuff in servers, make all raid. And SWAP too, SWAP can be used heavilty, and if there are bad blocks you are hit too. A expensive vendor doesn't help you from crappy HW insealf. - -- [ Clemens Schwaighofer -----=====:::::~ ] [ TBWA\ && TEQUILA\ Japan IT Group ] [ 6-17-2 Ginza Chuo-ku, Tokyo 104-0061, JAPAN ] [ Tel: +81-(0)3-3545-7703 Fax: +81-(0)3-3545-7343 ] [ http://www.tequila.co.jp http://www.tbwajapan.co.jp ] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.6 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFB7eYrjBz/yQjBxz8RAnoeAKDs4ZdE+DPktmI8R/yBnFs44MvliQCg5OYy Omq0O6Y5xqWNUFbmHorLyr8= =GTZH -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: parts -- Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-19 4:46 ` Clemens Schwaighofer @ 2005-01-19 5:05 ` Alvin Oga 2005-01-19 5:49 ` Clemens Schwaighofer 0 siblings, 1 reply; 172+ messages in thread From: Alvin Oga @ 2005-01-19 5:05 UTC (permalink / raw) To: Clemens Schwaighofer; +Cc: linux-raid hi ya clemen On Wed, 19 Jan 2005, Clemens Schwaighofer wrote: > On 01/05/2005 11:44 PM, Alvin Oga wrote: > > > you're buying bad hardware from bad vendors > > seriously, thats just wrong. Ever heard of IBM "death"start HDs? Or > other stuff? As long as you use IDE hardware you are always close to > fail, IDE hardware is done to put much data on cheap disks. the "deathstar" ( from thailand to be specific ) disks are the exception and the worst of the bunch majority of all my dead disks are scsi disks .. - the so called better scsi is in fact worst probably because it runs hotter ( spinning faster ) where cooling the case will not help the ball bearings or cooling the disk controller's chips ( scsi: seagate, ibm, fujitsu, .. ( ide: seagate, ibm, fujitsu, wd, maxtor, quantum ... ide disks are warranteed for 1,3 or 5 years and some have a 1,000,000 MTBF ... what you do to it to destroy the mtbf is what makes the difference of how long it will in fact last in your environment and most of my ide disks are approaching 8-10 years and no problems though most of the cpu's or mb have long since been offline for too slow of a cpu -- most all are in offices or 65F server rooms how you build systems does make a difference.. - lots of fans ( at least 2 per ide disk, just in case one dies ) ( that is the same as with scsi .. now you can compare ) > A expensive vendor doesn't help you from crappy HW insealf. lots ... 90% f crappy vendors ... - bad disk or wrong disks - bad mb or wrong mb - bad nic or wrong nic - or no rma number to return bad parts - .. on and on .. - i buy thru tier1 distributors and don't have any "noticeable" bad parts ( maybe .1% failure ( doa )) over 1,000s of ide disks in the past year or two and lasts way past its warranty period == == everybody has their good and bad ideas of what are good disks and not == and does not mean that otehrs should expect the same problem == when everybody buys disks from different vendors and definitely == installed differently ( no redundant fans to cool 7200rpm ide disks ) == operating temp on ide disks should be 25-30C or less per "hddtemp" == c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: parts -- Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-19 5:05 ` Alvin Oga @ 2005-01-19 5:49 ` Clemens Schwaighofer 2005-01-19 7:08 ` Alvin Oga 0 siblings, 1 reply; 172+ messages in thread From: Clemens Schwaighofer @ 2005-01-19 5:49 UTC (permalink / raw) To: Alvin Oga; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 01/19/2005 02:05 PM, Alvin Oga wrote: > hi ya clemen > > On Wed, 19 Jan 2005, Clemens Schwaighofer wrote: > > >>On 01/05/2005 11:44 PM, Alvin Oga wrote: >> >> >>>you're buying bad hardware from bad vendors >> >>seriously, thats just wrong. Ever heard of IBM "death"start HDs? Or >>other stuff? As long as you use IDE hardware you are always close to >>fail, IDE hardware is done to put much data on cheap disks. > > > the "deathstar" ( from thailand to be specific ) disks are the exception > and the worst of the bunch > > majority of all my dead disks are scsi disks .. > - the so called better scsi is in fact worst probably > because it runs hotter ( spinning faster ) > where cooling the case will not help the ball bearings > or cooling the disk controller's chips > > ( scsi: seagate, ibm, fujitsu, .. > ( ide: seagate, ibm, fujitsu, wd, maxtor, quantum ... Since I do SysAdmin as a "get money for it" service I had not a single SCSI disk die (call it luck). But I saw so many dead IDE disks, that I never ever will use IDE stuff in a mission critical or any system that has to run more reliable. I think SCSI disks are built for longer usage, and the thing "faster -> hotter" for SCSI disks is bullshit. IDE disks run also ~7.500 and more. I have a bunch of 15K SCSI disks in use and the rest is 10K SCSI disks, they (10K) run since 3.5 years without a problem. > and most of my ide disks are approaching 8-10 years and no problems > though most of the cpu's or mb have long since been offline for > too slow of a cpu -- most all are in offices or 65F server rooms well thats another imporant thing. Where are those boxes. If you store them in an un-airconditioned room, that is in a building with a central heating, and in a country where you have >2 Months of >30C a day. Then you just call for a server problem. > how you build systems does make a difference.. > - lots of fans ( at least 2 per ide disk, just in case one dies ) > ( that is the same as with scsi .. now you can compare ) well, "I", never build a single Server. I buy them from HP, Supermicro, etc. I am out of the "build all myself" age ... >>A expensive vendor doesn't help you from crappy HW insealf. > > lots ... 90% f crappy vendors ... > - bad disk or wrong disks > - bad mb or wrong mb > - bad nic or wrong nic > - or no rma number to return bad parts > - .. on and on .. > > - i buy thru tier1 distributors and don't have any "noticeable" bad parts > ( maybe .1% failure ( doa )) over 1,000s of ide disks in the past year > or two and lasts way past its warranty period still, you are not save of bad parts, you are perhaps more save to get them replaced with more trustworthy distributors. 
- -- [ Clemens Schwaighofer -----=====:::::~ ] [ TBWA\ && TEQUILA\ Japan IT Group ] [ 6-17-2 Ginza Chuo-ku, Tokyo 104-0061, JAPAN ] [ Tel: +81-(0)3-3545-7703 Fax: +81-(0)3-3545-7343 ] [ http://www.tequila.co.jp http://www.tbwajapan.co.jp ] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.6 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFB7fTLjBz/yQjBxz8RAqyRAJ90VF9uAbmyh4bfJaGho/CN/aTyQgCg4aqK PWGToqzGBLgKHgolF2fqp/Y= =woCo -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: parts -- Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-19 5:49 ` Clemens Schwaighofer @ 2005-01-19 7:08 ` Alvin Oga 0 siblings, 0 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-19 7:08 UTC (permalink / raw) To: Clemens Schwaighofer; +Cc: linux-raid hi ya clemens - yup.. i agree in general except for your "bull shit" comment :-) ( see below ) On Wed, 19 Jan 2005, Clemens Schwaighofer wrote: > Since I do SysAdmin as a "get money for it" service I had not a single > SCSI disk die (call it luck). i think lots of people, probably everybody in here get $$$ for what their doing with the disks they bought and yeah... i'd say you're lucky with the scsi disks you have > But I saw so many dead IDE disks, that I > never ever will use IDE stuff in a mission critical or any system that > has to run more reliable. if its mission critical... i usully have 3 independent synchronized systems doing same "mission critical functions" - "scsi" is not the only answer > I think SCSI disks are built for longer usage, and the thing "faster -> > hotter" for SCSI disks is bullshit. ahh ... i assume you stick your finger on a scsi disks i assume you stick you finger on a ide disks thermodymamics/physics says "most things that go faster will run hotter" and besides .. stick a thermometer and measure it ... no need to bullshit about it - comparing a 15K scsi against an itty-bitty 5400rpm ide is sorta silly and the "comparer" should be shot - comparing a 10K scsi against a 10K ide disk might be noteworthy even if those 10K scsi is year or two older technology, but at 10K rpm.... mechanical problems are the same > well, "I", never build a single Server. I buy them from HP, Supermicro, > etc. I am out of the "build all myself" age ... and you left out the crappy stuff that people tend to buy ... compaqs and dells... which is what i get the calls to come fix for them > still, you are not save of bad parts, you are perhaps more save to get > them replaced with more trustworthy distributors. bingo on "trustworthy distributors" .. that is the 99% of the "right key" since they sell by the millions/billions of those widgets, they wouldnt be carrying it if it was gonna become an rma or warranty problem for them and they probably sell the "same thing" to millions of other customers c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Swap should be mirrored or not? (was Re: ext3 journal on software raid) 2005-01-05 13:23 ` Alvin Oga 2005-01-05 13:33 ` Brad Campbell @ 2005-01-05 13:36 ` Andy Smith 1 sibling, 0 replies; 172+ messages in thread From: Andy Smith @ 2005-01-05 13:36 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 928 bytes --] On Wed, Jan 05, 2005 at 05:23:19AM -0800, Alvin Oga wrote: > On Wed, 5 Jan 2005, Brad Campbell wrote: > > > Picture lpd gets swapped out on a friday night. Over the weekend > > it is not used and the drive develops a bad sector in the middle > > of the file. Monday morning I want to print and the system tries > > to page lpd back in again. *boom*. > > admin issue ... user related services like printers should NOT be on > critical servers and take down what everybody will notice due to unrelated > printer problems The problem is not that an lpd process exists, the problem is that an attempt was made to swap back in a swapped out process while the swap device is screwed. It could happen to any process. Swap is part of your virtual memory, either keep it working or don't use it at all -- you cannot expect your server to keep on working when half its virtual memory is on a device that just died. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 10:56 ` Brad Campbell 2005-01-05 11:39 ` Alvin Oga @ 2005-01-05 14:12 ` Erik Mouw 2005-01-05 14:37 ` Michael Tokarev 2005-01-05 15:17 ` Guy 2 siblings, 1 reply; 172+ messages in thread From: Erik Mouw @ 2005-01-05 14:12 UTC (permalink / raw) To: Brad Campbell; +Cc: Alvin Oga, Andy Smith, linux-raid On Wed, Jan 05, 2005 at 02:56:30PM +0400, Brad Campbell wrote: > I beg to differ on this one. Having spend several weeks tracking down > random processes dying on a machine that turned out to be a bad sector in > the swap partition, I have had great results by running swap on a RAID-1. > If you develop a bad sector in a non-mirrored swap, bad things happen > indeterminately and can be a royal PITA to chase down. It's just a little > extra piece of mind. If you have a bad block in your swap partition and the device doesn't report an error about it, no amount of RAID is going to help you against it. Erik -- +-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 -- | Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
  2005-01-05 14:12 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Erik Mouw
@ 2005-01-05 14:37 ` Michael Tokarev
  2005-01-05 14:55 ` errors " Alvin Oga
  2005-01-05 17:11 ` Erik Mouw
  0 siblings, 2 replies; 172+ messages in thread
From: Michael Tokarev @ 2005-01-05 14:37 UTC (permalink / raw)
To: Erik Mouw; +Cc: Brad Campbell, Alvin Oga, Andy Smith, linux-raid
Erik Mouw wrote:
> On Wed, Jan 05, 2005 at 02:56:30PM +0400, Brad Campbell wrote:
>
>>I beg to differ on this one. Having spend several weeks tracking down
>>random processes dying on a machine that turned out to be a bad sector in
>>the swap partition, I have had great results by running swap on a RAID-1.
>>If you develop a bad sector in a non-mirrored swap, bad things happen
>>indeterminately and can be a royal PITA to chase down. It's just a little
>>extra piece of mind.
>
> If you have a bad block in your swap partition and the device doesn't
> report an error about it, no amount of RAID is going to help you
> against it.
The drive IS reporting read errors in most cases. But that does not
help, really: the kernel swapped out some memory but can't read it back,
so things are screwed. Just like if you hot-remove a DIMM while the
system is running: the kernel loses parts of its memory, and it can't
work anymore. Depending on what was in there, of course: the whole
system may be screwed, or a single process...
The talk here isn't about "undetectable" (unreported etc) errors, but
about the fact that the error is there. And if your swap is on raid,
in case one component of the array behaves badly, another component
will continue to work, so with swap on raid the system will keep
working just fine, as if nothing happened, in case one of the "swap
components" (I mean the underlying devices) fails for whatever reason.
And please, pretty PLEASE stop talking about those mysterious
"undetectable" or "unreported" errors here. A drive that develops
"unreported" errors just does not work and should not be in use in the
first place, just like bad memory or a bad CPU: if your cpu or memory
is failing, no software trick helps and the failing part should be
replaced BEFORE even thinking about possible ways to recover.
/mjt
^ permalink raw reply [flat|nested] 172+ messages in thread
* errors Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
  2005-01-05 14:37 ` Michael Tokarev
@ 2005-01-05 14:55 ` Alvin Oga
  0 siblings, 0 replies; 172+ messages in thread
From: Alvin Oga @ 2005-01-05 14:55 UTC (permalink / raw)
To: Michael Tokarev; +Cc: linux-raid
hi ya michael
On Wed, 5 Jan 2005, Michael Tokarev wrote:
> The drive IS reporting read errors in most cases.
and by the time it reports an error ... it's too late ...
- one has to play detective and figure out why the error occurred
  and prevent it next time
- the worst possible thing to do is .. "disk seems to be dying"
  so 95% of the people say back up the disks ..
  -- it's too, too late
  -- exercising a dead/dying disk will only aggravate the problem even more
- luckily, most "disk errors" are silly errors
  - bad (cheap, not-to-spec) ide cables
    - 80 conductor vs 40 conductor
    - combining 2 different ata speed drives on the same ide cable
    - too long of an ide cable ... 18" is max
  - disk running too hot ... if it's warm to the touch, it's too hot
    ( 35C is maybe okay .. use hddtemp to see what it's running at )
  - bent disk cables
  - bad (electrical characteristics) ide drivers on the motherboard
  - ... on and on ...
  - today's drives are 1000x better quality than 5-10 years ago
> so things are screwed. Just like if you hot-remove a DIMM while the
> system is running:
those that do hot-swap of memory deserve what they get :-)
( i know you're joking ... and i've seen it done by forgetful admins
- these atx power supplies are bad for that reason, since the
  motherboard is still live, even if the pc is off, due to standby
  voltages floating around
> The talks isn't about
> "undetectable" (unreported etc) errors here, but about the fact that
> the error is here.
the trick is to find out what caused the error ...
- if you don't figure out what happened ... the problem will continue
have fun raiding
alvin
^ permalink raw reply [flat|nested] 172+ messages in thread
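The hddtemp check mentioned above is a one-liner (the drive name is a placeholder, and the drive has to expose a temperature sensor for it to work):

  hddtemp /dev/hda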
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 14:37 ` Michael Tokarev 2005-01-05 14:55 ` errors " Alvin Oga @ 2005-01-05 17:11 ` Erik Mouw 2005-01-06 5:41 ` Brad Campbell 1 sibling, 1 reply; 172+ messages in thread From: Erik Mouw @ 2005-01-05 17:11 UTC (permalink / raw) To: Michael Tokarev; +Cc: Brad Campbell, Alvin Oga, Andy Smith, linux-raid On Wed, Jan 05, 2005 at 05:37:34PM +0300, Michael Tokarev wrote: > Erik Mouw wrote: > >If you have a bad block in your swap partition and the device doesn't > >report an error about it, no amount of RAID is going to help you > >against it. > > The drive IS reporting read errors in most cases. "most cases" and "all cases" makes quite a difference. > And please, pretty PLEASE stop talking about those mysterious > "undetectable" or "unreported" errors here. A drive that develops > "unreported" errors just does not work and should not be here in > the first place, just like bad memory or CPU: if your cpu or memory > is failing, no software tricks helps and the failing part should > be replaced BEFORE even thinking about possible ways to recover. Indeed: any suspicious part should not be used in the recovery process. Erik -- +-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 -- | Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
  2005-01-05 17:11 ` Erik Mouw
@ 2005-01-06 5:41 ` Brad Campbell
  0 siblings, 0 replies; 172+ messages in thread
From: Brad Campbell @ 2005-01-06 5:41 UTC (permalink / raw)
To: Erik Mouw; +Cc: Michael Tokarev, Alvin Oga, Andy Smith, linux-raid
Erik Mouw wrote:
> On Wed, Jan 05, 2005 at 05:37:34PM +0300, Michael Tokarev wrote:
>
>>Erik Mouw wrote:
>>
>>>If you have a bad block in your swap partition and the device doesn't
>>>report an error about it, no amount of RAID is going to help you
>>>against it.
>>
>>The drive IS reporting read errors in most cases.
>
>
> "most cases" and "all cases" makes quite a difference.
>
Actually it *was* reporting *all* read errors. It was an early Maxtor 1GB
drive, and these have been notoriously bad for being problematic; *however*,
they have been frightfully good at accurately reporting exactly what was
wrong. In this case I was pretty new to the Linux sysadmin thing and never
noticed the disk errors in the syslog or correlated them with the processes
dying (it was a fair few years ago now).
I actually have never had an ATA disk develop errors it did not report.
My point remains the same. By putting your swap on a RAID (of any redundant
variety) you are increasing the chances of machine survival against disk
errors, be they single-bit errors, bad blocks or a dead drive.
Talking of Maxtor drives, I have a unit here with less than 6000 hours on it
that has started growing bad sectors at an alarming rate. All accurately
reported by SMART mind you (clever little disk), but after running a
badblocks -n on it (to really shake them loose) the reallocated sector count
has halved! Now how can a drive un-reallocate dud sectors?
Brad
^ permalink raw reply [flat|nested] 172+ messages in thread
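The SMART counters Brad refers to can be read with smartmontools; a minimal sketch (the device name is a placeholder):

  # attribute table; Reallocated_Sector_Ct is attribute 5
  smartctl -A /dev/hda
  # start a long self-test, then read the log once it finishes
  smartctl -t long /dev/hda
  smartctl -l selftest /dev/hda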
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 10:56 ` Brad Campbell 2005-01-05 11:39 ` Alvin Oga 2005-01-05 14:12 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Erik Mouw @ 2005-01-05 15:17 ` Guy 2005-01-05 15:33 ` Alvin Oga 2005-01-05 15:48 ` Peter T. Breuer 2 siblings, 2 replies; 172+ messages in thread From: Guy @ 2005-01-05 15:17 UTC (permalink / raw) To: 'Brad Campbell', 'Alvin Oga' Cc: 'Andy Smith', linux-raid I agree, but for a different reason. Your reason is new to me. I don't want a down system due to a single disk failure. Loosing the swap disk would kill the system. Maybe this is Peter's cause of frequent corruption? I mirror everything, or RAID5. Normally, no downtime due to disk failures. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Brad Campbell Sent: Wednesday, January 05, 2005 5:57 AM To: Alvin Oga Cc: Andy Smith; linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Alvin Oga wrote: > > for swap ... i personally don't see any reason to mirror > swap partitions ... > - once the system dies, ( power off ), all temp > data is useless unless one continues from a coredump > ( from the same state as when it went down initially ) I beg to differ on this one. Having spend several weeks tracking down random processes dying on a machine that turned out to be a bad sector in the swap partition, I have had great results by running swap on a RAID-1. If you develop a bad sector in a non-mirrored swap, bad things happen indeterminately and can be a royal PITA to chase down. It's just a little extra piece of mind. Regards, Brad - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 15:17 ` Guy @ 2005-01-05 15:33 ` Alvin Oga 2005-01-05 16:22 ` Michael Tokarev ` (2 more replies) 2005-01-05 15:48 ` Peter T. Breuer 1 sibling, 3 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-05 15:33 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Wed, 5 Jan 2005, Guy wrote: > I agree, but for a different reason. Your reason is new to me. .. > Loosing the swap disk would kill the system. if one is using swap space ... i'd add more memory .. before i'd use raid - swap is too slow and as you folks point out, it could die due to (unlikely) bad disk sectors in swap area > I don't want a down system due to a single disk failure. that's what raid's for :-) > I mirror everything, or RAID5. Normally, no downtime due to disk failures. the problem with mirror ( raid1 ).. or raid5 ... - if you have a bad diska ... all "bad data" will/could also get copied to the good disk - "bad data" is hard to figure out in code ... to prevent it from getting copied ... how does it know with 100% certainty - if you know why it's bad data, it's lot easier to know which data is more correct than the bad one - as everybody has pointed out .. bad data ( disk errors ) can occur for any number of gazillion reasons have fun raiding alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
  2005-01-05 15:33 ` Alvin Oga
@ 2005-01-05 16:22 ` Michael Tokarev
  2005-01-05 17:23 ` Peter T. Breuer
  0 siblings, 1 reply; 172+ messages in thread
From: Michael Tokarev @ 2005-01-05 16:22 UTC (permalink / raw)
To: linux-raid
Alvin Oga wrote:
> On Wed, 5 Jan 2005, Guy wrote:
>
>>I agree, but for a different reason. Your reason is new to me.
> ...
>>Loosing the swap disk would kill the system.
>
> if one is using swap space ... i'd add more memory .. before i'd use raid
>   - swap is too slow and as you folks point out, it could die
>   due to (unlikely) bad disk sectors in swap area
It isn't always practical. You add as much memory as needed for your
"typical workload". But there may be "spikes" of load that you have to
deal with somehow. Adding more memory to cover those "spikes" may be
too expensive. Also, if your "typical workload" requires, e.g., 2Gb of
memory, adding another, say, 2Gb to cover "spikes" means you have to
reconfigure the kernel to support a large amount of memory, which also
costs something in terms of speed on the i386 architecture. Disks are
*much* cheaper than ram in terms of money/Mb.
>>I don't want a down system due to a single disk failure.
>
> that's what raid's for :-)
>
>>I mirror everything, or RAID5. Normally, no downtime due to disk failures.
>
> the problem with mirror ( raid1 ).. or raid5 ...
>   - if you have a bad diska ... all "bad data" will/could also get
>   copied to the good disk
Again: pretty PLEASE, stop talking about those mysterious "silent
corruption/errors". Errors get detected. It is a *very* unlikely case
when an error on disk (either an inability to read, or reading the
"wrong" (aka not the same as has been written) data) will not be
detected during a read, and if you do care about those cases, you have
to use some very different hardware, with every component (CPU, memory,
buses, controllers etc etc) at least tripled, with hardware-level online
monitoring/comparing to detect errors at any level and to switch to
another component if one is "lying".
>   - "bad data" is hard to figure out in code ... to prevent it from
>   getting copied ... how does it know with 100% certainty
Nothing is 100% certain.. maybe except that we all will die sometime...
>   - if you know why it's bad data, it's lot easier to know which
>   data is more correct than the bad one
Nothing is "more correct". If the disk isn't working somehow, we know
this (as it reports errors) and kick it from the array. If the disk
"does not work silently", see above.
/mjt
^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 16:22 ` Michael Tokarev @ 2005-01-05 17:23 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-05 17:23 UTC (permalink / raw) To: linux-raid Michael Tokarev <mjt@tls.msk.ru> wrote: > Again: pretty PLEASE, stop talking about thouse mysterious "silent > corruption/errors". Errors gets detected. You confuse them here with failures (which probably get detected, but then who can say!). An error occurs when you do a sum on paper and you forget to carry one in the third column. In colloquial terms, it's a "mistake". Life carries on. A failure occurs when your brain explodes and CNN comes round and interviews your next door neignbour about the hole in their wall. > It is *very* unlikely > case when an error on disk (either unability to read, or reading > the "wrong" (aka not the same as has been written) data) will not > be detected during read. It's practically certain that it won't be "detected", because it is on disk as far as anyone and anything can tell - there would have been a failure if that were not the case. It's an ordinary datum. > , and if you do care about that cases, you > have to use some very different hardware with every component > (CPU, memory, buses, controllers etc etc) at least tripled, with > hardware-level online monitoring/comparing stuff to detect errors No, that detects errors internally (and corrects them, or else it produces "failures" that are externally visible in place of them, or else it doesn't detect them and the errors are also externally visible). > at any level and to switch to another component if one is "lying". It still leaves the errors. I don't know why everyone has such semantic problems with this! Think of an error as a "bug". You catch those bugs you can see, and don't catch the bugs you can't see. There are always bugs you can't see (hey!, if you saw them you would correct them, right? Or at least die die die). It is simply a classification. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 15:33 ` Alvin Oga 2005-01-05 16:22 ` Michael Tokarev @ 2005-01-05 16:23 ` Andy Smith 2005-01-05 16:30 ` Andy Smith 2005-01-05 17:04 ` swp - " Alvin Oga 2005-01-05 17:07 ` Guy 2 siblings, 2 replies; 172+ messages in thread From: Andy Smith @ 2005-01-05 16:23 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 1084 bytes --] On Wed, Jan 05, 2005 at 07:33:55AM -0800, Alvin Oga wrote: > On Wed, 5 Jan 2005, Guy wrote: > > > I agree, but for a different reason. Your reason is new to me. > .. > > Loosing the swap disk would kill the system. > > if one is using swap space ... i'd add more memory .. before i'd use raid > - swap is too slow and as you folks point out, it could die > due to (unlikely) bad disk sectors in swap area This again is besides the point. As I said already swap is part of your virtual memory and if you can't keep it available you should not be using it. It is the same as saying that your machine will die if half its RAM suddenly ceased to exist while it was running. Your recommendations so far have been "don't run processes which might swap" and "don't let swap be used". If you yourself are able to keep to both of these 100% of the time may I ask why you yourself have any swap at all? ometimes) or those of us in the real world where swap is desirable and may (sometimes) be used, a swap device failure *will* *take* *the* *machine* *down*. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 16:23 ` Andy Smith @ 2005-01-05 16:30 ` Andy Smith 2005-01-05 17:04 ` swp - " Alvin Oga 1 sibling, 0 replies; 172+ messages in thread From: Andy Smith @ 2005-01-05 16:30 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 453 bytes --] On Wed, Jan 05, 2005 at 04:23:37PM +0000, Andy Smith wrote: > ometimes) or those of us in the real world where swap is desirable > and may (sometimes) be used, a swap device failure *will* *take* > *the* *machine* *down*. Something strange happened with vim there. That was meant to read: "For those of us in the real world where swap is (sometimes) desirable and may (sometimes) be used, a swap device failure *will* *take* *the* *machine* *down*." [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 16:23 ` Andy Smith 2005-01-05 16:30 ` Andy Smith @ 2005-01-05 17:04 ` Alvin Oga 2005-01-05 17:26 ` Andy Smith 1 sibling, 1 reply; 172+ messages in thread From: Alvin Oga @ 2005-01-05 17:04 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid On Wed, 5 Jan 2005, Andy Smith wrote: > On Wed, Jan 05, 2005 at 07:33:55AM -0800, Alvin Oga wrote: > > > > if one is using swap space ... i'd add more memory .. before i'd use raid > > - swap is too slow and as you folks point out, it could die > > due to (unlikely) bad disk sectors in swap area ... > Your recommendations that'd be your comment ... that is not whati said above and all i said, was use memory before you use swap on disks and if you are at the 2GB lmit of your mb ... you obviously don't have a choice - if you are using 128MB of memory and using 2GB of swap and bitching about a slow box or system crashing due to swap... - that was the point ... add more memory - and i don't think anybody is idiotic enough to add more memory for the "spikes" in the workload > so far have been "don't run processes which > might swap" and "don't let swap be used". If you yourself are able > to keep to both of these 100% of the time may I ask why you yourself > have any swap at all? you're twisting things again... but no problem.. have fun doing that > ometimes) or those of us in the real world where swap is desirable > and may (sometimes) be used, a swap device failure *will* *take* > *the* *machine* *down*. my real world is obviously better .... since we do NOT suffer from these mysterious disks crashes we go to fix peoples problems c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 17:04 ` swp - " Alvin Oga @ 2005-01-05 17:26 ` Andy Smith 2005-01-05 18:32 ` Alvin Oga 0 siblings, 1 reply; 172+ messages in thread From: Andy Smith @ 2005-01-05 17:26 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 3177 bytes --] On Wed, Jan 05, 2005 at 09:04:57AM -0800, Alvin Oga wrote: > On Wed, 5 Jan 2005, Andy Smith wrote: > > > On Wed, Jan 05, 2005 at 07:33:55AM -0800, Alvin Oga wrote: > > > > > > if one is using swap space ... i'd add more memory .. before i'd use raid > > > - swap is too slow and as you folks point out, it could die > > > due to (unlikely) bad disk sectors in swap area > ... > > > Your recommendations > > that'd be your comment ... that is not whati said above Direct quote: i'd add more memory before i'd use raid > and all i said, was use memory before you use swap on disks Which means what? Who is there on this list who likes to use swap *before* physical memory? Do all your machines which you believe have adequate memory also have no swap configured? Fact is if your machine has swap configured then that swap is part of your virtual memory and if a device that is providing part of your virtual memory suddenly fails then your machine is going down. > and if you are at the 2GB lmit of your mb ... you obviously don't > have a choice > - if you are using 128MB of memory and using 2GB of swap > and bitching about a slow box or system crashing due to swap... > - that was the point ... add more memory No one here is bitching about a slow machine due to swap usage and if they were I'd wonder why they are doing it on linux-raid. I repeat, if "add more memory" is your answer to "swapping on a single device which then dies kills my machine" then does that mean that your machines are configured with no swap? All people are saying is that if you don't mirror swap then disk failures cause downtime. You are replying to add more memory and don'trun things that can get swapped. Those don't seem like very useful recommendations. > - and i don't think anybody is idiotic enough to add > more memory for the "spikes" in the workload OK so you must add swap to handle those spikes which means either you are happy that your machine will crash should the device that the swap is on die, or you use swap on a mirror to try to mitigate that. > > so far have been "don't run processes which > > might swap" and "don't let swap be used". If you yourself are able > > to keep to both of these 100% of the time may I ask why you yourself > > have any swap at all? > > you're twisting things again... but no problem.. have fun doing that From previous email (not a direct quote): Don't run lpd on user-accessible machine (after given an example of some lpd process which gets swapped out and back in again). > > ometimes) or those of us in the real world where swap is desirable > > and may (sometimes) be used, a swap device failure *will* *take* > > *the* *machine* *down*. > > my real world is obviously better .... since we do NOT suffer > from these mysterious disks crashes If you do not suffer disk crashes then why do you use mirrors at all? If you do suffer disk crashes then any disk that is providing swap may very well cause the machine to crash. I thought everyone suffered disk crashes and that was the point of RAID. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 17:26 ` Andy Smith @ 2005-01-05 18:32 ` Alvin Oga 2005-01-05 22:35 ` Andy Smith 0 siblings, 1 reply; 172+ messages in thread From: Alvin Oga @ 2005-01-05 18:32 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid hi ya andy On Wed, 5 Jan 2005, Andy Smith wrote: > > > Your recommendations > > > > that'd be your comment ... that is not whati said above > > Direct quote: > > i'd add more memory before i'd use raid i see ... when i say "i would use" ... i dont mean or imply others to do so .. if i do mean "you" should, or i strongly recommend ... i usually explicitly say so > > and all i said, was use memory before you use swap on disks > > Which means what? Who is there on this list who likes to use swap > *before* physical memory? you'd be surprized how many people wonder why their system is slow and they using 100% swap ( hundreds of MB of swap ) and little memory beause its $50 of expensive memory > Do all your machines which you believe have adequate memory also > have no swap configured? yes .... in my machines .. they are all mostly tuned for specfic tasks ... very little ( say 0.05% swap usage even at peak ) > Fact is if your machine has swap configured then that swap is part > of your virtual memory and if a device that is providing part of > your virtual memory suddenly fails then your machine is going down. yup... and in some systems ... having swap ( too slow ) is NOT an option ... ( embedded systems, realtime operations, ... ) > I repeat, if "add more memory" is your answer to "swapping on a > single device which then dies kills my machine" then does that mean > that your machines are configured with no swap? i'm NOT the one having hardware problems of any sort ... i'm stating people that are having problems usually have problems because of many possible reasons ... - most common one is, that it's a machine just slapped together with parts on sales from far away website vendors - if the machine cannot handle swapp ... to one partiton or or swap on other disks, than they have a serious hardware problem that they bought pieces of hw junk vs buying good parts from known good vendors == == you should NEVER have swap problems == otherwise, toss that system out or salvage what you can == > All people are saying is that if you don't mirror swap then disk > failures cause downtime. yes ... and i'm just saying, if you dont know why swapp failed or that if downtime is important, use better quality hardware and install it properly ... maintain it properly ... if you forget about it .. the disk ... the disk will get lonely too will forget about holding its 1's and 0's too > You are replying to add more memory and > don'trun things that can get swapped. you're twisting things gain > Those don't seem like very useful recommendations. that is becuase you're twisting things so you can make comments i didnt say > > - and i don't think anybody is idiotic enough to add > > more memory for the "spikes" in the workload > > OK so you must add swap to handle those spikes which means either > you are happy that your machine will crash should the device that > the swap is on die, or you use swap on a mirror to try to mitigate > that. this is pointless isnt it ... are you an idiot or what .. 
do you like to put words and recommendations that i didnt say and twist it in your favor > > From previous email (not a direct quote): > > Don't run lpd on user-accessible machine you obviously do NOT understand where to run lpd and where to run dns and where to run pop and where to run ssh and where to run mta ... and on and on and on ... > If you do not suffer disk crashes then why do you use mirrors at > all? i do NOT use mirrors for the protection against disk failures nobody said i did .. you again are twisting things into your own ideas and misconceptions > If you do suffer disk crashes then any disk that is providing > swap may very well cause the machine to crash. i dont have that problem ... but if someone ddid have that problem ... i assuem they are smart enough to figure out within 5-10 minutes why the disk/system is crashing > I thought everyone suffered disk crashes and that was the point of > RAID. not everybody suffers disk crashes ... in such great numbers that raid is better solution ... - raid is NOT theonly solution == ever hear of high availability ... = clusters ... = even plain ole backups there are more than one solution to one disk crashing c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 18:32 ` Alvin Oga @ 2005-01-05 22:35 ` Andy Smith 2005-01-06 0:57 ` Guy 0 siblings, 1 reply; 172+ messages in thread From: Andy Smith @ 2005-01-05 22:35 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 7865 bytes --] On Wed, Jan 05, 2005 at 10:32:37AM -0800, Alvin Oga wrote: > > hi ya andy > > On Wed, 5 Jan 2005, Andy Smith wrote: > > > > > Your recommendations > > > > > > that'd be your comment ... that is not whati said above > > > > Direct quote: > > > > i'd add more memory before i'd use raid > > i see ... when i say "i would use" ... i dont mean or imply > others to do so .. That is not a very useful thing to say in response to someone who suggests using RAID mirrors for swap. > > > and all i said, was use memory before you use swap on disks > > > > Which means what? Who is there on this list who likes to use swap > > *before* physical memory? > > you'd be surprized how many people wonder why their system is slow > and they using 100% swap ( hundreds of MB of swap ) and little memory > beause its $50 of expensive memory What you said was (and I quoted it for you) "use memory before you use swap". I am asking you if this also is a useful thing to say, because I was not aware there was anyone who would prefer to use swap before memory. Unless you simply mean "add more memory" which again would be a strange thing to say in a discussion about putting swap on a RAID mirror -> implies "always have enough RAM, configure no swap" but since you seem to have a great objection to me trying to make sense of what you're saying I won't go any further stating what I think you imply. > > Do all your machines which you believe have adequate memory also > > have no swap configured? > > yes .... in my machines .. they are all mostly tuned for specfic > tasks ... very little ( say 0.05% swap usage even at peak ) So "no" then as I asked if your machines had any swap configured and you reply that they use 0.05% at peak. To use 0.05% they need to have swap configured. So your machines do have swap configured, that is what the question was. > > Fact is if your machine has swap configured then that swap is part > > of your virtual memory and if a device that is providing part of > > your virtual memory suddenly fails then your machine is going down. > > yup... > > and in some systems ... having swap ( too slow ) is NOT an option ... > ( embedded systems, realtime operations, ... ) That's great but it's irrelevant to this discussion since the situation discussed is swap on mirror or not. Whether swap is required or beneficial is an admin issue outside the scope of RAID. We can assume that the admin ahs already determined that configuring some amount of swap is required, otherwise we cannot assume anything about their setup and will be telling them how to lay out their filesystems, what distribution to use, etc. etc.. There are some things that just don't need to be said and on a RAID list, "add more memory, swap is slow" is IMHO one of them. > == you should NEVER have swap problems > == otherwise, toss that system out or salvage what you can > == If you use RAID-1 (-4, -5,- 10, etc.) then I think you are acknowledging that your disks are a cause for concern for you. Otherwise no one would be using RAID. 
If you would not be willing to say the following: you should NEVER have filesystem problems otherwise, toss that system out or salvage what you can then I don't understand why you are willing to say it when it comes to swap and not a filesystem. Disks fail, it's why we're here on this list. > > You are replying to add more memory and > > don'trun things that can get swapped. > > you're twisting things gain All I've got to work with is what you're giving me. Direct quotes of yours lead me to believe that these are your points. In a discussion about swap on RAID you have clearly stated to add more RAM, use higher quality components, not run certain userland processes. If you say these things in a discussion about swap on RAID then I'm left to believe either that you see these things as an alternative to swap on RAID or else you are just posting irrelevant basic admin tips that have nothing to do with linux RAID. > > Those don't seem like very useful recommendations. > > that is becuase you're twisting things so you can make > comments i didnt say Then please explain yourself better using points that are relevant to RAID using md on Linux. > > > - and i don't think anybody is idiotic enough to add > > > more memory for the "spikes" in the workload > > > > OK so you must add swap to handle those spikes which means either > > you are happy that your machine will crash should the device that > > the swap is on die, or you use swap on a mirror to try to mitigate > > that. > > this is pointless isnt it ... > > are you an idiot or what .. It's possible to disagree and debate while still being civil. If you don't have that skill then I'm not willing to spend time teaching you it. > do you like to put words and recommendations that i didnt say > and twist it in your favor No, I like understanding your point but at the moment all I can see is basic unix admin recommendations that are unrelated to the discussion at hand - you admit you have swap configured on your servers so all the other stuff you have mentioned is irrelevant. So help me understand. I don't understand how you intend to survive a disk failure involving your swap unless you put it on a mirror or configure no swap or never have a disk failure, ever, or intend to have downtime when you do have a disk failure. So help me understand. > > From previous email (not a direct quote): > > > > Don't run lpd on user-accessible machine > > you obviously do NOT understand where to run lpd and where > to run dns and where to run pop and where to run ssh and where > to run mta ... and on and on and on ... You have no idea whether I do or do not know these things and on this list you never will, because placement of these things is irrelevant to linux raid. > > If you do not suffer disk crashes then why do you use mirrors at > > all? > > i do NOT use mirrors for the protection against disk failures > > nobody said i did .. you again are twisting things into your > own ideas and misconceptions Well indeed, if you never suffer disk failures as you go on to say then I imagine you don't use mirrors for that.. > > If you do suffer disk crashes then any disk that is providing > > swap may very well cause the machine to crash. > > i dont have that problem ... but if someone ddid have that > problem ... 
i assuem they are smart enough to figure out > within 5-10 minutes why the disk/system is crashing Such a person, if they were not as fortunate and/or skiled at purchasing hardware as you appear to be, and so happened to suffer the occasional disk failure, may wish that their system did not crash and suffer 5 to 10 minutes of downtime because a disk failed. They may wish that the device would just fail and they could schedule a downtime or hot swap it. If you are not one of those people (and, since you say you don't have disk failures then you wouldn't be), it doesn't mean those people don't exist and don't have a valid requirement. > > I thought everyone suffered disk crashes and that was the point of > > RAID. > > not everybody suffers disk crashes ... in such great numbers > that raid is better solution ... > - raid is NOT theonly solution > > == ever hear of high availability ... > = clusters ... > = even plain ole backups > > there are more than one solution to one disk crashing Is there any reason why someone could not use swap on raid as well as any of the above as they see fit? Maybe those who advocate swap on raid already use high availablity and clusters. And we'd certainly hope they have backups. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
2005-01-05 22:35 ` Andy Smith
@ 2005-01-06  0:57 ` Guy
2005-01-06  1:28 ` Mike Hardy
2005-01-06  5:01 ` Alvin Oga
0 siblings, 2 replies; 172+ messages in thread
From: Guy @ 2005-01-06  0:57 UTC (permalink / raw)
To: 'Andy Smith', linux-raid

Going off topic, but...

I come from an HP-UX environment. With HP-UX, I have always had to configure swap space, maybe for only one reason: if you used shared memory, you had to have an equal amount of swap space available. We always used shared memory (shmget). It was called a "backing store", if I recall. Including Informix, we used almost the hardware (OS) limit of about 1.7 Gig of shared memory on some systems, so we needed at least 1.7 Gig of swap space. We would allocate 3-4 Gig I think, just to be safe. The swap space would be allocated (reserved), but not used, unless you really ran low on RAM.

So, the question is: assume I have enough RAM to never need to swap. Do I need any swap space with Linux?

Silly story: I once had a 486 with 4 Meg of RAM, running Windows 3.1 I think. I had 8 Meg of swap space, so 12 Meg total available virtual memory. One day I added 16 Meg of RAM, so now I had 20 Meg of RAM, and I deleted my swap space. Everyone told me I needed 20-40 Meg of swap space now! Swap space should be 2 times RAM size. How crazy; my memory requirements did not change, just the amount of memory. I used that system for a year or so like that. Go figure!

Guy

^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-06 0:57 ` Guy @ 2005-01-06 1:28 ` Mike Hardy 2005-01-06 3:32 ` Guy 2005-01-06 5:04 ` Alvin Oga 2005-01-06 5:01 ` Alvin Oga 1 sibling, 2 replies; 172+ messages in thread From: Mike Hardy @ 2005-01-06 1:28 UTC (permalink / raw) To: linux-raid Guy wrote: > So, the question is: > Assume I have enough RAM to never need to swap. > Do I need any swap space with Linux? This has been hashed out at great length on linux-kernel - with a few entrenched positions emerging, if I recall correctly. There are those that think if they size the machine correctly, they shouldn't need swap, and they're right. > I had 8 Meg of swap space. So 12 meg total available virtual memory. > One day I added 16 Meg of RAM. So now I had 20 Meg of RAM. I deleted my > swap space. Everyone told me I needed 20-40 Meg of swap space now! Swap > space should be 2 time RAM size. How crazy, my memory requirements did not > change, just the amount of memory. I used that system for a year or so like > that. Go figure! This story pretty much sums up that position, and I've certainly been in that position myself. There are others (notably the maintainer of part of that code - Andrea Arcangeli if I recall correctly, though I apologize if that's a misattribution) who believe you should always have swap because it will allow your system to have higher throughput. If you process a large I/O, for instance, the kernel can swap out live processes to devote more RAM to VFS caching. That hurts latency though, and the nightly slocate run is a pathlogical example of this. You wake up in the morning and your machine is crawling while it swaps everything back in. There's a "swappiness" knob you can twiddle in /proc or /sys to alter this, but the general idea is that you can improve throughput on a machine by having swap even if there is enough ram for all processes I've got machines with and without swap, but I typically run all servers with swap to handle ram-usage spikes I'm not expecting (you never know) while I run my laptop without swap when possible to avoid latency issues with swappiness. As with all things, its policy decisions and tradeoffs :-) -Mike ^ permalink raw reply [flat|nested] 172+ messages in thread
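Mike's "swappiness" knob can be inspected and tuned from userspace; the following is only an illustrative sketch (assuming a 2.6-era kernel with /proc/sys mounted, and the value 10 is an arbitrary example, not a recommendation made anywhere in this thread):

    # show the current setting (the 2.6 default is 60)
    cat /proc/sys/vm/swappiness
    # make the VM less eager to swap anonymous pages out, until next reboot
    sysctl -w vm.swappiness=10
    # equivalent form
    echo 10 > /proc/sys/vm/swappiness
    # to persist across reboots, add "vm.swappiness = 10" to /etc/sysctl.conf

Lower values bias the kernel toward keeping process pages resident at the expense of page cache; higher values favour the throughput behaviour described above.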
* RE: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
2005-01-06  1:28 ` Mike Hardy
@ 2005-01-06  3:32 ` Guy
2005-01-06  4:49 ` Mike Hardy
2005-01-06  5:04 ` Alvin Oga
1 sibling, 1 reply; 172+ messages in thread
From: Guy @ 2005-01-06  3:32 UTC (permalink / raw)
To: 'Mike Hardy', linux-raid

Ok, good answer. Thanks. I understand the "spikes" issue. I really do plan to have swap space all the time, but it is good to know it is optional if you have enough RAM.

Maybe the concept of swapping is becoming obsolete? In the past, some systems had hardware limits, like 64K, 640K, 16M, 64M. You had no choice but to swap stuff out, or use overlays. I have worked on systems where the app had to swap chunks of memory to disk and back; there was no OS support for virtual memory. We had fixed head disks to give better performance. I even recall hardware that could bank switch sections of memory to break the 640K limit! Even IBM mainframes had a 256Meg limit, but could go beyond that by bank switching.

But now hardware has caught up to programmers. Most systems can support more RAM than most programs need. With PC based systems (and some others I assume), 1 Gig of RAM is very cheap! But I did note someone said you had to do something to break the 2Gig boundary in Linux with x86 based systems. That's too bad. In a few years, most people will be in the 64 bit world I guess. I hope that corrects the 2Gig boundary. The next boundary (8,589,934,592 Gig) should be well beyond anything we would ever need, well, for at least a few more years! :) But then we could go unsigned and double that! Of course, disks will be even bigger! I can't wait!

Of course you can't put 8,589,934,592 Gig of RAM in a 64 bit computer today. If you consider Moore's Law for RAM ("transistors on integrated circuits tend to double every 18 to 24 months"), and assume we are at about 36 bits today (I think 64 Gig is about as much as is reasonable today), we should reach the 64 bit limit in 42-56 years. Maybe it will be a long wait? :)

Guy

^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
2005-01-06  3:32 ` Guy
@ 2005-01-06  4:49 ` Mike Hardy
2005-01-09 21:07 ` Mark Hahn
0 siblings, 1 reply; 172+ messages in thread
From: Mike Hardy @ 2005-01-06  4:49 UTC (permalink / raw)
To: Guy; +Cc: linux-raid

Guy wrote:
> Maybe the concept of swapping is becoming obsolete?

I think it's definitely headed that way, if not already there for most new systems

> to programmers. Most systems can support more RAM than most programs need.
> With PC based systems (and some others I assume), 1 Gig of RAM is very
> cheap! But I did note someone said you had to do something to break the
> 2Gig boundary in Linux with x86 based systems. That's too bad. In a few

To go bigger than 2GB on x86 you have to enable PAE (Physical Address Extension), which does have a speed hit and so is sub-optimal. Additionally, you still have limits on the amount of RAM any process can address, and it's more policy decisions and tradeoffs.

Even with 64-bit, in my professional world (Java programming) I've had occasion to want 4GB heaps to work with for caching purposes, and programs are only just getting there. I think you're right then, the hardware and software have just met each other in the last year or so except for the edge cases. Bandwidth is the usual limiter at this point :-)

> today (I think 64 Gig is about as much as is reasonable today). We should
> reach the 64 bit limit in 42-56 years. Maybe it will be a long wait? :)

:-)

To make this more on topic, I'll say that I just switched all my servers (which had raid1 /boot and / but parallel JBOD swap) to mirrored swap after this discussion. All using simple mdadm/swapon/swapoff commands while they were humming, thanks to all the author's work. The servers are all doing fine, post-change.

As further penance for taking up everyone's time, I'll point to the specific section of the software raid howto that discusses swapping on raid:

http://www.tldp.org/HOWTO/Software-RAID-HOWTO-2.html#ss2.3

They succinctly say what was posted here - which is that you don't swap on raid for performance, but if you're looking for single-hdd machine survivability, swapping on raid1 is the only way.

Cheers-
-Mike

^ permalink raw reply [flat|nested] 172+ messages in thread
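The mdadm/swapon/swapoff switch-over Mike describes can be done on a running system; a rough sketch follows, in which every device name (/dev/md3, /dev/sda5, /dev/sdb5, /dev/sda6) is a placeholder rather than anything taken from his mail:

    # assemble a RAID1 mirror from two equal-sized swap partitions
    mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5
    # write a swap signature to the mirror and enable it
    mkswap /dev/md3
    swapon /dev/md3
    # retire the old un-mirrored swap; its in-use pages are pulled back into memory
    swapoff /dev/sda6
    # finally, point the swap entry in /etc/fstab at /dev/md3

Because swapoff has to bring every in-use page back in, the machine needs enough free RAM (or headroom on the new swap device) to absorb those pages while the change is made.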
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-06 4:49 ` Mike Hardy @ 2005-01-09 21:07 ` Mark Hahn 0 siblings, 0 replies; 172+ messages in thread From: Mark Hahn @ 2005-01-09 21:07 UTC (permalink / raw) To: linux-raid > > Maybe the concept of swapping is becoming obsolete? > > I think its definitely headed that way, if not already there for most > new systems take a deep breath. this is manifestly untrue. in the current era, swap is a way to optimize the use of memory. please don't even think of its original meaning as a place to write out a process's complete VM in order to switch to another process's VM. swap is where the kernel puts less-used pages. having lots of memory does NOT magically mean that all pages are equally well-used. if you do actually have so much physical memory that the kernel never has any shortage of memory, well, bully for you, you've wasted big bucks. and it is waste. consider the page of my Firefox browser that contains code for interpreting devanagari unicode glyphs. it doesn't get much use, and so it should get swapped out. writing out a swap page is basically free, and it means I have one more page available for something more useful. since my frequency of reading devanagari pages is sort of low, that page can be used to cache, say, http://images.slashdot.org/title.gif. swapping is how the kernel optimizes space used by idle anonymous pages. if you understand that, it says absolutely everything you need to know: that swapping will always have some value, that it depends on high variance in the "temperature" of pages (and thus the concept of kernel memory "pressure"), that it doesn't apply to a system with no dirty anonymous pages. you can refuse to do this optimization, and your system will run poorer. the kernel can do this optimization poorly, and thus slow down, not speed up. you can waste your money on so much ram that the kernel never experiences memory pressure (but you'd be surprised how much ram that would require!) swapping is nearly free, since doing an async write doesn't slow anything else down (assuming good page choice, otherwise lightly loaded disks, nothing misconfigured like PIO.) swapping is a recognition that disk is drastically cheaper than ram. swapping does mean that idle anonymous pages could be corrupted by a flakey storage subsystem - but if that happens, you have other serious issues - what does it say about the pages comprising the text of your applications? remember that a big, hard-pressed system might have, say, a gigabyte of swap in use, but even a small desktop will have 50x that much in files exposed to the same hypothetical corruption. yes, I configure my systems with un-raided swap, because if disks are *that* flakey, I want to know about it. finally, ram prices (per GB) have been mostly stable for a couple years now. it's true that ram is faster, and systems are slowly increasing in ram size, but nothing dramatic is on the horizon. I run a supercomputer center that is buying a large amount of new hardware now. the old systems are almost 4 years old and have 1GB/cpu (4-way alphas). new systems will average about 4GB/cpu, and only a few of our users (HPC stuff way beyond desktops) would like more than 8GB/cpu. all of the ~6K cpus we buy will be in systems that have swap. regards, mark hahn. ^ permalink raw reply [flat|nested] 172+ messages in thread
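The behaviour Mark describes, idle anonymous pages trickling out to swap without any memory crisis, is easy to observe; a small illustration, not part of his post:

    # report swap traffic every 5 seconds: 'si' = swapped in, 'so' = swapped out
    vmstat 5
    # SwapTotal/SwapFree/SwapCached give the current totals
    grep -i swap /proc/meminfo

Occasional small 'so' bursts with 'si' staying near zero are the cheap, write-mostly swapping he refers to; sustained large values in both columns are the thrashing case everyone wants to avoid.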
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-06 1:28 ` Mike Hardy 2005-01-06 3:32 ` Guy @ 2005-01-06 5:04 ` Alvin Oga 2005-01-06 6:18 ` Guy ` (2 more replies) 1 sibling, 3 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-06 5:04 UTC (permalink / raw) To: Mike Hardy; +Cc: linux-raid On Wed, 5 Jan 2005, Mike Hardy wrote: > There are those that think if they size the machine correctly, they > shouldn't need swap, and they're right. bingo > > I had 8 Meg of swap space. So 12 meg total available virtual memory. > > One day I added 16 Meg of RAM. So now I had 20 Meg of RAM. I deleted my > > swap space. Everyone told me I needed 20-40 Meg of swap space now! Swap > > space should be 2 time RAM size. How crazy, my memory requirements did not > > change, just the amount of memory. I used that system for a year or so like > > that. Go figure! the silly rule of 2x size of RAM == swap space came from the old days when memory was 10x the costs of disks or some silly cost performance that it made sense when grandpa was floating around by todays ram and disk pricing ... and cpu speeds ...2x memory sorta goes out the door c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
2005-01-06  5:04 ` Alvin Oga
@ 2005-01-06  6:18 ` Guy
2005-01-06  6:31 ` Alvin Oga
2005-01-06  9:38 ` swap on RAID (was Re: swp - Re: ext3 journal on software raid) Andy Smith
2005-01-09 21:21 ` swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Mark Hahn
2 siblings, 1 reply; 172+ messages in thread
From: Guy @ 2005-01-06  6:18 UTC (permalink / raw)
To: 'Alvin Oga', 'Mike Hardy'; +Cc: linux-raid

HAY! DON'T CALL ME GRANDPA!!!! Not for a few more years at least! :)

And I was there! And beyond! My first home computer had a 4K card and an 8K card, if I recall. It was 6800 based. I wonder if I still have it? I did use punch cards in college! Some computers had core memory, but no tube based computers. I am not that old!!!

In the days of my 486-33, I paid $400 for 16Meg of RAM. The price doubled just after that when the only epoxy plant burned down or something. Now my video card has 64Meg of RAM! My first hard disk to break the $1 per meg boundary was a Maxtor 340Meg, cost me about $320. Now we are under the $1 per Gig boundary. Today, disk drives are so cheap you get 1 in a happy meal. :)

Today, I guess disk space is 50-100 times cheaper per gig than RAM. Ouch! At the prices listed above, 1 gig of RAM would cost $25,000, and a 250Gig disk would cost $235,294. The price of RAM has dropped by 250 times. The price of disk drives dropped by over 1000 times. The future is going to be so cool!

You said memory was 10x the cost of disks. In the example above, memory is 25x more than disk. Today it is about 100x. Maybe we should be swapping even more, now? :)

Guy

^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-06 6:18 ` Guy @ 2005-01-06 6:31 ` Alvin Oga 0 siblings, 0 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-06 6:31 UTC (permalink / raw) To: Guy; +Cc: linux-raid hi ya guy On Thu, 6 Jan 2005, Guy wrote: > HAY! DON'T CALL ME GRANDPA!!!! Not for a few more years at least! wasn't calling ya grandpa :-) > And I was there! And beyond! My first home computer had a 4K card and an > 8K card, if I recall. It was 6800 based. I wonder if I still have it? I ahh .. that's 3rd generation stuff ... 4004 and Z8 and lots o no-namebrand what-you-ma-call-it-kits and i forgot what predated 6800 in motorola/zilog besides z8 .. getting too old > You said memory was 10x the cost of disks. In the example above, memory is > 25x more than disk. Today it is about 100x. 500x - 1000x in my book but that depends on how one counts > Maybe we should be swapping even more, now? :) disks are cheaper per byte of storage ??? 500MB for $50 250,000MB for $250 ( 250GB disks ) i leave it to you math wiz folks to generate $$$/MB c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
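Worked out, the $$$/MB figures Alvin leaves as an exercise (using his own example prices) come to:

    RAM:  $50  / 500 MB      = $0.100 per MB
    disk: $250 / 250,000 MB  = $0.001 per MB
    ratio: 0.100 / 0.001     = 100

So at those particular prices disk is roughly 100 times cheaper per megabyte than RAM, closer to Guy's 50-100x estimate than to 500-1000x, though the ratio obviously shifts with the parts chosen.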
* swap on RAID (was Re: swp - Re: ext3 journal on software raid) 2005-01-06 5:04 ` Alvin Oga 2005-01-06 6:18 ` Guy @ 2005-01-06 9:38 ` Andy Smith 2005-01-06 17:46 ` Mike Hardy 2005-01-07 1:31 ` confused Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid) Alvin Oga 2005-01-09 21:21 ` swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Mark Hahn 2 siblings, 2 replies; 172+ messages in thread From: Andy Smith @ 2005-01-06 9:38 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 856 bytes --] On Wed, Jan 05, 2005 at 09:04:00PM -0800, Alvin Oga wrote: > On Wed, 5 Jan 2005, Mike Hardy wrote: > > > There are those that think if they size the machine correctly, they > > shouldn't need swap, and they're right. > > bingo Yet in an earlier reply you state that your machines have swap configured. There is a difference between these two statements: a) I need some amount of swap configured but don't expect a significant swap usage under normal load. b) I expect my server to always be using a significant amount of swap. I interpret your views as (a) but I interpret Mike's as "there are people who are saying that a correctly sized machine can have zero swap configured." Having no swap configured and merely using no swap in normal circumstances are very very different situations. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
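The difference Andy is drawing, between swap that is merely configured and swap that is actually in use, can be checked on any running box; a quick illustration, not taken from his mail:

    # list configured swap devices and how much of each is used
    swapon -s
    # the "Swap:" line shows total, used and free swap in megabytes
    free -m

Case (a) looks like a non-zero swap total with "used" at or near zero; case (b) shows a large and persistent "used" figure.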
* Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid)
2005-01-06  9:38 ` swap on RAID (was Re: swp - Re: ext3 journal on software raid) Andy Smith
@ 2005-01-06 17:46 ` Mike Hardy
2005-01-06 22:08 ` No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid)) Andrew Walrond
0 siblings, 1 reply; 172+ messages in thread
From: Mike Hardy @ 2005-01-06 17:46 UTC (permalink / raw)
To: linux-raid

Andy Smith wrote:

> I interpret your views as (a) but I interpret Mike's as "there are
> people who are saying that a correctly sized machine can have zero
> swap configured."
>
> Having no swap configured and merely using no swap in normal
> circumstances are very very different situations.

You are correct that I was getting at the zero swap argument - and I agree that it is vastly different from simply not expecting it. It is important to know that there is no inherent need for swap in the kernel though - it is simply used as more "memory" (albeit slower, and with some optimizations to work better with real memory) and if you don't need it, you don't need it.

That said, I mentioned my servers run with swap for the same reason I run with raid. I don't plan on having a disk fail very often (if at all), and I don't plan on needing swap very often (if at all), but when it happens, I expect my machine to keep running. (or at least I hope it does)

-Mike

^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid))
2005-01-06 17:46 ` Mike Hardy
@ 2005-01-06 22:08 ` Andrew Walrond
2005-01-06 22:34 ` Jesper Juhl
0 siblings, 1 reply; 172+ messages in thread
From: Andrew Walrond @ 2005-01-06 22:08 UTC (permalink / raw)
To: linux-raid; +Cc: linux-kernel

On Thursday 06 January 2005 17:46, Mike Hardy wrote:
>
> You are correct that I was getting at the zero swap argument - and I
> agree that it is vastly different from simply not expecting it. It is
> important to know that there is no inherent need for swap in the kernel
> though - it is simply used as more "memory" (albeit slower, and with
> some optimizations to work better with real memory) and if you don't
> need it, you don't need it.
>

If I recollect a recent thread on LKML correctly, your 'no inherent need for swap' might be wrong.

I think the gist was this: the kernel sometimes needs to move bits of memory in order to free up dma-able ram, or lowmem. If I recall correctly, the kernel can only do this move via swap, even if there are stacks of free (non-dmaable or highmem) memory.

I distinctly remember the moral of the thread being "Always mount some swap, if you can"

This might have changed though, or I might have got it completely wrong. - I've cc'ed LKML in case somebody more knowledgeable can comment...

Andrew Walrond

^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid)) 2005-01-06 22:08 ` No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid)) Andrew Walrond @ 2005-01-06 22:34 ` Jesper Juhl 2005-01-06 22:57 ` Mike Hardy 0 siblings, 1 reply; 172+ messages in thread From: Jesper Juhl @ 2005-01-06 22:34 UTC (permalink / raw) To: Andrew Walrond; +Cc: linux-raid, linux-kernel On Thu, 6 Jan 2005, Andrew Walrond wrote: > On Thursday 06 January 2005 17:46, Mike Hardy wrote: > > > > You are correct that I was getting at the zero swap argument - and I > > agree that it is vastly different from simply not expecting it. It is > > important to know that there is no inherent need for swap in the kernel > > though - it is simply used as more "memory" (albeit slower, and with > > some optimizations to work better with real memory) and if you don't > > need it, you don't need it. > > > > If I recollect a recent thread on LKML correctly, your 'no inherent need for > swap' might be wrong. > > I think the gist was this: the kernel can sometimes needs to move bits of > memory in order to free up dma-able ram, or lowmem. If I recall correctly, > the kernel can only do this move via swap, even if there is stacks of free > (non-dmaable or highmem) memory. > > I distinctly remember the moral of the thread being "Always mount some swap, > if you can" > > This might have changed though, or I might have got it completely wrong. - > I've cc'ed LKML incase somebody more knowledgeable can comment... > http://kerneltrap.org/node/view/3202 -- Jesper Juhl ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid)) 2005-01-06 22:34 ` Jesper Juhl @ 2005-01-06 22:57 ` Mike Hardy 2005-01-06 23:15 ` Guy 0 siblings, 1 reply; 172+ messages in thread From: Mike Hardy @ 2005-01-06 22:57 UTC (permalink / raw) To: Jesper Juhl; +Cc: Andrew Walrond, linux-raid, linux-kernel Jesper Juhl wrote: > On Thu, 6 Jan 2005, Andrew Walrond wrote: > > >>On Thursday 06 January 2005 17:46, Mike Hardy wrote: >> >>>You are correct that I was getting at the zero swap argument - and I >>>agree that it is vastly different from simply not expecting it. It is >>>important to know that there is no inherent need for swap in the kernel >>>though - it is simply used as more "memory" (albeit slower, and with >>>some optimizations to work better with real memory) and if you don't >>>need it, you don't need it. >>> >> >>If I recollect a recent thread on LKML correctly, your 'no inherent need for >>swap' might be wrong. >> >>I think the gist was this: the kernel can sometimes needs to move bits of >>memory in order to free up dma-able ram, or lowmem. If I recall correctly, >>the kernel can only do this move via swap, even if there is stacks of free >>(non-dmaable or highmem) memory. >> >>I distinctly remember the moral of the thread being "Always mount some swap, >>if you can" >> >>This might have changed though, or I might have got it completely wrong. - >>I've cc'ed LKML incase somebody more knowledgeable can comment... >> > > > http://kerneltrap.org/node/view/3202 > Interesting - I was familiar with the original swappiness thread (http://kerneltrap.org/node/view/3000) but haven't seen anything since then (I mainly follow via kernel-traffic - enjoyable, but nowhere near real time). There's clearly been a bunch more discussion... Not to rehash the performance arguments, but it appears from my read of the kernel trap page referenced above that the primary argument for swap is still the performance argument - I didn't see anything referencing swap being necessary to move DMAable ram or lowmem. Was that posted previously on linux-kernel but not on kerneltrap? I'm still under the impression that "to swap or not" is a performance/policy/risk-management question, not a correctness question. If I'm wrong, I'd definitely like to know... -Mike ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid))
2005-01-06 22:57 ` Mike Hardy
@ 2005-01-06 23:15 ` Guy
2005-01-07  9:28 ` Andrew Walrond
0 siblings, 1 reply; 172+ messages in thread
From: Guy @ 2005-01-06 23:15 UTC (permalink / raw)
To: 'Mike Hardy', 'Jesper Juhl'
Cc: 'Andrew Walrond', linux-raid, linux-kernel

If I MUST/SHOULD have swap space....
Maybe I will create a RAM disk and use it for swap! :) :) :)

Guy

^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid))
2005-01-06 23:15 ` Guy
@ 2005-01-07  9:28 ` Andrew Walrond
2005-02-28 20:07 ` Guy
0 siblings, 1 reply; 172+ messages in thread
From: Andrew Walrond @ 2005-01-07  9:28 UTC (permalink / raw)
To: linux-kernel
Cc: Guy, 'Mike Hardy', 'Jesper Juhl', linux-raid, alan

On Thursday 06 January 2005 23:15, Guy wrote:
> If I MUST/SHOULD have swap space....
> Maybe I will create a RAM disk and use it for swap! :) :) :)

Well, indeed, I had the same thought. As long as you could guarantee that the ram was of the highmem/non-dmaable type...

But we're getting ahead of ourselves. I think we need an authoritative answer to the original premise. Perhaps Alan (cc-ed) might spare us a moment? Did I dream this up, or is it correct?

"I think the gist was this: the kernel sometimes needs to move bits of memory in order to free up dma-able ram, or lowmem. If I recall correctly, the kernel can only do this move via swap, even if there are stacks of free (non-dmaable or highmem) memory."

Andrew

^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid))
2005-01-07  9:28 ` Andrew Walrond
@ 2005-02-28 20:07 ` Guy
0 siblings, 0 replies; 172+ messages in thread
From: Guy @ 2005-02-28 20:07 UTC (permalink / raw)
To: 'Andrew Walrond', linux-kernel
Cc: 'Mike Hardy', 'Jesper Juhl', linux-raid, alan

I was just kidding about the RAM disk!

I think swapping to a RAM disk can't work. Let's assume a page is swapped out. Now the first page of swap space is used, and memory is now allocated for it. Now assume the process frees the memory: the page in swap can now be freed, but the RAM disk still has the memory allocated, just not used. Now if the Kernel were to swap out the first page of that RAM disk, it may be swapped to the first page of swap, which would change the data in the RAM disk which is being swapped out. So, I guess it can't be swapped, or must be re-swapped, or new memory is allocated. In any event, that 1 block will never be un-swapped, since it will never be needed. Each time the Kernel attempts to swap some of the RAM disk, the RAM disk's memory usage will increase. This will continue until all of the RAM disk is used and there is no available swap space left. Swap will be full of swap. :)

I hope that is clear! It makes my head hurt!

I don't know about lowmem or DMAable memory. But if special memory does exist.... It seems like if the Kernel can move memory to disk, it would be easier to move memory to memory. So, if special memory is needed, the Kernel should be able to relocate it as needed. Maybe no code exists to do that, but I think it would be easier to do than to swap to disk (assuming you have enough free memory).

Guy

^ permalink raw reply [flat|nested] 172+ messages in thread
* confused Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid) 2005-01-06 9:38 ` swap on RAID (was Re: swp - Re: ext3 journal on software raid) Andy Smith 2005-01-06 17:46 ` Mike Hardy @ 2005-01-07 1:31 ` Alvin Oga 2005-01-07 2:28 ` Andy Smith 1 sibling, 1 reply; 172+ messages in thread From: Alvin Oga @ 2005-01-07 1:31 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid On Thu, 6 Jan 2005, Andy Smith wrote: > On Wed, Jan 05, 2005 at 09:04:00PM -0800, Alvin Oga wrote: > > On Wed, 5 Jan 2005, Mike Hardy wrote: > > > > > There are those that think if they size the machine correctly, they > > > shouldn't need swap, and they're right. > > > > bingo > > Yet in an earlier reply you state that your machines have swap > configured. > > There is a difference between these two statements: since you're implying/saying i said those two statements below ( a and b ): you are one totally confused dude ... sorry to say ... i never said either of those two statements ... that is yur words and you're understanding of some mixture of lots of comments == grep "under normal load" sent-mail raid == ( pick any set of words of your "quote" for grep ) - it's not even in anybody comments posted, since i save all of my posts ( in sent-mail ) and not in anybody elses replies that is saved here ( thus my comment ... you're confused .. ) please refrain from making (re)quotes .. i didn't say so that i don't have to reply please do try to use: instead of "implying" i said something as in your "there is two different statments" in response to me which i didn't say either statments you're "quoting" incorrectly "from what i gather, i understand that ... " "i think this is ..." .. blah blah .. > a) I need some amount of swap configured but don't expect a > significant swap usage under normal load. > > b) I expect my server to always be using a significant > amount of swap. i personally do not expect any system i'm involved with to use any swap and if it does, i'd be adding more memory, as soon as they complain its running too slow when xxx or yyy job is running - they have choices of what they want done about it and it is NOT the same thing ... "not using swap" vs not creating one - i don't use swap and i rather the system not use it ... but i do create an itty bitty 250MB of swap partition even if there's is 2GB of system memory - in other cases ... like embedded systems ... it has no swap files .. no swap partitions adn things works just fine another example might be cell phones .. does those puppies have swap space ?? ( probably not ) c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: confused Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid) 2005-01-07 1:31 ` confused Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid) Alvin Oga @ 2005-01-07 2:28 ` Andy Smith 2005-01-07 13:04 ` Alvin Oga 0 siblings, 1 reply; 172+ messages in thread From: Andy Smith @ 2005-01-07 2:28 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 4187 bytes --] On Thu, Jan 06, 2005 at 05:31:48PM -0800, Alvin Oga wrote: > > > On Thu, 6 Jan 2005, Andy Smith wrote: > > > On Wed, Jan 05, 2005 at 09:04:00PM -0800, Alvin Oga wrote: > > > On Wed, 5 Jan 2005, Mike Hardy wrote: > > > > > > > There are those that think if they size the machine correctly, they > > > > shouldn't need swap, and they're right. > > > > > > bingo > > > > Yet in an earlier reply you state that your machines have swap > > configured. > > > > There is a difference between these two statements: > > since you're implying/saying i said those two statements > below ( a and b ): No I was not implying that you said either of them. Why is this hard to understand? Look at Mike's response to me; he understood what I was getting at perfectly. I was only trying to point out to you that what you've said isn't consistent with things you said in other posts. > you are one totally confused dude ... sorry to say ... Look. It is quite simple. Mike above (which I leave quoted so you believe me) said: There are those that think if they size the machine correctly, they shouldn't need swap, and they're right. To which *you* replied: bingo *However* in an earlier mail in this thread you said that your machines do have swap configured. Mike has already clarified that he meant that these people say that no swap needs to be configured. So my point was that while you say "bingo" to Mike, you do not in reality do as Mike describes. > i never said either of those two statements ... that is yur words > and you're understanding of some mixture of lots of comments They were indeed my words and I never claimed otherwise. > == grep "under normal load" sent-mail raid == > ( pick any set of words of your "quote" for grep ) > > - it's not even in anybody comments posted, since i save > all of my posts ( in sent-mail ) and not in anybody elses > replies that is saved here > ( thus my comment ... you're confused .. ) > > please refrain from making (re)quotes .. i didn't say so that > i don't have to reply I didn't attribute any words to you. I quite clearly said they were my interpretation. A direct quote: There is a difference between these two statements: a) I need some amount of swap configured but don't expect a significant swap usage under normal load. b) I expect my server to always be using a significant amount of swap. I interpret your views as (a) but I interpret Mike's as "there are people who are saying that a correctly sized machine can have zero swap configured." You now go on to say: > i personally do not expect any system i'm involved with to use any > swap and if it does, i'd be adding more memory, as soon as they > complain its running too slow when xxx or yyy job is running > - they have choices of what they want done about it Which appears to me to be pretty much exactly what (a) above says. So I'm at a loss to understand why you bothered to write this email complaining about being misrepresented. 
Now that we have in one email original unedited quotes of you saying you agree with someone who suggests configuring no swap, and then quotes of you saying you do configure swap, I hope you will now concede my point that you are being inconsistent. If you don't agree, fine, I'm confident that what's been said speaks for itself and don't feel like pursuing the matter further. Especially if it's going to result in you telling ME that I "must be an idiot" and am "confused." > and it is NOT the same thing ... "not using swap" vs not creating one Yes, exactly my point from two emails ago. > - i don't use swap and i rather the system not use it ... > but i do create an itty bitty 250MB of swap partition > even if there's is 2GB of system memory So you do configure swap, this is now my point from 3 emails ago. Why it has taken 3 *long* emails to get you to answer one simple question and concede a blatant inconsistency in your argument is beyond me. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: confused Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid) 2005-01-07 2:28 ` Andy Smith @ 2005-01-07 13:04 ` Alvin Oga 0 siblings, 0 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-07 13:04 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid On Fri, 7 Jan 2005, Andy Smith wrote: one last time ... stop saying/mis-quoting that i said what you're misquoting > > since you're implying/saying i said those two statements > > below ( a and b ): > > No I was not implying that you said either of them. ... note the above .. > Why is this hard to understand? good question ... to ask yourself .. > *However* in an earlier mail in this thread you said that your i did NOT say what you're quoting .... > > i never said either of those two statements ... that is yur words > > and you're understanding of some mixture of lots of comments > > They were indeed my words and I never claimed otherwise. see below > > please refrain from making (re)quotes .. i didn't say so that > > i don't have to reply > > I didn't attribute any words to you. I quite clearly said > they were my interpretation. A direct quote: stop this nonsense ... i never made the quotes you're re-citing incorrectly again and still, even if i told you to STOP making comments on my behalf as if i said that ... which i never did > There is a difference between these two statements: > > a) I need some amount of swap configured but don't > expect a significant swap usage under normal load. > > b) I expect my server to always be using a > significant amount of swap. > > I interpret your views how many times do i have to tell you ... it is NOT my view .... it is your (mis)interpretation, nothing close to what i said ... end of story .. for the last time do i need to send the lawyers to your doorstep to make you stop making incorrect statements and quotes that i never said c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-06 5:04 ` Alvin Oga 2005-01-06 6:18 ` Guy 2005-01-06 9:38 ` swap on RAID (was Re: swp - Re: ext3 journal on software raid) Andy Smith @ 2005-01-09 21:21 ` Mark Hahn 2005-01-09 22:20 ` Alvin Oga 2 siblings, 1 reply; 172+ messages in thread From: Mark Hahn @ 2005-01-09 21:21 UTC (permalink / raw) To: Alvin Oga; +Cc: linux-raid > the silly rule of 2x size of RAM == swap space came from the old days > when memory was 10x the costs of disks or some silly cost performance > that it made sense when grandpa was floating around ram is currently about $.1/MB, and disk is about $.0004/MB, so there is still a good reason to put idle pages onto swap disks. > by todays ram and disk pricing ... and cpu speeds ...2x memory sorta > goes out the door no. the cpu-speed argument is based on the fact that disk latencies are improving quite slowly, compared to ram latency (which is itself falling drastically behind cpu speeds.) this assumes that the argument for swap depends on swap latency, which it doesn't: swap pages are, ideally, *NEVER*READ*! the whole point is to choose anonymous pages which are so idle that they won't practically ever be touched. you *could* argue that the fraction of pages which can be profitably swapped is decreasing because "hot" items in memory are larger. it would be interesting to find out if that's true. certainly if only a few percent of ram is being used by idle anonymous pages, swapping has become irrelevant. ^ permalink raw reply [flat|nested] 172+ messages in thread
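[Plugging in the per-MB prices Mark quotes makes the argument concrete; a rough back-of-the-envelope calculation, with the 1GB figure purely illustrative:

    # cost of holding 1GB of idle anonymous pages in RAM vs. on a swap disk
    awk 'BEGIN { mb = 1024;
                 printf "RAM:  $%.2f\n", mb * 0.1;     # ~$102 at $0.10/MB
                 printf "disk: $%.2f\n", mb * 0.0004   # ~$0.41 at $0.0004/MB
               }'

That roughly 250:1 price ratio is what lies behind "there is still a good reason to put idle pages onto swap disks".]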
* Re: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-09 21:21 ` swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Mark Hahn @ 2005-01-09 22:20 ` Alvin Oga 0 siblings, 0 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-09 22:20 UTC (permalink / raw) To: Mark Hahn; +Cc: linux-raid hi ya mark On Sun, 9 Jan 2005, Mark Hahn wrote: > > the silly rule of 2x size of RAM == swap space came from the old days > > when memory was 10x the costs of disks or some silly cost performance > > that it made sense when grandpa was floating around > > ram is currently about $.1/MB, and disk is about $.0004/MB, > so there is still a good reason to put idle pages onto swap disks. yes.. as long as "swap (aka system thruput) speed" is not an issue and the alternative is to crash once memory is used up :-) > > by todays ram and disk pricing ... and cpu speeds ...2x memory sorta > > goes out the door > > no. the cpu-speed argument is based on the fact that disk latencies > are improving quite slowly, compared to ram latency (which is itself > falling drastically behind cpu speeds.) the 2x memory capacity "old rule of thumb" i was referring to was this: if a system has 500MB of real memory, the old rule of thumb says create ( 2x 500 ) 1GB of disk swap - whether that 1GB of swap should be one partition or many spread out across the disk is another PhD paper c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
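[As a worked example of the 2x rule of thumb Alvin describes, a minimal sketch; a swap file is used instead of a partition purely to keep it short, the path is hypothetical, and whether the rule is worth following at all is exactly what the thread is debating:

    # size swap at 2x RAM, per the old rule of thumb
    SWAP_MB=$(awk '/MemTotal/ { print int($2 * 2 / 1024) }' /proc/meminfo)
    dd if=/dev/zero of=/swapfile bs=1M count=$SWAP_MB
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile

For the 500MB machine in Alvin's example this yields roughly the 1GB of swap he mentions.]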
* RE: swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-06 0:57 ` Guy 2005-01-06 1:28 ` Mike Hardy @ 2005-01-06 5:01 ` Alvin Oga 1 sibling, 0 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-06 5:01 UTC (permalink / raw) To: Guy; +Cc: linux-raid hi ya guy On Wed, 5 Jan 2005, Guy wrote: > So, the question is: > Assume I have enough RAM to never need to swap. that depends on what you are using your box/system for, and for you to figure out based on what the system is supposed to do > Do I need any swap space with Linux? it's not required ... many embedded systems run without swap c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 15:33 ` Alvin Oga 2005-01-05 16:22 ` Michael Tokarev 2005-01-05 16:23 ` Andy Smith @ 2005-01-05 17:07 ` Guy 2005-01-05 17:21 ` Alvin Oga 2005-01-05 17:26 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) David Greaves 2 siblings, 2 replies; 172+ messages in thread From: Guy @ 2005-01-05 17:07 UTC (permalink / raw) To: 'Alvin Oga'; +Cc: linux-raid RAID does not cause bad data! Bad data can get on any disk, even if it is part of a RAID system. The bad data does not come from the hard disk, CRCs prevent that. The problem is: Where does the bad data come from? Bad memory? No, everyone use ECC memory, right? Guy -----Original Message----- From: Alvin Oga [mailto:aoga@ns.Linux-Consulting.com] Sent: Wednesday, January 05, 2005 10:34 AM To: Guy Cc: linux-raid@vger.kernel.org Subject: RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) On Wed, 5 Jan 2005, Guy wrote: > I agree, but for a different reason. Your reason is new to me. .. > Loosing the swap disk would kill the system. if one is using swap space ... i'd add more memory .. before i'd use raid - swap is too slow and as you folks point out, it could die due to (unlikely) bad disk sectors in swap area > I don't want a down system due to a single disk failure. that's what raid's for :-) > I mirror everything, or RAID5. Normally, no downtime due to disk failures. the problem with mirror ( raid1 ).. or raid5 ... - if you have a bad diska ... all "bad data" will/could also get copied to the good disk - "bad data" is hard to figure out in code ... to prevent it from getting copied ... how does it know with 100% certainty - if you know why it's bad data, it's lot easier to know which data is more correct than the bad one - as everybody has pointed out .. bad data ( disk errors ) can occur for any number of gazillion reasons have fun raiding alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 17:07 ` Guy @ 2005-01-05 17:21 ` Alvin Oga 2005-01-05 17:32 ` Guy 2005-01-05 17:34 ` ECC: RE: ext3 blah blah blah Gordon Henderson 1 sibling, 2 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-05 17:21 UTC (permalink / raw) To: Guy; +Cc: 'Alvin Oga', linux-raid hi ya guy On Wed, 5 Jan 2005, Guy wrote: > RAID does not cause bad data! again .. that's not what i said .. the point i've been trying to make .. one MUST figure out "exactly" where the bad data or errors or crashes are coming from it will usually be bad ("cheap") parts or operator error or simple "slapped" together boxes > Bad data can get on any disk, even if it is part of a RAID system. > The bad data does not come from the hard disk, CRCs prevent that. > > The problem is: Where does the bad data come from? > Bad memory? No, everyone use ECC memory, right? that's what i was saying... but you do a better job even if one uses ecc... ecc can only fix certain errors, and non-correctable errors are flagged and not fixed ( burst errors are hard to fix ) and people that use ecc memory ... usually have higher-end motherboards and more memory in the system vs the $50 motherboard+cpu combo (disasters) from fries ( a local pc store ) c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
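[Whether the "everyone uses ECC, right?" assumption actually holds, and whether corrected errors are being logged, can be checked on kernels that carry the EDAC (formerly bluesmoke) driver for the memory controller; a rough sketch, assuming such a kernel and the usual EDAC sysfs layout (paths vary by kernel version):

    dmesg | grep -i -e edac -e ecc                                 # did a memory-controller driver load?
    grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null   # corrected (single-bit) error counts
    grep . /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null   # uncorrectable error counts

A steadily growing ce_count is exactly the "ecc can only fix certain errors" situation Alvin describes, caught before it becomes an uncorrectable one.]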
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 17:21 ` Alvin Oga @ 2005-01-05 17:32 ` Guy 2005-01-05 18:37 ` Alvin Oga 2005-01-05 17:34 ` ECC: RE: ext3 blah blah blah Gordon Henderson 1 sibling, 1 reply; 172+ messages in thread From: Guy @ 2005-01-05 17:32 UTC (permalink / raw) To: 'Alvin Oga'; +Cc: linux-raid You said this: " - as everybody has pointed out .. bad data ( disk errors ) can occur for any number of gazillion reasons have fun raiding Alvin" Bad data is not caused by disk errors. This seemed like you were blaming the hard disk. I understand you now. It seems we agree! Guy -----Original Message----- From: Alvin Oga [mailto:aoga@ns.Linux-Consulting.com] Sent: Wednesday, January 05, 2005 12:22 PM To: Guy Cc: 'Alvin Oga'; linux-raid@vger.kernel.org Subject: RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) hi ya guy On Wed, 5 Jan 2005, Guy wrote: > RAID does not cause bad data! again .. that's not what i said .. the point i've been tring to say .. one MUST figure out "exactly" where the bad data or errors or crashes are coming from it will usually be bad ("cheap") parts or operator error or simple "slapped" together boxes > Bad data can get on any disk, even if it is part of a RAID system. > The bad data does not come from the hard disk, CRCs prevent that. > > The problem is: Where does the bad data come from? > Bad memory? No, everyone use ECC memory, right? that's what i was saying... but you do a better job even if one uses ecc... ecc can only fix certain errors, and non-correctable errors are flagged and not fixed ( burst errors are hard to fix ) and people that use ecc memory ... usually have higher-end motherboards and more memory in the system vs the $50 motherboard+cpu combo (disasters) from fries ( a local pc store ) c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 17:32 ` Guy @ 2005-01-05 18:37 ` Alvin Oga 0 siblings, 0 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-05 18:37 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Wed, 5 Jan 2005, Guy wrote: > You said this: > " - as everybody has pointed out .. bad data ( disk errors ) > can occur for any number of gazillion reasons ... > Bad data is not caused by disk errors. This seemed like you were blaming > the hard disk. I understand you now. It seems we agree! i should have said "wrong data", where i don't care why the data is not what one expects... due to failures, errors, problems, features, bugs ... and the point is .. if the data is wrong ... find out precisely why it's wrong ... and experiment to confirm the suspicions - usually you will toss out one hardware item at a time to isolate the problem c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* ECC: RE: ext3 blah blah blah ... 2005-01-05 17:21 ` Alvin Oga 2005-01-05 17:32 ` Guy @ 2005-01-05 17:34 ` Gordon Henderson 2005-01-05 18:33 ` Alvin Oga 1 sibling, 1 reply; 172+ messages in thread From: Gordon Henderson @ 2005-01-05 17:34 UTC (permalink / raw) To: linux-raid On Wed, 5 Jan 2005, Alvin Oga wrote: > even if one uses ecc... ecc can only fix certain errors, > and non-correctable errors are flagged and not fixed > ( burst errors are hard to fix ) > > and people that use ecc memory ... usually have higher-end motherboards > and more memory in the system vs the $50 motherboard+cpu combo (disasters) > from fries ( a local pc store ) One thing that's been irritating me for a long time is that it's all very well using (and paying for!) ECC memory, but there are parts of the system that don't have ECC or parity - e.g. the PCI bus, processor bus and internal paths, and so on... But where do you draw the line? Gordon ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ECC: RE: ext3 blah blah blah ... 2005-01-05 17:34 ` ECC: RE: ext3 blah blah blah Gordon Henderson @ 2005-01-05 18:33 ` Alvin Oga 0 siblings, 0 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-05 18:33 UTC (permalink / raw) To: Gordon Henderson; +Cc: linux-raid On Wed, 5 Jan 2005, Gordon Henderson wrote: > One thing thats been irritating me for a long time is that it's all very > well using (and paying for!) ECC memory, but there are parts of the system > that don't have ECC or parity - eg. the PCI bus, processor bus and > internal paths, and so on... bingo !!! > But where do you draw the line? because the silly motherboard will not work at all without those expensive registered ecc memory and you want that motherboard because it has the memory capacity you want or the support for the fastest cpu around or blah features that no other mb has c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 17:07 ` Guy 2005-01-05 17:21 ` Alvin Oga @ 2005-01-05 17:26 ` David Greaves 2005-01-05 18:16 ` Peter T. Breuer 2005-01-05 18:26 ` Guy 1 sibling, 2 replies; 172+ messages in thread From: David Greaves @ 2005-01-05 17:26 UTC (permalink / raw) To: Guy; +Cc: linux-raid Guy wrote: >RAID does not cause bad data! > > Guy - how can you speak such heresy!! ;) Haven't you seen the special 'make_undetectable_error(float p)' function? >Bad data can get on any disk, even if it is part of a RAID system. >The bad data does not come from the hard disk, CRCs prevent that. > >The problem is: Where does the bad data come from? >Bad memory? No, everyone use ECC memory, right? > > I like to blame cosmic rays - I just like the image ;) Of course voltage fluctuations, e/m interference, thermal variations, mechanical interfaces, dust capacitance/resistance, insects shorts... anywhere inside the tin box between the media and the corners of the motherboard. It's amazing that some machines even boot. David ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 17:26 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) David Greaves @ 2005-01-05 18:16 ` Peter T. Breuer 2005-01-05 18:28 ` Guy 1 sibling, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-05 18:16 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > Haven't you seen the special 'make_undetectable_error(float p)' function? dd if=/dev/urandom of=/dev/hda1 bs=512 count=1 seek=$RANDOM Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 18:16 ` Peter T. Breuer @ 2005-01-05 18:28 ` Guy 0 siblings, 0 replies; 172+ messages in thread From: Guy @ 2005-01-05 18:28 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid I think that is in some boot-up scripts! :) But only if you don't have mirrors! :) -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Wednesday, January 05, 2005 1:16 PM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) David Greaves <david@dgreaves.com> wrote: > Haven't you seen the special 'make_undetectable_error(float p)' function? dd if=/dev/urandom of=/dev/hda1 bs=512 count=1 seek=$RANDOM Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 17:26 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) David Greaves 2005-01-05 18:16 ` Peter T. Breuer @ 2005-01-05 18:26 ` Guy 1 sibling, 0 replies; 172+ messages in thread From: Guy @ 2005-01-05 18:26 UTC (permalink / raw) To: 'David Greaves'; +Cc: linux-raid You forgot one! Peter said: "There might be a small temporal displacement." This worries me! RED ALERT! Beam me up! Wait, I will take the shuttle craft please. "You cannot change the laws of physics!" - Scotty But, really, I do understand what Peter is saying. It just seemed too funny to me. And since I have disk read errors more than temporal displacement, I need RAID more than a temporal dampening field. Maybe a temporal phase converter? And, a single non-RAID disk can suffer from temporal displacement. I love Star Trek, so I love temporal displacements! :) Guy -----Original Message----- From: David Greaves [mailto:david@dgreaves.com] Sent: Wednesday, January 05, 2005 12:27 PM To: Guy Cc: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Guy wrote: >RAID does not cause bad data! > > Guy - how can you speak such heresy!! ;) Haven't you seen the special 'make_undetectable_error(float p)' function? >Bad data can get on any disk, even if it is part of a RAID system. >The bad data does not come from the hard disk, CRCs prevent that. > >The problem is: Where does the bad data come from? >Bad memory? No, everyone use ECC memory, right? > > I like to blame cosmic rays - I just like the image ;) Of course voltage fluctuations, e/m interference, thermal variations, mechanical interfaces, dust capacitance/resistance, insects shorts... anywhere inside the tin box between the media and the corners of the motherboard. It's amazing that some machines even boot. David ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 15:17 ` Guy 2005-01-05 15:33 ` Alvin Oga @ 2005-01-05 15:48 ` Peter T. Breuer 1 sibling, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-05 15:48 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > I agree, but for a different reason. Your reason is new to me. > I don't want a down system due to a single disk failure. > Loosing the swap disk would kill the system. > > Maybe this is Peter's cause of frequent corruption? I wouldn't have said there were frequent failures! There are relatively frequent disk errors, at the rate of probably 10 bits per hundred machines per week. I do not think that is very frequent. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard 2004-12-30 17:39 ` Peter T. Breuer 2004-12-30 17:53 ` Sandro Dentella 2004-12-30 19:50 ` Michael Tokarev @ 2005-01-07 6:21 ` Clemens Schwaighofer 2005-01-07 9:39 ` Andy Smith 2 siblings, 1 reply; 172+ messages in thread From: Clemens Schwaighofer @ 2005-01-07 6:21 UTC (permalink / raw) To: Peter T. Breuer, linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 12/31/2004 02:39 AM, Peter T. Breuer wrote: > In gmane.linux.raid Georg C. F. Greve <greve@fsfeurope.org> wrote: > >>The message I saw on the remote console when it crashed with pure ext3 >>on raid5 was: >> >> Assertion failure in journal_start() at fs/jbd/transaction.c:271: "handle->h_transaction->t_journal == journal" >> > > Yes, well, don't put the journal on the raid partition. Put it > elsewhere (anyway, journalling and raid do not mix, as write ordering > is not - deliberately - preserved in raid, as far as I can tell). Thats a very new claim. I never ever heard of that. I have a lot of boxes running with software raid (1 or 5) and they run either XFS or ext3 on it, and since one year I never had a single problem (and they all use 2.6.7 or 2.6.8.1 kernels). - -- [ Clemens Schwaighofer -----=====:::::~ ] [ TBWA\ && TEQUILA\ Japan IT Group ] [ 6-17-2 Ginza Chuo-ku, Tokyo 104-0061, JAPAN ] [ Tel: +81-(0)3-3545-7703 Fax: +81-(0)3-3545-7343 ] [ http://www.tequila.co.jp http://www.tbwajapan.co.jp ] -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.6 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFB3ipTjBz/yQjBxz8RAsExAJ9ZhniY/vsm9TIpRtHhGsfdVjws6wCeO4RS rJ/o4roBjSBo3z5os/E6tKk= =d3Rn -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard 2005-01-07 6:21 ` PROBLEM: Kernel 2.6.10 crashing repeatedly and hard Clemens Schwaighofer @ 2005-01-07 9:39 ` Andy Smith 0 siblings, 0 replies; 172+ messages in thread From: Andy Smith @ 2005-01-07 9:39 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 550 bytes --] On Fri, Jan 07, 2005 at 03:21:08PM +0900, Clemens Schwaighofer wrote: > On 12/31/2004 02:39 AM, Peter T. Breuer wrote: > > Yes, well, don't put the journal on the raid partition. Put it > > elsewhere (anyway, journalling and raid do not mix, as write ordering > > is not - deliberately - preserved in raid, as far as I can tell). > > Thats a very new claim. I never ever heard of that. Yes, and neither have most people I talked to about this, which is why I think this deserves to be explored more and any gotchas documented somewhere. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) @ 2005-01-03 9:30 Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-03 9:30 UTC (permalink / raw) To: linux raid "Also sprach Guy:" > "Well, you can make somewhere. You only require an 8MB (one cylinder) > partition." > > So, it is ok for your system to fail when this disk fails? You lose the journal. I don't remember what ext3fs does in that case. I seem to recall that it feels ill and goes into read-only mode, but it may feel sicker and just go toes up. I don't recall. That journals kept dying on me for systems like /var taught me to put them on disposable media when I did use to use them there ... You can definitely react with a simple tune2fs -O ^journal or whatever is appropriate. I know that because Ted Tso added the "force" option to tune2fs at my request, when I showed him a system that had lost its journal and wouldn't remove it from its metadata BECAUSE the journal was not there to be removed. > I don't want system failures when a disk fails, Your scenario seems to be that you have the disks of your mirror on the same physical system. That's fundamentally dangerous - they're both subject to damage when the system blows up. I instead have an array node (where the journal is kept), and a local mirror component and a remote mirror component. That system is doubled, and each half of the double hosts the other's remote mirror component. Each half fails over to the other. There it makes sense to have the journal separate, but local on the array node(s), not mirrored. The FS on the remote mirror is guaranteed consistent, because of the local journal. That's what I want. I don't know precisely what would be in the journal on the remote side if I mirrored it, but I am sure I would want to roll it back, not complete it (the completer just died!), so why have it there at all? > so mirror (or RAID5) > everything required to keep your system running. > > "And there is a risk of silent corruption on all raid systems - that is > well known." > I question this.... ? > I bet a non-mirror disk has similar risk as a RAID1. But with a RAID1, you The corruption risk is doubled for a 2-way mirror, and there is a 50% chance of it not being detected at all even if you try and check for it, because you may be reading from the wrong mirror at the time you pass over the imperfection in the check. Isn't that simply the most naive calculation? So why would you make your bet! And then you don't generally check at all, ever. But whether you check or not, corruptions simply have only a 50% chance of being seen (you look on the wrong mirror when you look), and a 200% chance of occurring (twice as much real estate) wrt normal rate. In contrast, on a single disk they have a 100% chance of detection (if you look!) and a 100% chance of occurring, wrt normal rate. > know when a difference occurs, if you want. How? Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
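[For reference, the journal juggling Peter alludes to can be done with stock e2fsprogs; a minimal sketch, assuming an ext3 filesystem on /dev/md0 and a small spare partition /dev/sda3 (both names hypothetical), run with the filesystem unmounted:

    # drop the internal journal (the filesystem then behaves as plain ext2)
    tune2fs -O ^has_journal /dev/md0
    e2fsck -f /dev/md0

    # or move the journal to a small external device instead
    mke2fs -b 4096 -O journal_dev /dev/sda3    # block size must match the filesystem's
    tune2fs -J device=/dev/sda3 /dev/md0

Which of these, if either, is a good idea on top of md is exactly what this thread is arguing about.]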
[parent not found: <200501030916.j039Gqe23568@inv.it.uc3m.es>]
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) [not found] <200501030916.j039Gqe23568@inv.it.uc3m.es> @ 2005-01-03 10:17 ` Guy 2005-01-03 11:31 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: Guy @ 2005-01-03 10:17 UTC (permalink / raw) To: ptb, 'linux raid' See notes below with **. Guy -----Original Message----- From: ptb@inv.it.uc3m.es [mailto:ptb@inv.it.uc3m.es] Sent: Monday, January 03, 2005 4:17 AM To: Guy Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) "Also sprach Guy:" > "Well, you can make somewhere. You only require an 8MB (one cylinder) > partition." > > So, it is ok for your system to fail when this disk fails? You lose the journal, that's all. You can react with a simple tune2fs -O ^journal or whatever is appropriate. And a journal is ONLY there in order to protect you against crashes of the SYSTEM (not the disk), so what was the point of having the journal in the first place? ** When you lose the journal, does the system continue without it? ** Or does it require user intervention? > I don't want system failures when a disk fails, "don't use a journal then" seems to be the easy answer for you, but probably "put it on an ultrasafe medium like gold-plated persistent ram" works better! **RAM will be lost if you crash or lose power. Your scenario seems to be that you have the disks of your mirror on the ame physical system. That's funamentally dangerous. They're both subject to damage when the system blows up. I don't. I have an array node (where the journal is kept), and a local mirror component and a remote mirror component. That system is doubled, and each half of the double hosts the others remote mirror component. Each half fails over to the other. ** So, you have 2 systems, 1 fails and the "system" switches to the other. ** I am not going for a 5 nines system. ** I just don't want any down time if a disk fails. ** A disk failing is the most common failure a system can have (IMO). ** In a computer room with about 20 Unix systems, in 1 year I have seen 10 or so disk failures and no other failures. ** Are your 2 systems in the same state? ** They should be at least 50 miles apart (at a minimum). ** Otherwise if your data center blows up, your system is down! ** In my case, this is so rare, it is not an issue. ** Just use off-site tape backups. ** My computer room is for development and testing, no customer access. ** If the data center is gone, the workers have nowhere to work anyway (in my case). ** Some of our customers do have failover systems 50+ miles apart. > so mirror (or RAID5) > everything required to keep your system running. > > "And there is a risk of silent corruption on all raid systems - that is > well known." > I question this.... Why! ** You lost me here. I did not make the above statement. But, in the case of RAID5, I believe it can occur. Your system crashes while a RAID5 stripe is being written, but the stripe is not completely written. During the re-sync, the parity will be adjusted, but it may be more current than 1 or more of the other disks. But this would be similar to what would happen to a non-RAID disk (some data not written). ** Also with RAID1 or RAID5, if corruption does occur without a crash or re-boot, then a disk fails, the corrupt data will be copied to the replacement disk. With RAID1 a 50% risk of copying the corruption, and 50% risk of correcting the corruption. 
With RAID5, risk % depends on the number of disks in the array. > I bet a non-mirror disk has similar risk as a RAID1. But with a RAID1, you The corruption risk is doubled for a 2-way mirror, and there is a 50% chance of it not being detected at all even if you try and check for it, because you may be reading from the wrong mirror at the time you pass over the imperfection in the check. ** After a crash, md will re-sync the array. ** But during the re-sync, md could be checking for differences and reporting them. ** It won't help correct anything, but it could explain why you may be having problems with your data. ** Since md re-syncs after a crash, I don't think the risk is double. Isn't that simply the most naive calculation? So why would you make your bet? ** I don't understand this. And then of course you don't generally check at all, ever. ** True, But I would like md to report when a mirror is wrong. ** Or a RAID5 parity is wrong. But whether you check or not, corruptions simply have only a 50% chancce of being seen (you look on the wrong mirror when you look), and a 200% chance of occuring (twice as much real estate) wrt normal rate. ** Since md re-syncs after a crash, I don't think the risk is double. ** Also, I don't think most corruption would be detectable (ignoring a RAID problem). ** It depends to the type of data. ** Example: Your MP3 collection would go undetected until someone listened to the corrupt file. In contrast, on a single disk they have a 100% chance of detection (if you look!) and a 100% chance of occuring, wrt normal rate. ** Are you talking about the disk drive detecting the error? ** If so, are you referring to a read error or what? ** Please explain the nature of the detectable error. > know when a difference occurs, if you want. How? ** Compare the 2 halves or the RAID1, or check the parity of RAID5. Peter ** Guy ^ permalink raw reply [flat|nested] 172+ messages in thread
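[Guy's suggestion of comparing the two halves of a RAID1 can be approximated from userspace; a rough sketch, with hypothetical member names, with the array stopped so nothing is writing to the members, and with the caveat that the md metadata near the end of each device will legitimately differ:

    mdadm --stop /dev/md0      # make sure nothing writes to the members
    cmp /dev/sda1 /dev/sdb1    # byte-for-byte comparison of the two halves
    # a difference only in the last ~64-128KB is just the md superblock
    # (0.90-format metadata lives at the end of each member);
    # anything earlier means the mirrors really have diverged

As the follow-up discussion notes, even a detected difference does not tell you which copy is the "correct" one.]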
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 10:17 ` Guy @ 2005-01-03 11:31 ` Peter T. Breuer 2005-01-03 17:34 ` Guy 2005-01-03 17:46 ` maarten 0 siblings, 2 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-03 11:31 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > "Also sprach Guy:" > > "Well, you can make somewhere. You only require an 8MB (one cylinder) > > partition." > > > > So, it is ok for your system to fail when this disk fails? > > You lose the journal, that's all. You can react with a simple tune2fs > -O ^journal or whatever is appropriate. And a journal is ONLY there in > order to protect you against crashes of the SYSTEM (not the disk), so > what was the point of having the journal in the first place? > > ** When you lose the journal, does the system continue without it? > ** Or does it require user intervention? I don't recall. It certainly at least puts itself into read-only mode (if that's the error mode specified via tune2fs). And the situation probably changes from version t version. On a side note, I don't know why you think user intervention is not required when a raid system dies. As a matter of liklihoods, I have never seen a disk die while IN a working soft (or hard) raid system, and the system continue working afterwards, instead the normal disaster sequence as I have experienced it is: 1) lightning strikes rails, or a/c goes out and room full of servers overheats. All lights go off. 2) when sysadmin arrives to sort out the smoking wrecks, he finds that 1 in 3 random disks are fried - they're simply the points of failure that died first, and they took down the hardware with them. 3) sysadmin buys or jury-rigs enough pieces of nonsmoking hardware to piece together the raid arrays from the surviving disks, and hastily does a copy to somewhere very safe and distant, while an assistant holds off howling hordes outside the door with a shutgun. In this scenario, a disk simply acts as the weakest link in a fuse chain, and the whole chain goes down. But despite my dramatisation it is likely that a hardware failure will take out or damage your hardware! Ide disks live on an electric bus conected to other hardware. Try a shortcircuit and see what happens. You can't even yank them out while the bus is operating if you want to keep your insurance policy. For scsi the situation is better wrt hot-swap, but still not perfect. And you have the electric connections also. That makes it likely that a real nasty hardware failure will do nasty things (tm) to whatever is in the same electric environment. It is possible if not likely that you will lose contact with scsi disks further along the bus, if you don't actually blow the controller. That said, there ARE situations which raid protects you from - simply a "gentle disconnect" (a totally failed disk that goes open circuit), or a "gradual failure" (a disk that runs out of spare sectors). In the latter case the raid will fail the disk completely at the first detected error, which may well be what you want (or may not be!). However, I don't see how you can expect to replace a failed disk without taking down the system. 
For that reason you are expected to be running "spare disks" that you can virtually insert hot into the array (caveat, it is possible with scsi, but you will need to rescan the bus, which will take it out of commission for some seconds, which may require you to take the bus offline first, and it MAY be possible with recent IDE buses that purport to support hotswap - I don't know). So I think the relevant question is: "what is it that you are protecting yourself from by this strategy of yours". When you have the scenario, you can evaluate risks. > > I don't want system failures when a disk fails, > > Your scenario seems to be that you have the disks of your mirror on the > same physical system. That's fundamentally dangerous. They're both > subject to damage when the system blows up. [ ... ] I have an array > node (where the journal is kept), and a local mirror component and a > remote mirror component. > > That system is doubled, and each half of the double hosts the others > remote mirror component. Each half fails over to the other. > ** So, you have 2 systems, 1 fails and the "system" switches to the other. > ** I am not going for a 5 nines system. > ** I just don't want any down time if a disk fails. Well, (1) how likely is it that a disk will fail without taking down the system (2) how likely is it that a disk will fail (3) how likely is it that a whole system will fail I would say that (2) is about 10% per year. I would say that (3) is about 1200% per year. It is therefore difficult to calculate (1), which is your protection scenario, since it doesn't show up very often in the stats! > ** A disk failing is the most common failure a system can have (IMO). Not in my experience. See above. I'd say each disk has about a 10% failure expectation per year. Whereas I can guarantee that an unexpected system failure will occur about once a month, on every important system. If you think about it that is quite likely, since a system is by definition a complicated thing. And then it is subject to all kinds of horrible outside influences, like people rewiring the server room in order to reroute cables under the floor instead of through he ceiling, and the maintenance people spraying the building with insecticide, everywhere, or just "turning off the electricity in order to test it" (that happens about four times a year here - hey, I remember when they tested the giant UPS by turning off the electricity! Wrong switch. Bummer). Yes, you can try and keep these systems out of harms way on a colocation site, or something, but by then you are at professional level paranoia. For "home systems", whole system failures are far more common than disk failures. I am not saying that RAID is useless! Just the opposite. It is a useful and EASY way of allowing you to pick up the pieces when everything falls apart. In contrast, running a backup regime is DIFFICULT. > ** In a computer room with about 20 Unix systems, in 1 year I have seen 10 > or so disk failures and no other failures. Well, let's see. If each system has 2 disks, then that would be 25% per disk per year, which I would say indicates low quality IDE disks, but is about the level I would agree with as experiential. > ** Are your 2 systems in the same state? No, why should they be? > ** They should be at least 50 miles apart (at a minimum). They aren't - they are in two different rooms. Different systems copy them every day to somewhere else. I have no need for instant survivability across nuclear attack. 
> ** Otherwise if your data center blows up, your system is down! True. So what? I don't care. The cost of such a thing is zero, because if my data center goes I get a lot of insurance money and can retire. The data is already backed up elsewhere if anyone cares. > ** In my case, this is so rare, it is not an issue. It's very common here and everywhere else I know! Think about it - if your disks are in the same box then it is statistically likely that when one disks fails it is BECAUSE of some local cause, and that therefore the other disk will also be affected by it. It's your "if your date center burns down" reasoning, applied to your box. > ** Just use off-site tape backups. No way! I hate tapes. I backup to other disks. > ** My computer room is for development and testing, no customer access. Unfortunately, the admins do most of the sabotage. > ** If the data center is gone, the workers have nowhere to work anyway (in > my case). I agree. Therefore who cares. OTOH, if only the server room smokes out, they have plenty of places to work, but nothing to work on. Tut tut. > ** Some of our customers do have failover systems 50+ miles apart. Banks don't (hey, I wrote the interbank communications encryption software here on the peninsula) here. They have tapes. As far as I know, the tapes are sent to vaults. It often happens that their systems go down. In fact, I have NEVER managed to connect via the internet page to their systems at a time when they were in working order. And I have been trying on and off for about three years. And I often have been in the bank managers office (discussing mortgages, national debt, etc.) when the bank's internal systems have gone down, nationwide. > > so mirror (or RAID5) > > everything required to keep your system running. > > > > "And there is a risk of silent corruption on all raid systems - that is > > well known." > > I question this.... > > Why! > ** You lost me here. I did not make the above statement. Yes you did. You can see from the quoting that you did. > But, in the case > of RAID5, I believe it can occur. So do I. I am asking why you "question this"> > Your system crashes while a RAID5 stripe > is being written, but the stripe is not completely written. This is fairly meaningless. I don't now what precise meaning the word "stripe" has in raid5 but it's irrelevant. Simply, if you write redundant data, whatever way you write it, raid 1, 5, 6 or whatever, there is a possibility that you write only one of the copies before the system goes down. Then when the system comes up it has two different sources of data to choose to believe. > During the > re-sync, the parity will be adjusted, See. "when the system comes up ...". There is no need to go into detail and I don't know why you do! > but it may be more current than 1 or > more of the other disks. But this would be similar to what would happen to > a non-RAID disk (some data not written). No, it would not be similar. You don't seem to understand the mechanism. The mechanism for corruption is that there are two different versions of the data available when the system comes back up, and you and the raid system don't know which is more correct. Or even what it means to be "correct". Maybe the earlier written data is "correct"! > ** Also with RAID1 or RAID5, if corruption does occur without a crash or > re-boot, then a disk fails, the corrupt data will be copied to the > replacement disk. Exactly so. It's a generic problem with redundant data sources. 
You don't know which one to believe when they disagree! > With RAID1 a 50% risk of copying the corruption, and 50% > risk of correcting the corruption. With RAID5, risk % depends on the number > of disks in the array. It's the same. There are two sources of data that you can believe. The "real data on disk" or "all the other data blocks in the 'stripe' plus the parity block". You get to choose which you believe. > > I bet a non-mirror disk has similar risk as a RAID1. But with a RAID1, you > > The corruption risk is doubled for a 2-way mirror, and there is a 50% > chance of it not being detected at all even if you try and check for it, > because you may be reading from the wrong mirror at the time you pass > over the imperfection in the check. > ** After a crash, md will re-sync the array. It doesn't know which disk to believe is correct. There is a stamp on the disks superblocks, but it is only updated every so often. If the whole system dies while both disks are OK, I don't know what will be stamped or what will happen (which will be believed) at resync. I suspect it is random. I would appreciate clarificaton from Neil. > ** But during the re-sync, md could be checking for differences and > reporting them. It could. That might be helpful. > ** It won't help correct anything, but it could explain why you may be > having problems with your data. Indeed, it sounds a good idea. It could slow down RAID1 resync, but I don't think the impact on RAID5 would be noticable. > ** Since md re-syncs after a crash, I don't think the risk is double. That is not germane. I already pointed out that you are 50% likely to copy the "wrong" data IF you copy (and WHEN you copy). Actually doing the copy merely brings that calculation into play at the moment of the resync, instead of later, at the moment when one of the two disks actually dies and yu have to use the remaining one. > Isn't that simply the most naive calculation? So why would you make > your bet? > ** I don't understand this. Evidently! :) > And then of course you don't generally check at all, ever. > ** True, But I would like md to report when a mirror is wrong. > ** Or a RAID5 parity is wrong. Software raid does not spin off threads randomly checking data. If you don't use it, you don't get to check at all. So just leaving disks sitting there exposes them to corruption that is checked least of all. > But whether you check or not, corruptions simply have only a 50% chancce > of being seen (you look on the wrong mirror when you look), and a 200% > chance of occuring (twice as much real estate) wrt normal rate. > ** Since md re-syncs after a crash, I don't think the risk is double. It remains double whatever you think. The question is whether you detect it or not. You cannot detect it without checking. > ** Also, I don't think most corruption would be detectable (ignoring a RAID > problem). You wouldn't know which disk was right. The disk might know if it was a hardware problem Incidentally, I wish raid would NOT offline the disk when it detects a read error. It should fall back to the redundant data. I may submit a patch for that. In 2.6 the raid system may even do that. The resync thread comments SAY that it retries reads. I don't know if it actually does. Neil? > ** It depends to the type of data. > ** Example: Your MP3 collection would go undetected until someone listened > to the corrupt file. :-). > In contrast, on a single disk they have a 100% chance of detection (if > you look!) and a 100% chance of occuring, wrt normal rate. 
> ** Are you talking about the disk drive detecting the error? No. You are quite right. I should categorise the types of error more precisely. We want to distinguish 1) hard errors (detectable by the disk firmware) 2) soft errors (not detected by the above) > ** If so, are you referring to a read error or what? Read. > ** Please explain the nature of the detectable error. "wrong data on the disk or as read from the disk". Define "wrong"! > > know when a difference occurs, if you want. > > How? > ** Compare the 2 halves or the RAID1, or check the parity of RAID5. You wouldn't necessarily know which of the two data sources was "correct". Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
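[Earlier in this message Peter mentions running "spare disks" that drop into the array automatically; with mdadm that amounts to something like the following sketch (device names hypothetical):

    mdadm /dev/md0 --add /dev/sdd1    # on a healthy array this member becomes a hot spare
    mdadm --detail /dev/md0           # the new member is listed as "spare"

    # when a member later fails (or is failed by hand), md rebuilds onto the spare
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
    cat /proc/mdstat                  # watch the recovery progress

Physically swapping the dead drive afterwards still involves the bus-rescan or hotplug caveats Peter raises.]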
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 11:31 ` Peter T. Breuer @ 2005-01-03 17:34 ` Guy 2005-01-03 17:46 ` maarten 1 sibling, 0 replies; 172+ messages in thread From: Guy @ 2005-01-03 17:34 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Having a filesystem go into read only mode is a "down system". Not acceptable to me! Maybe ok for a home system, but I don't assume Linux is limited to home use. In my case, this is not acceptable for my home system. Time is money! About user intervention. If the system stops working until someone does something, that is a down system. That is what I meant by user intervention. Replacing a disk Monday that failed Friday night, is what I would expect. This is a normal failure to me. Even if a re-boot is required, as long as it can be scheduled, it is acceptable to me. You and I have had very different failures over the years! In my case, most failures are disks, and most of the time the system continues to work just fine, without user intervention. If spare disks are configured, the array re-builds to the spare. At my convenience, I replace the disk, without a system re-boot. Most Unix systems I have used have SCSI disks. IDE tends to be in home systems. My home system is Linux with 17 SCSI disks. I have replaced a disk without a re-boot, but the disk cabinet is not hot-swap, so I tend to shut down the system to replace a disk. My 20 systems had anywhere from 4 to about 44 disks. You should expect 1 disk failure out of 25-100 disks per year. There are good years and bad! Our largest customer system has more than 300 disks. I don't know the failure rate, but most failures do not take the system down! Our customer systems tend to have hardware RAID systems. HP, EMC, DG (now EMC). If you have a 10% disk failure rate per year, something else is wrong! You may have a bad building ground, or too much current flowing on the building ground line. All sorts of power problems are very common. Most if not all electricians only know the building code. They are not qualified to debug all power problems. I once talked to an expert in the field. He said thunder causes more power problems than lighting! Most buildings use conduit for ground, no separate ground wire. The thunder will shake the conduit and loosen the connections. This causes a bad ground during the thunder, which could crash computer systems (including hardware RAID boxes). Never depend a conduit for ground, always have a separate ground wire. This is just one example of many issues he had, I don't recall all the details, and I am not an expert on building power. I know of 1 event that matches your most common failure. A PC with a $50 case and power supply, the power supply failed in such a way that it put 120V on the 12V and/or 5V line. Everything in the case was lost! Well, the heat sink was ok. :) The system was not repaired, it went into the trash. But this was a home user's clone PC, not a server. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Monday, January 03, 2005 6:32 AM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Guy <bugzilla@watkins-home.com> wrote: > "Also sprach Guy:" > > "Well, you can make somewhere. You only require an 8MB (one cylinder) > > partition." > > > > So, it is ok for your system to fail when this disk fails? 
> > You lose the journal, that's all. You can react with a simple tune2fs > -O ^journal or whatever is appropriate. And a journal is ONLY there in > order to protect you against crashes of the SYSTEM (not the disk), so > what was the point of having the journal in the first place? > > ** When you lose the journal, does the system continue without it? > ** Or does it require user intervention? I don't recall. It certainly at least puts itself into read-only mode (if that's the error mode specified via tune2fs). And the situation probably changes from version t version. On a side note, I don't know why you think user intervention is not required when a raid system dies. As a matter of liklihoods, I have never seen a disk die while IN a working soft (or hard) raid system, and the system continue working afterwards, instead the normal disaster sequence as I have experienced it is: 1) lightning strikes rails, or a/c goes out and room full of servers overheats. All lights go off. 2) when sysadmin arrives to sort out the smoking wrecks, he finds that 1 in 3 random disks are fried - they're simply the points of failure that died first, and they took down the hardware with them. 3) sysadmin buys or jury-rigs enough pieces of nonsmoking hardware to piece together the raid arrays from the surviving disks, and hastily does a copy to somewhere very safe and distant, while an assistant holds off howling hordes outside the door with a shutgun. In this scenario, a disk simply acts as the weakest link in a fuse chain, and the whole chain goes down. But despite my dramatisation it is likely that a hardware failure will take out or damage your hardware! Ide disks live on an electric bus conected to other hardware. Try a shortcircuit and see what happens. You can't even yank them out while the bus is operating if you want to keep your insurance policy. For scsi the situation is better wrt hot-swap, but still not perfect. And you have the electric connections also. That makes it likely that a real nasty hardware failure will do nasty things (tm) to whatever is in the same electric environment. It is possible if not likely that you will lose contact with scsi disks further along the bus, if you don't actually blow the controller. That said, there ARE situations which raid protects you from - simply a "gentle disconnect" (a totally failed disk that goes open circuit), or a "gradual failure" (a disk that runs out of spare sectors). In the latter case the raid will fail the disk completely at the first detected error, which may well be what you want (or may not be!). However, I don't see how you can expect to replace a failed disk without taking down the system. For that reason you are expected to be running "spare disks" that you can virtually insert hot into the array (caveat, it is possible with scsi, but you will need to rescan the bus, which will take it out of commission for some seconds, which may require you to take the bus offline first, and it MAY be possible with recent IDE buses that purport to support hotswap - I don't know). So I think the relevant question is: "what is it that you are protecting yourself from by this strategy of yours". When you have the scenario, you can evaluate risks. > > I don't want system failures when a disk fails, > > Your scenario seems to be that you have the disks of your mirror on the > same physical system. That's fundamentally dangerous. They're both > subject to damage when the system blows up. [ ... 
] I have an array > node (where the journal is kept), and a local mirror component and a > remote mirror component. > > That system is doubled, and each half of the double hosts the others > remote mirror component. Each half fails over to the other. > ** So, you have 2 systems, 1 fails and the "system" switches to the other. > ** I am not going for a 5 nines system. > ** I just don't want any down time if a disk fails. Well, (1) how likely is it that a disk will fail without taking down the system (2) how likely is it that a disk will fail (3) how likely is it that a whole system will fail I would say that (2) is about 10% per year. I would say that (3) is about 1200% per year. It is therefore difficult to calculate (1), which is your protection scenario, since it doesn't show up very often in the stats! > ** A disk failing is the most common failure a system can have (IMO). Not in my experience. See above. I'd say each disk has about a 10% failure expectation per year. Whereas I can guarantee that an unexpected system failure will occur about once a month, on every important system. If you think about it that is quite likely, since a system is by definition a complicated thing. And then it is subject to all kinds of horrible outside influences, like people rewiring the server room in order to reroute cables under the floor instead of through he ceiling, and the maintenance people spraying the building with insecticide, everywhere, or just "turning off the electricity in order to test it" (that happens about four times a year here - hey, I remember when they tested the giant UPS by turning off the electricity! Wrong switch. Bummer). Yes, you can try and keep these systems out of harms way on a colocation site, or something, but by then you are at professional level paranoia. For "home systems", whole system failures are far more common than disk failures. I am not saying that RAID is useless! Just the opposite. It is a useful and EASY way of allowing you to pick up the pieces when everything falls apart. In contrast, running a backup regime is DIFFICULT. > ** In a computer room with about 20 Unix systems, in 1 year I have seen 10 > or so disk failures and no other failures. Well, let's see. If each system has 2 disks, then that would be 25% per disk per year, which I would say indicates low quality IDE disks, but is about the level I would agree with as experiential. > ** Are your 2 systems in the same state? No, why should they be? > ** They should be at least 50 miles apart (at a minimum). They aren't - they are in two different rooms. Different systems copy them every day to somewhere else. I have no need for instant survivability across nuclear attack. > ** Otherwise if your data center blows up, your system is down! True. So what? I don't care. The cost of such a thing is zero, because if my data center goes I get a lot of insurance money and can retire. The data is already backed up elsewhere if anyone cares. > ** In my case, this is so rare, it is not an issue. It's very common here and everywhere else I know! Think about it - if your disks are in the same box then it is statistically likely that when one disks fails it is BECAUSE of some local cause, and that therefore the other disk will also be affected by it. It's your "if your date center burns down" reasoning, applied to your box. > ** Just use off-site tape backups. No way! I hate tapes. I backup to other disks. > ** My computer room is for development and testing, no customer access. 
Unfortunately, the admins do most of the sabotage. > ** If the data center is gone, the workers have nowhere to work anyway (in > my case). I agree. Therefore who cares. OTOH, if only the server room smokes out, they have plenty of places to work, but nothing to work on. Tut tut. > ** Some of our customers do have failover systems 50+ miles apart. Banks don't (hey, I wrote the interbank communications encryption software here on the peninsula) here. They have tapes. As far as I know, the tapes are sent to vaults. It often happens that their systems go down. In fact, I have NEVER managed to connect via the internet page to their systems at a time when they were in working order. And I have been trying on and off for about three years. And I often have been in the bank managers office (discussing mortgages, national debt, etc.) when the bank's internal systems have gone down, nationwide. > > so mirror (or RAID5) > > everything required to keep your system running. > > > > "And there is a risk of silent corruption on all raid systems - that is > > well known." > > I question this.... > > Why! > ** You lost me here. I did not make the above statement. Yes you did. You can see from the quoting that you did. > But, in the case > of RAID5, I believe it can occur. So do I. I am asking why you "question this"> > Your system crashes while a RAID5 stripe > is being written, but the stripe is not completely written. This is fairly meaningless. I don't now what precise meaning the word "stripe" has in raid5 but it's irrelevant. Simply, if you write redundant data, whatever way you write it, raid 1, 5, 6 or whatever, there is a possibility that you write only one of the copies before the system goes down. Then when the system comes up it has two different sources of data to choose to believe. > During the > re-sync, the parity will be adjusted, See. "when the system comes up ...". There is no need to go into detail and I don't know why you do! > but it may be more current than 1 or > more of the other disks. But this would be similar to what would happen to > a non-RAID disk (some data not written). No, it would not be similar. You don't seem to understand the mechanism. The mechanism for corruption is that there are two different versions of the data available when the system comes back up, and you and the raid system don't know which is more correct. Or even what it means to be "correct". Maybe the earlier written data is "correct"! > ** Also with RAID1 or RAID5, if corruption does occur without a crash or > re-boot, then a disk fails, the corrupt data will be copied to the > replacement disk. Exactly so. It's a generic problem with redundant data sources. You don't know which one to believe when they disagree! > With RAID1 a 50% risk of copying the corruption, and 50% > risk of correcting the corruption. With RAID5, risk % depends on the number > of disks in the array. It's the same. There are two sources of data that you can believe. The "real data on disk" or "all the other data blocks in the 'stripe' plus the parity block". You get to choose which you believe. > > I bet a non-mirror disk has similar risk as a RAID1. But with a RAID1, you > > The corruption risk is doubled for a 2-way mirror, and there is a 50% > chance of it not being detected at all even if you try and check for it, > because you may be reading from the wrong mirror at the time you pass > over the imperfection in the check. > ** After a crash, md will re-sync the array. It doesn't know which disk to believe is correct. 
There is a stamp on the disks' superblocks, but it is only updated every so often. If the whole system dies while both disks are OK, I don't know what will be stamped or what will happen (which will be believed) at resync. I suspect it is random. I would appreciate clarification from Neil. > ** But during the re-sync, md could be checking for differences and > reporting them. It could. That might be helpful. > ** It won't help correct anything, but it could explain why you may be > having problems with your data. Indeed, it sounds like a good idea. It could slow down RAID1 resync, but I don't think the impact on RAID5 would be noticeable. > ** Since md re-syncs after a crash, I don't think the risk is double. That is not germane. I already pointed out that you are 50% likely to copy the "wrong" data IF you copy (and WHEN you copy). Actually doing the copy merely brings that calculation into play at the moment of the resync, instead of later, at the moment when one of the two disks actually dies and you have to use the remaining one. > Isn't that simply the most naive calculation? So why would you make > your bet? > ** I don't understand this. Evidently! :) > And then of course you don't generally check at all, ever. > ** True, but I would like md to report when a mirror is wrong. > ** Or a RAID5 parity is wrong. Software raid does not spin off threads randomly checking data. If you don't use it, you don't get to check at all. So just leaving disks sitting there exposes them to corruption that is checked least of all. > But whether you check or not, corruptions simply have only a 50% chance > of being seen (you look on the wrong mirror when you look), and a 200% > chance of occurring (twice as much real estate) wrt normal rate. > ** Since md re-syncs after a crash, I don't think the risk is double. It remains double whatever you think. The question is whether you detect it or not. You cannot detect it without checking. > ** Also, I don't think most corruption would be detectable (ignoring a RAID > problem). You wouldn't know which disk was right. The disk might know, if it was a hardware problem. Incidentally, I wish raid would NOT offline the disk when it detects a read error. It should fall back to the redundant data. I may submit a patch for that. In 2.6 the raid system may even do that. The resync thread comments SAY that it retries reads. I don't know if it actually does. Neil? > ** It depends on the type of data. > ** Example: Your MP3 collection would go undetected until someone listened > to the corrupt file. :-). > In contrast, on a single disk they have a 100% chance of detection (if > you look!) and a 100% chance of occurring, wrt normal rate. > ** Are you talking about the disk drive detecting the error? No. You are quite right. I should categorise the types of error more precisely. We want to distinguish 1) hard errors (detectable by the disk firmware) 2) soft errors (not detected by the above) > ** If so, are you referring to a read error or what? Read. > ** Please explain the nature of the detectable error. "wrong data on the disk or as read from the disk". Define "wrong"! > > know when a difference occurs, if you want. > > How? > ** Compare the 2 halves of the RAID1, or check the parity of RAID5. You wouldn't necessarily know which of the two data sources was "correct".
Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
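Guy's idea of comparing the two halves of a RAID1 can be tried from user space with nothing more than raw reads of the component devices. The sketch below is only illustrative: the device names are hypothetical, it assumes the array is stopped or mounted read-only so the comparison does not race with writes, and, as Peter says, a mismatch only tells you the copies disagree, not which one is correct. (The md superblock near the end of each component may legitimately differ, so trailing mismatches are to be expected.)

  #!/usr/bin/env python
  # Compare two RAID1 component devices chunk by chunk and report where they
  # differ. Illustrative sketch only: run it on a quiescent array, and note
  # that a reported mismatch does not say which copy holds the "right" data.
  import sys

  CHUNK = 64 * 1024

  def compare(dev_a, dev_b):
      mismatches = 0
      with open(dev_a, "rb") as a, open(dev_b, "rb") as b:
          offset = 0
          while True:
              block_a = a.read(CHUNK)
              block_b = b.read(CHUNK)
              if not block_a and not block_b:
                  break
              if block_a != block_b:
                  mismatches += 1
                  print("mismatch in the chunk at byte offset %d" % offset)
              offset += CHUNK
      return mismatches

  if __name__ == "__main__":
      # usage (hypothetical component names): compare-mirrors.py /dev/hda1 /dev/hdc1
      sys.exit(1 if compare(sys.argv[1], sys.argv[2]) else 0)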
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 11:31 ` Peter T. Breuer 2005-01-03 17:34 ` Guy @ 2005-01-03 17:46 ` maarten 2005-01-03 19:52 ` maarten ` (2 more replies) 1 sibling, 3 replies; 172+ messages in thread From: maarten @ 2005-01-03 17:46 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 12:31, Peter T. Breuer wrote: > Guy <bugzilla@watkins-home.com> wrote: > > "Also sprach Guy:" > 1) lightning strikes rails, or a/c goes out and room full of servers > overheats. All lights go off. > > 2) when sysadmin arrives to sort out the smoking wrecks, he finds > that 1 in 3 random disks are fried - they're simply the points > of failure that died first, and they took down the hardware with > them. > > 3) sysadmin buys or jury-rigs enough pieces of nonsmoking hardware > to piece together the raid arrays from the surviving disks, and > hastily does a copy to somewhere very safe and distant, while > an assistant holds off howling hordes outside the door with > a shutgun. > > In this scenario, a disk simply acts as the weakest link in a fuse > chain, and the whole chain goes down. But despite my dramatisation it > is likely that a hardware failure will take out or damage your hardware! > Ide disks live on an electric bus conected to other hardware. Try a > shortcircuit and see what happens. You can't even yank them out while > the bus is operating if you want to keep your insurance policy. The chance of a PSU blowing up or lightning striking is, reasonably, much less than an isolated disk failure. If this simple fact is not true for you personally, you really ought to reevaluate the quality of your PSU (et al) and / or the buildings' defenses against a lightning strike... > However, I don't see how you can expect to replace a failed disk > without taking down the system. For that reason you are expected to be > running "spare disks" that you can virtually insert hot into the array > (caveat, it is possible with scsi, but you will need to rescan the bus, > which will take it out of commission for some seconds, which may > require you to take the bus offline first, and it MAY be possible with > recent IDE buses that purport to support hotswap - I don't know). I think the point is not what actions one has to take at time T+1 to replace the disk, but rather whether at time T, when the failure first occurs, the system survives the failure or not. > (1) how likely is it that a disk will fail without taking down the system > (2) how likely is it that a disk will fail > (3) how likely is it that a whole system will fail > > I would say that (2) is about 10% per year. I would say that (3) is > about 1200% per year. It is therefore difficult to calculate (1), which > is your protection scenario, since it doesn't show up very often in the > stats! I don't understand your math. For one, percentage is measured from 0 to 100, not from 0 to 1200. What is that, 12 twelve times 'absolute certainty' that something will occur ? But besides that, I'd wager that from your list number (3) has, by far, the smallest chance of occurring. Choosing between (1) and (2) is more difficult, my experiences with IDE disks are definitely that it will take the system down, but that is very biased since I always used non-mirrored swap. I sure can understand a system dying if it loses part of its memory... > > ** A disk failing is the most common failure a system can have (IMO). I fully agree. > Not in my experience. See above. 
I'd say each disk has about a 10% > failure expectation per year. Whereas I can guarantee that an > unexpected system failure will occur about once a month, on every > important system. Whoa ! What are you running, windows perhaps ?!? ;-) No but seriously, joking aside, you have 12 system failures per year ? I would not be alone in thinking that figure is VERY high. My uptimes generally are in the three-digit range, and most *certainly* not in the low 2-digit range. > If you think about it that is quite likely, since a system is by > definition a complicated thing. And then it is subject to all kinds of > horrible outside influences, like people rewiring the server room in > order to reroute cables under the floor instead of through he ceiling, > and the maintenance people spraying the building with insecticide, > everywhere, or just "turning off the electricity in order to test it" > (that happens about four times a year here - hey, I remember when they > tested the giant UPS by turning off the electricity! Wrong switch. > Bummer). If you have building maintenance people and other random staff that can access your server room unattended and unmonitored, you have far worse problems than making decicions about raid lavels. IMNSHO. By your description you could almost be the guy the joke with the recurring 7 o'clock system crash is about (where the cleaning lady unplugs the server every morning in order to plug in her vacuum cleaner) ;-) > Yes, you can try and keep these systems out of harms way on a > colocation site, or something, but by then you are at professional > level paranoia. For "home systems", whole system failures are far more > common than disk failures. Don't agree. Not only do disk failures occur more often than full system failures, disk failures are also much more time-consuming to recover from. Compare changing a system board or PSU with changing a drive and finding, copying and verifying a backup (if you even have one that's 100% up to date) > > ** In a computer room with about 20 Unix systems, in 1 year I have seen > > 10 or so disk failures and no other failures. > > Well, let's see. If each system has 2 disks, then that would be 25% per > disk per year, which I would say indicates low quality IDE disks, but > is about the level I would agree with as experiential. The point here was, disk failures being more common than other failures... > No way! I hate tapes. I backup to other disks. Then for your sake, I hope they're kept offline, in a safe. > > ** My computer room is for development and testing, no customer access. > > Unfortunately, the admins do most of the sabotage. Change admins. I could understand an admin making typing errors and such, but then again that would not usually lead to a total system failure. Some daemon not working, sure. Good admins review or test their changes, for one thing, and in most cases any such mistake is rectified much simpler and faster than a failed disk anyway. Except maybe for lilo errors with no boot media available. ;-\ > Yes you did. You can see from the quoting that you did. Or the quoting got messed up. That is known to happen in threads. > > but it may be more current than 1 or > > more of the other disks. But this would be similar to what would happen > > to a non-RAID disk (some data not written). > > No, it would not be similar. You don't seem to understand the > mechanism. 
The mechanism for corruption is that there are two different > versions of the data available when the system comes back up, and you > and the raid system don't know which is more correct. Or even what it > means to be "correct". Maybe the earlier written data is "correct"! That is not the whole truth. To be fair, the mechanism works like this: With raid, you have a 50% chance the wrong, corrupted, data is used. Without raid, thus only having a single disk, the chance of using the corrupted data is 100% (obviously, since there is only one source) Or, much more elaborate: Let's assume the chance of a disk corruption occurring is 50%, ie. 0.5 With raid, you always have a 50% chance of reading faultty data IF one of the drives holds faulty data. For the drives itself, the chance of both disks being wrong is 0.5x0.5=0.25(scenario A). Similarly, 25 % chance both disks are good (scenario B). The chance of one of the disks being wrong is 50% (scenarios C & D together). In scenarios A & B the outcome is certain. In scenarios C & D the chance of the raid choosing the false mirror is 50%. Accumulating those chances one can say that the chance of reading false data is: in scenario A: 100% in scenario B: 0% scenario C: 50% scenario D: 50% Doing the math, the outcome is still (200% divided by four)= 50%. Ergo: the same as with a single disk. No change. > > In contrast, on a single disk they have a 100% chance of detection (if > > you look!) and a 100% chance of occuring, wrt normal rate. > > ** Are you talking about the disk drive detecting the error? No, you have a zero chance of detection, since there is nothing to compare TO. Raid-1 at least gives you a 50/50 chance to choose the right data. With a single disk, the chance of reusing the corrupted data is 100% and there is no mechanism to detect the odd 'tumbled bit' at all. > > How? > > ** Compare the 2 halves or the RAID1, or check the parity of RAID5. > > You wouldn't necesarily know which of the two data sources was > "correct". No, but you have a theoretical choice, and a 50% chance of being right. Not so without raid, where you get no choice, and a 100% chance of getting the wrong data, in the case of a corruption. Maarten -- ^ permalink raw reply [flat|nested] 172+ messages in thread
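The 50% argument above is easy to sanity-check numerically. The following sketch implements exactly the model stated in the post, namely that each copy is independently corrupt with probability p and that a read is served from one copy chosen uniformly at random; it is not a claim about how md actually balances reads.

  # Monte Carlo check of the two-way mirror model described above.
  import random

  def read_is_bad(p, n_disks=2):
      disks = [random.random() < p for _ in range(n_disks)]  # True = this copy is corrupt
      return disks[random.randrange(n_disks)]                # read one copy chosen at random

  def estimate(p, n_disks=2, trials=200000):
      hits = sum(read_is_bad(p, n_disks) for _ in range(trials))
      return hits / float(trials)

  if __name__ == "__main__":
      # With the 50% corruption figure used in the post, both estimates come out
      # near 0.5, i.e. the mirror behaves like a single disk under this model.
      print(estimate(0.5, n_disks=2))
      print(estimate(0.5, n_disks=1))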
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 17:46 ` maarten @ 2005-01-03 19:52 ` maarten 2005-01-03 20:41 ` Peter T. Breuer 2005-01-03 20:22 ` Peter T. Breuer 2005-01-03 21:36 ` Guy 2 siblings, 1 reply; 172+ messages in thread From: maarten @ 2005-01-03 19:52 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 18:46, maarten wrote: > On Monday 03 January 2005 12:31, Peter T. Breuer wrote: > > Guy <bugzilla@watkins-home.com> wrote: > > Doing the math, the outcome is still (200% divided by four)= 50%. > Ergo: the same as with a single disk. No change. Just for laughs, I calculated this chance also for a three-way raid-1 setup using a lower 'failure possibility' percentage. The outcome does not change. The (statistically higher) chance of a disk failing is exactly offset by the greater likelihood that the raid system chooses one of the good drives to read from. (Obviously this is only valid for raid level 1, not for level 5 or others.) Let us (randomly) assume there is a 10% chance of a disk failure. We use three raid-1 disks, numbered 1 through 3. We therefore have eight possible scenarios: A: disk1 fail, disk2 good, disk3 good. B: disk1 good, disk2 fail, disk3 good. C: disk1 good, disk2 good, disk3 fail. D: disk1 fail, disk2 fail, disk3 good. E: disk1 fail, disk2 good, disk3 fail. F: disk1 good, disk2 fail, disk3 fail. G: disk1 fail, disk2 fail, disk3 fail. H: disk1 good, disk2 good, disk3 good. Scenarios A, B and C are similar (one disk failed). Scenarios D, E and F are also similar (two disk failures). Scenarios G and H are special; the chances of those occurring are calculated separately. H: the chance of all good disks is (0.9x0.9x0.9) = 0.729 G: the chance of all disks bad is (0.1x0.1x0.1) = 0.001 The chance of A, B or C (one bad disk) is (0.9x0.9x0.1) = 0.081 The chance of D, E or F (two bad disks) is (0.9x0.1x0.1) = 0.009 The chance of (A, B or C) and of (D, E or F) occurring must be multiplied by three as there are three scenarios each. So this becomes: The chance of one bad disk is = 0.243 The chance of two bad disks is = 0.027 Now let's see. It is certain that the raid subsystem will read the good data in H. The chance of that in scenario G is zero. The chance in (A, B or C) is two-thirds. And for D, E or F the chance of the raid system getting the good data is one-third. Let's calculate all this. [ABC] x 0.667 = 0.243 x 0.667 = 0.162 [DEF] x 0.333 = 0.027 x 0.333 = 0.009 [G] x 0 = 0.0 [H] x 1.0 = 0.729 (total added up is 0.9) Conversely, the chance of reading the BAD data: [ABC] x 0.333 = 0.243 x 0.333 = 0.081 [DEF] x 0.667 = 0.027 x 0.667 = 0.018 [G] x 1.0 = 0.001 [H] x 0.0 = 0.0 (total added up is 0.1) Which, again, is exactly the same chance that a single disk will get corrupted, which we assumed above to be 10%. Ergo, using raid-1 does not make the risk of bad data creeping in any worse. Nor does it make it better either. Maarten ^ permalink raw reply [flat|nested] 172+ messages in thread
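The same enumeration can be done exactly for any number of mirrors. The snippet below just re-does the arithmetic above, weighting every good/bad pattern by its probability and assuming the read lands on each copy with equal likelihood, and it reproduces the 0.1 total for both a single disk and a three-way mirror. Again, this is only the model used in the post, not a statement about md's real read balancing.

  # Exact enumeration of the n-way mirror model described above.
  from itertools import product

  def p_read_bad(p, n_disks):
      total = 0.0
      for pattern in product([False, True], repeat=n_disks):    # True = corrupt copy
          weight = 1.0
          for bad in pattern:
              weight *= p if bad else (1.0 - p)
          # chance the randomly chosen copy is one of the corrupt ones
          total += weight * (float(sum(pattern)) / n_disks)
      return total

  if __name__ == "__main__":
      print(p_read_bad(0.10, 1))   # 0.1 - single disk
      print(p_read_bad(0.10, 3))   # 0.1 - three-way mirror, same under this model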
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 19:52 ` maarten @ 2005-01-03 20:41 ` Peter T. Breuer 2005-01-03 23:19 ` Peter T. Breuer 2005-01-04 0:45 ` maarten 0 siblings, 2 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-03 20:41 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > Just for laughs, I calculated this chance also for a three-way raid-1 setup There's no need for you to do this - your calculations are unfortunately not meaningful. > Let us (randomly) assume there is a 10% chance of a disk failure. No, call it "p". That is the correct name. And I presume you mean "an error", not "a failure". > We therefore have eight possible scenarios: Oh, puhleeeeze. Infantile arithmetic instead of elementary probabilistic algebra is not something I wish to suffer through ... > A > disk1 fail > disk2 good > disk3 good ... > H > disk1 good > disk2 good > disk3 good Was that all? 8 was it? 1 all good, 3 with one good, 3 with two good, 1 with all fail? Have we got the binomial theorem now! > Scenarios A, B and C are similar (one disk failed). Hoot. 3p. > Scenario's D, E and F are > also similar (two disk failures). There is no need for you to consider these scenarios. The probability is 3p^2, which is tiny. Forget it. (actually 3p^2(1-p), but forget the cube term). > Scenarios G and H are special, the chances > of that occurring are calculated seperately. No, they are NOT special. one of them is the chance that everything is OK, which is (1-p)^3, or approx 1-3p (surprise surprise). The other is the completely forgetable probaility p^3 that all three are bad at that spot. > H: the chance of all good disks is (0.9x0.9x0.9) = 0.729 Surprisingly enough, 1-3p, even though you have such improbably large probability p as to make the approximation only approximate! Puhleeze. This is excruciatingly poor baby math! > G: the chance of all disks bad is (0.1x0.1x0.1) = 0.001 Surprise. p^3. > The chance of A, B or C (one bad disk) is (0.9x0.9x0.1) = 0.081 > The chance of D, E or F (two bad disks) is (0.9x0.1x0.1) = 0.009 > > The chance of (A, B or C) and (D, E or F) occurring must be multiplied by > three as there are three scenarios each. So this becomes: > The chance of one bad disk is = 0.243 > The chance of two bad disks is = 0.027 Surprise, surprise. 3p and 3p^2(1-p) (well, call it 3p^2). > Now let's see. It is certain that the raid subsystem will read the good data > in H. The chance of that in scenario G is zero. The chance in (A, B or C) is > two-thirds. And for D, E or F the chance the raid system getting the good > data is one-third. > > Let's calculate all this. > [ABC] x 0.667 = 0.243 x 0.667 = 0.162 > [DEF] x 0.333 = 0.027 x 0.333 = 0.008 > [G] x 0 = 0.0 > [H] x 1.0 = 0.729 > > (total added up is 0.9) The chance of reading good data is 1 (1-3p) + 2/3 3p or Or approx 1-p. Probably exactly so, were I to do the calculation exactly, which I won't. > Conversely, the chance of reading the BAD data: > [ABC] x 0.333 = 0.243 x 0.333 = 0.081 > [DEF] x 0.667 = 0.027 x 0.667 = 0.018 > [G] x 1.0 = 0.001 > [H] x 0.0 = 0.0 > > (total added up is 0.1) It should be p! It is one minus your previous result. SIgh ... 0 (1-3p) + 1/3 3p = p > Which, again, is exactly the same chance a single disk will get corrupted, as > we assumed above in line one is 10%. Ergo, using raid-1 does not make the > risks of bad data creeping in any worse. Nor does it make it better either. All false. And baby false at that. Annoying! 
Look, the chance of an undetected detectable failure occurring is 0 x (1-3p) + 2/3 x 3p = 2p, and it grows with the number n of disks, as you may expect, being proportional to n-1. With one disk, it is zero. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 20:41 ` Peter T. Breuer @ 2005-01-03 23:19 ` Peter T. Breuer 2005-01-03 23:46 ` Neil Brown 2005-01-04 0:45 ` maarten 1 sibling, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-03 23:19 UTC (permalink / raw) To: linux-raid Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > No, call it "p". That is the correct name. And I presume you mean "an > error", not "a failure". I'll do this thoroughly, so you can see how it goes. Let p = probability of a detectable error occurring on a disk in a unit time, and p' = probability of an undetectable error occurring on a disk in a unit time. Then the probability of an error occurring UNdetected on an n-disk raid array is (n-1)p + np' and on a 1-disk system (a 1-disk raid array :) it is p' OK? (hey, I'm a mathematician, it's obvious to me). Exercise .. calculate the effect of majority voting! Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
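One way to read the formula above is: a detectable error on one of the n copies (rate roughly np) is missed whenever the read happens to land on one of the other n-1 copies, giving (n-1)p, while an undetectable error (rate roughly np') is always missed. The sketch below compares that linear form with an exact enumeration under this interpretation; the interpretation is mine, and, per Neil's objection in the next message, the linear form is only meaningful when p and p' are small enough that it stays well below 1.

  # Exact vs. first-order comparison of the undetected-error probability, under
  # one reading of the model above: each of n copies independently picks up a
  # detectable error with probability p and an undetectable one with p' per
  # unit time, a read is served from one copy chosen at random, and the error
  # state is "detected" only if the copy that gets read carries a detectable error.
  from itertools import product

  def exact_undetected(p, pp, n):
      total = 0.0
      for det in product([0, 1], repeat=n):          # detectable error per copy
          for undet in product([0, 1], repeat=n):    # undetectable error per copy
              w = 1.0
              for d in det:
                  w *= p if d else 1.0 - p
              for u in undet:
                  w *= pp if u else 1.0 - pp
              if not (any(det) or any(undet)):
                  continue                           # nothing went wrong at all
              # probability the randomly chosen copy shows no detectable error
              missed = sum(1.0 / n for i in range(n) if not det[i])
              total += w * missed
      return total

  def linear(p, pp, n):
      return (n - 1) * p + n * pp

  if __name__ == "__main__":
      # For small p, p' the two agree closely; for large values they diverge.
      for p, pp in [(1e-4, 1e-5), (1e-2, 1e-3)]:
          print(p, pp, exact_undetected(p, pp, 2), linear(p, pp, 2))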
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 23:19 ` Peter T. Breuer @ 2005-01-03 23:46 ` Neil Brown 2005-01-04 0:28 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: Neil Brown @ 2005-01-03 23:46 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > > No, call it "p". That is the correct name. And I presume you mean "an > > error", not "a failure". > > I'll do this thoroughly, so you can see how it goes. > > Let > > p = probability of a detectable error occurring on a disk in a unit time > p'= ................ undetectable ..................................... > > Then the probability of an error occurring UNdetected on an n-disk raid > array is > > (n-1)p + np' > > and on a 1-disk system (a 1-disk raid array :) it is > > p' > > OK? (hey, I'm a mathematician, it's obvious to me). It may be obvious, but it is also wrong. But then probability is, I think, the branch of mathematics that has the highest ratio of people who think they understand it to people who actually do (witness the success of lotteries). The probability of an event occurring lies between 0 and 1 inclusive. You have given a formula for a probability which could clearly evaluate to a number greater than 1. So it must be wrong. You have also been very sloppy in your language, or your definitions. What do you mean by a "detectable error occurring"? Is it a bit getting flipped on the media, or the drive detecting a CRC error during read? And what is your scenario for an undetectable error happening? My understanding of drive technology and CRCs suggests that undetectable errors don't happen without some sort of very subtle hardware error, or high level software error (i.e. the wrong data was written - and that doesn't really count). NeilBrown ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 23:46 ` Neil Brown @ 2005-01-04 0:28 ` Peter T. Breuer 2005-01-04 1:18 ` Alvin Oga 2005-01-04 2:07 ` Neil Brown 0 siblings, 2 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 0:28 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > Peter T. Breuer <ptb@lab.it.uc3m.es> wrote: > > > No, call it "p". That is the correct name. And I presume you mean "an > > > error", not "a failure". > > > > I'll do this thoroughly, so you can see how it goes. > > > > Let > > > > p = probability of a detectible error occuring on a disk in a unit time > > p'= ................ indetectible ..................................... > > > > Then the probability of an error occuring UNdetected on a n-disk raid > > array is > > > > (n-1)p + np' > > > > and on a 1 disk system (a 1-disk raid array :) it is > > > > p' > > > > OK? (hey, I'm a mathematician, it's obvious to me). > > It may be obvious, but it is also wrong. No, it's quite correct. > But then probability is, I > think, the branch of mathematics that has the highest ratio of people > who think that understand it to people to actually do (witness the > success of lotteries). Possibly. But not all of them teach probability at university level (and did so when they were 21, at the University of Cambridge to boot, and continued teaching pure math there at all subjects and all levels until the age of twenty-eight - so puhleeeze don't bother!). > The probability of an event occurring lies between 0 and 1 inclusive. > You have given a formula for a probability which could clearly evaluate > to a number greater than 1. So it must be wrong. The hypothesis here is that p is vanishingly small. I.e. this is a Poisson distribution - the analysis assumes that only one event can occcur per unit time. Take the unit too be one second if you like. Does that make it true enough for you? Poisson distros are pre-A level math. > You have also been very sloppy in your language, or your definitions. > What do you mean by a "detectable error occurring"? I mean an error occurs that can be detected (by the experiment you run, which is prsumably an fsck, but I don't presume to dictate to you). > Is it a bit > getting flipped on the media, or the drive detecting a CRC error > during read? I don't know. It's whatever your test can detect. You can tell me! > And what is your senario for an undetectable error happening? Likewise, I don't know. It's whatever error your experiment (presumably an fsck) will miss. > My > understanding of drive technology and CRCs suggests that undetectable > errors don't happen without some sort of very subtle hardware error, They happen all the time - just write a 1 to disk A and a zero to disk B in the middle of the data in some file, and you will have an undetectible error (vis a vis your experimental observation, which is presumably an fsck). > or high level software error (i.e. the wrong data was written - and > that doesn't really count). It counts just fine, since it's what does happen :- consider a system crash that happens AFTER one of a pair of writes to the two disk components has completed, but BEFORE the second has completed. Then on reboot your experiment (an fsck) has the task of finding the error (which exists at least as a discrepency between the two disks), if it can, and shouting at you about it. 
All I am saying is that the error is either detectable by your experiment (the fsck), or not. If it IS detectable, then there is a 50% chance that it WON'T be detected, even though it COULD be detected, because the system unfortunately chose to read the wrong disk at that moment. However, the error is twice as likely as with only one disk, whatever it is (you can argue about the real multiplier, but it is about that). And if it is not detectable, it's still twice as likely as with one disk, for the same reason - more real estate for it to happen on. This is just elementary operational research! Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 0:28 ` Peter T. Breuer @ 2005-01-04 1:18 ` Alvin Oga 2005-01-04 4:29 ` Neil Brown 2005-01-04 2:07 ` Neil Brown 1 sibling, 1 reply; 172+ messages in thread From: Alvin Oga @ 2005-01-04 1:18 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tue, 4 Jan 2005, Peter T. Breuer wrote: > Neil Brown <neilb@cse.unsw.edu.au> wrote: > > > Let > > > > > > p = probability of a detectible error occuring on a disk in a unit time > > > p'= ................ indetectible ..................................... > > > i think the definitions and modes of failures is what each reader is interpretting from their perspective ?? > > think, the branch of mathematics that has the highest ratio of people > > who think that understand it to people to actually do (witness the > > success of lotteries). ahh ... but the stock market is the worlds largest casino > Possibly. But not all of them teach probability at university level > (and did so when they were 21, at the University of Cambridge to boot, > and continued teaching pure math there at all subjects and all levels > until the age of twenty-eight - so puhleeeze don't bother!). :-) > I mean an error occurs that can be detected (by the experiment you run, > which is prsumably an fsck, but I don't presume to dictate to you). or more simply, the disk doesnt work .. what you write is not what you get back ?? - below that level, there'd be crc errors, some fixable some not - below that, there'd be disk controller problems with bad block mapping and temperature sensitive failures - below that ... flaky heads and platters and oxide .. > > Is it a bit > > getting flipped on the media, or the drive detecting a CRC error > > during read? different error conditions ... - bit flipping is trivially fixed ... and the user probasbly doesnt know about it - crc error of 1 bit error or 2 bit error or burst errors ( all are different crc errors and ecc problems ) > I don't know. It's whatever your test can detect. You can tell me! i think most people only care about ... can we read the "right data" back some time later after we had previously written it "supposedly correctly" > > And what is your senario for an undetectable error happening? there's lots of undetectable errors ... there's lots of detectable errors that was fixed, so that the user doesnt know abut the underlying errors > Likewise, I don't know. It's whatever error your experiment > (presumably an fsck) will miss. fsck is too high a level to be worried about errors... - it assume the disk is workiing fine and fsck fixes the filesystem inodes and doesnt worry about "disk errors" > > My > > understanding of drive technology and CRCs suggests that undetectable > > errors don't happen without some sort of very subtle hardware error, some crc ( ecc ) will fix it ... some errors are Not fixable "crc" is not used too much ... ecc is used ... > > > or high level software error (i.e. the wrong data was written - and > > that doesn't really count). > > It counts just fine, since it's what does happen :- consider a system > crash that happens AFTER one of a pair of writes to the two disk > components has completed, but BEFORE the second has completed. Then on > reboot your experiment (an fsck) has the task of finding the error > (which exists at least as a discrepency between the two disks), if it > can, and shouting at you about it. a common problem ... 
that data is partially written during a crash very hard to fix .. without knowing what the data should have been > All I am saying is that the error is either detectible by your > experiment (the fsck), or not. or detectable/undetectable/fixable by other "methods" If it IS detectible, then there > is a 50% chance that it WON'T be deetcted, that'd depend on what the failure mode was .. > even though it COULD be > detected, because the system unfortunately chose to read the wrong > disk at that moment. the assumption is that if one writes data ... that the crc/ecc is written somewhere else that is correct or vice versa, but both could be written wrong > And if it is not detectible, it's still twice as likely as with one > disk, for the same reason - more real estate for it to happen on. more "(disk) real estate" increases the places where errors can occur ... but todays, disk drives is lots lots better than the old days and todays dd copying of disk might work, but doing dd on old disks w/ bad oxides will create lots of problems ... == == fun stuff ... how do you make your data more secure ... == and reliable == c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
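To make the detectable/undetectable distinction discussed above concrete, here is a tiny illustration using CRC-32 as a stand-in for the much stronger per-sector ECC a real drive uses: a bit flipped after the write fails the checksum and is detectable, while a sector whose contents were simply written wrong in the first place checks out fine, and only a comparison against redundant copies (or some higher-level check) could ever notice it.

  # CRC-32 as a stand-in for drive ECC: flipped bits are caught, "consistent
  # but wrong" data is not. Purely illustrative of the distinction in the thread.
  import zlib

  sector = b"A" * 512
  stored_crc = zlib.crc32(sector) & 0xffffffff

  # Case 1: the media flips a bit after the write -> checksum mismatch, detectable.
  flipped = bytearray(sector)
  flipped[100] ^= 0x01
  print(zlib.crc32(bytes(flipped)) & 0xffffffff == stored_crc)     # False: detected

  # Case 2: the wrong data was written in the first place -> the drive checksums
  # the wrong contents, so a later read verifies cleanly (undetectable at that
  # level); only comparing redundant copies could notice, and even then you do
  # not know which copy is right.
  wrong = b"B" * 512
  stored_crc_wrong = zlib.crc32(wrong) & 0xffffffff
  print(zlib.crc32(wrong) & 0xffffffff == stored_crc_wrong)        # True: looks fine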
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 1:18 ` Alvin Oga @ 2005-01-04 4:29 ` Neil Brown 2005-01-04 8:43 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: Neil Brown @ 2005-01-04 4:29 UTC (permalink / raw) To: Alvin Oga; +Cc: Peter T. Breuer, linux-raid On Monday January 3, aoga@ns.Linux-Consulting.com wrote: > > > > think, the branch of mathematics that has the highest ratio of people > > > who think they understand it to people who actually do (witness the > > > success of lotteries). > > ahh ... but the stock market is the world's largest casino and how many people do you know who make money on stock markets? Now compare that with how many lose money on lotteries. Find out the ratio and ..... > > > Possibly. But not all of them teach probability at university level > > (and did so when they were 21, at the University of Cambridge to boot, > > and continued teaching pure math there at all subjects and all levels > > until the age of twenty-eight - so puhleeeze don't bother!). Apparently teaching probability at University doesn't necessarily mean that you understand it. I cannot comment on your understanding, but if you ask google about the Monty Hall problem and include search terms like "professor" or "maths department" you will find plenty of (reported) cases of University staff not getting it. e.g. http://www25.brinkster.com/ranmath/marlright/montynyt.htm "Our math department had a good, self-righteous laugh at your expense," wrote Mary Jane Still, a professor at Palm Beach Junior College. Robert Sachs, a professor of mathematics at George Mason University in Fairfax, Va., expressed the prevailing view that there was no reason to switch doors. They were both wrong. NeilBrown ^ permalink raw reply [flat|nested] 172+ messages in thread
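For anyone who would rather check the cited result than take it on authority, the Monty Hall setup is small enough to simulate in a few lines. This has nothing to do with RAID; it is only the example referred to above (the host always opens a non-chosen door hiding a goat, and switching wins about two times in three).

  # Quick simulation of the Monty Hall problem referenced above.
  import random

  def play(switch, trials=100000):
      wins = 0
      for _ in range(trials):
          car = random.randrange(3)
          pick = random.randrange(3)
          # the host opens a door that is neither the pick nor the car
          opened = next(d for d in range(3) if d != pick and d != car)
          if switch:
              pick = next(d for d in range(3) if d != pick and d != opened)
          wins += (pick == car)
      return wins / float(trials)

  if __name__ == "__main__":
      print("stay:  ", play(False))   # about 1/3
      print("switch:", play(True))    # about 2/3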
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 4:29 ` Neil Brown @ 2005-01-04 8:43 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 8:43 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Monday January 3, aoga@ns.Linux-Consulting.com wrote: > > > > > > think, the branch of mathematics that has the highest ratio of people > > > > who think that understand it to people to actually do (witness the > > > > success of lotteries). > > > > ahh ... but the stock market is the worlds largest casino > > and how many people do you know who make money on stock markets. Ooh .. several mathematicians (pay is not very high!). > Now compare that with how many loose money on lotteries. I don't have to - I wouldn't place money in a lottery. The expected gain is negative whatever you do. I stick to investments where I have an expectation of a positive gain with at least some strategy. Mind you, as Conway often said, statistics don't apply to improbable events. So you should bet on anything which is not likely to occur more than once or twice a lifetime (theory - if you win, just don't try it again; if you die first, well, you won't care). Stick to someting more certain, like blackjack, if you want to make $. > Apparently teaching probability at University doesn't necessary mean > that you understand it. Perhaps the problem is at your end? > I cannot comment on your understanding, but > if you ask google about the Monty Hall problem and include search > terms like "professor" or "maths department" you will find plenty of > (reported) cases of University staff not getting it. Who cares? It's easy to concoct problems that a person will get wrong if they answer according to intuition. I can do that trick easily on you! (or anyone). > They were both wrong. What are you trying to "prove"? There is no need to be insulting. Simply pick up the technical conversation! Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 0:28 ` Peter T. Breuer 2005-01-04 1:18 ` Alvin Oga @ 2005-01-04 2:07 ` Neil Brown 2005-01-04 2:16 ` Ewan Grantham 2005-01-04 9:40 ` Peter T. Breuer 1 sibling, 2 replies; 172+ messages in thread From: Neil Brown @ 2005-01-04 2:07 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > > Then the probability of an error occuring UNdetected on a n-disk raid > > > array is > > > > > > (n-1)p + np' > > > > > > The probability of an event occurring lies between 0 and 1 inclusive. > > You have given a formula for a probability which could clearly evaluate > > to a number greater than 1. So it must be wrong. > > The hypothesis here is that p is vanishingly small. I.e. this is a Poisson > distribution - the analysis assumes that only one event can occcur per > unit time. Take the unit too be one second if you like. Does that make > it true enough for you? Sorry, I didn't see any such hypothesis stated and I don't like to assUme. So what you are really saying is that: for sufficiently small p and p' (i.e. p-squared terms can be ignored) the probability of an error occurring undetected approximates (n-1)p + np' this may be true, but I'm still having trouble understanding what your p and p' really mean. > > You have also been very sloppy in your language, or your definitions. > > What do you mean by a "detectable error occurring"? > > I mean an error occurs that can be detected (by the experiment you run, > which is prsumably an fsck, but I don't presume to dictate to you). > The whole point of RAID is that fsck should NEVER see any error caused by drive failure. I think we have a major communication failure here, because I have no idea what sort of failure scenario you are imagining. > > Is it a bit > > getting flipped on the media, or the drive detecting a CRC error > > during read? > > I don't know. It's whatever your test can detect. You can tell me! > > > And what is your senario for an undetectable error happening? > > Likewise, I don't know. It's whatever error your experiment > (presumably an fsck) will miss. But 'fsck's primary purpose is not to detect errors on the disk. It is to repair a filesystem after an unclean shutdown. It can help out a bit after disk corruption, but usually disk corruption (apart from very minimal problems) causes fsck to fail to do anything useful. > > > My > > understanding of drive technology and CRCs suggests that undetectable > > errors don't happen without some sort of very subtle hardware error, > > They happen all the time - just write a 1 to disk A and a zero to disk > B in the middle of the data in some file, and you will have an > undetectible error (vis a vis your experimental observation, which is > presumably an fsck). But this doesn't happen. You *don't* write 1 to disk A and 0 to disk B. I admit that this can actually happen occasionally (but certainly not "all the time"). But when it does, there will be subsequent writes to both A and B with new, correct, data. During the intervening time that block will not be read from A or B. If there is a system crash before correct, consistent data is written, then on restart, disk B will not be read at all until disk A as been completely copied on it. So again, I fail to see your failure scenario. > > > or high level software error (i.e. the wrong data was written - and > > that doesn't really count). 
> > It counts just fine, since it's what does happen :- consider a system > crash that happens AFTER one of a pair of writes to the two disk > components has completed, but BEFORE the second has completed. Then on > reboot your experiment (an fsck) has the task of finding the error > (which exists at least as a discrepency between the two disks), if it > can, and shouting at you about it. No. RAID will not let you see that discrepancy, and will not let the discrepancy last any longer that it takes to read on drive and write the other. > > All I am saying is that the error is either detectible by your > experiment (the fsck), or not. If it IS detectible, then there > is a 50% chance that it WON'T be deetcted, even though it COULD be > detected, because the system unfortunately chose to read the wrong > disk at that moment. However, the error is twice as likely as with only > one disk, whatever it is (you can argue aboutthe real multiplier, but > it is about that). > > And if it is not detectible, it's still twice as likely as with one > disk, for the same reason - more real estate for it to happen on. Maybe I'm beginning to understand your failure scenario. It involves different data being written to the drives. Correct? That only happens if: 1/ there is a software error 2/ there is an admin error You seem to be saying that if this happens, then raid is less reliable than non-raid. There may be some truth in this, but it is irrelevant. The likelyhood of such a software error or admin error happening on a well-managed machine is substantially less than the likelyhood of a drive media error, and raid will protect from drive media errors. So using raid might reduce reliability in a tiny number of cases, but will increase it substantially in a vastly greater number of cases. NeilBrown ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:07 ` Neil Brown @ 2005-01-04 2:16 ` Ewan Grantham 2005-01-04 2:22 ` Neil Brown 2005-01-04 9:40 ` Peter T. Breuer 1 sibling, 1 reply; 172+ messages in thread From: Ewan Grantham @ 2005-01-04 2:16 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid I are confused... Which perhaps should be a lesson to a slightly knowledgeable user not to read a thread like this. But having given myself a headache trying to figure this all out, I guess I'll just go ahead and ask directly. I've set up a RAID-5 array using two internal 250 Gig HDs and two external 250 Gig HDs through a USB-2 interface. Each of the externals is on its own card, and the internals are on separate IDE channels. I "thought" I was doing a good thing by doing all of this and then setting them up using an ext3 filesystem. From the reading on here I'm not clear if I should have specified something besides whatever ext3 does by default when you set it up, and if so if it's something I can still do without having to redo everything. Something I'd rather not do, to be honest. Thanks in advance, Ewan --- http://a1.blogspot.com - commentary since 2002 ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:16 ` Ewan Grantham @ 2005-01-04 2:22 ` Neil Brown 2005-01-04 2:41 ` Andy Smith 0 siblings, 1 reply; 172+ messages in thread From: Neil Brown @ 2005-01-04 2:22 UTC (permalink / raw) To: Ewan Grantham; +Cc: linux-raid On Monday January 3, ewan.grantham@gmail.com wrote: > I are confused... > > Which perhaps should be a lesson to a slightly knowlegeable user not > to read a thread like this. > > But having given myself a headache trying to figure this all out, I > guess I'll just go ahead and ask directly. > > I've setup a RAID-5 array using two internal 250 Gig HDs and two > external 250 Gig HDs through a USB-2 interface. Each of the externals > is on it's own card, and the internals are on seperate IDE channels. > > I "thought" I was doing a good thing by doing all of this and then > setting them up using an ext3 filesystem. Sounds like a perfectly fine setup (providing always that external cables are safe from stray feet etc). No need to change anything. NeilBrown > > >From the reading on here I'm not clear if I should have specified > something besides whatever ext3 does by default when you set it up, > and if so if it's something I can still do without having to redo > everything. Something I'd rather not do to be honest. > > Thanks in advance, > Ewan > --- > http://a1.blogspot.com - commentary since 2002 ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:22 ` Neil Brown @ 2005-01-04 2:41 ` Andy Smith 2005-01-04 3:42 ` Neil Brown ` (2 more replies) 0 siblings, 3 replies; 172+ messages in thread From: Andy Smith @ 2005-01-04 2:41 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 891 bytes --] On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > On Monday January 3, ewan.grantham@gmail.com wrote: > > I've setup a RAID-5 array using two internal 250 Gig HDs and two > > external 250 Gig HDs through a USB-2 interface. Each of the externals > > is on it's own card, and the internals are on seperate IDE channels. > > > > I "thought" I was doing a good thing by doing all of this and then > > setting them up using an ext3 filesystem. > > Sounds like a perfectly fine setup (providing always that external > cables are safe from stray feet etc). > > No need to change anything. Except that Peter says that the ext3 journals should be on separate non-mirrored devices and the reason this is not mentioned in any documentation (md / ext3) is that everyone sees it as obvious. Whether it is true or not it's clear to me that it's not obvious to everyone. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:41 ` Andy Smith @ 2005-01-04 3:42 ` Neil Brown 2005-01-04 9:50 ` Peter T. Breuer 2005-01-04 9:30 ` Maarten 2005-01-04 9:46 ` Peter T. Breuer 2 siblings, 1 reply; 172+ messages in thread From: Neil Brown @ 2005-01-04 3:42 UTC (permalink / raw) To: Andy Smith; +Cc: linux-raid On Tuesday January 4, andy@strugglers.net wrote: > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > I've setup a RAID-5 array using two internal 250 Gig HDs and two > > > external 250 Gig HDs through a USB-2 interface. Each of the externals > > > is on it's own card, and the internals are on seperate IDE channels. > > > > > > I "thought" I was doing a good thing by doing all of this and then > > > setting them up using an ext3 filesystem. > > > > Sounds like a perfectly fine setup (providing always that external > > cables are safe from stray feet etc). > > > > No need to change anything. > > Except that Peter says that the ext3 journals should be on separate > non-mirrored devices and the reason this is not mentioned in any > documentation (md / ext3) is that everyone sees it as obvious. > Whether it is true or not it's clear to me that it's not obvious to > everyone. If Peter says that, then Peter is WRONG. ext3 journals are much safer on mirrored devices than on non-mirrored devices just the same as any other data is safer on mirrored than on non-mirrored. In the case in question, it is raid5, not mirrored, but still raid5 is safer than raid0 or single devices (possibly not quite as safe was raid1). NeilBrown ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 3:42 ` Neil Brown @ 2005-01-04 9:50 ` Peter T. Breuer 2005-01-04 14:15 ` David Greaves 2005-01-04 16:42 ` Guy 0 siblings, 2 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 9:50 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, andy@strugglers.net wrote: > > Except that Peter says that the ext3 journals should be on separate > > non-mirrored devices and the reason this is not mentioned in any > > documentation (md / ext3) is that everyone sees it as obvious. > > Whether it is true or not it's clear to me that it's not obvious to > > everyone. > > If Peter says that, then Peter is WRONG. But Peter does NOT say that. > ext3 journals are much safer on mirrored devices than on non-mirrored That's irrelevant - you don't care what's in the journal, because if your system crashes before committal you WANT the data in the journal to be lost, rolled back, whatever, and you don't want your machine to have acked the write until it actually has gone to disk. Or at least that's what *I* want. But then everyone has different wants and needs. What is obvious, however, are the issues involved. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:50 ` Peter T. Breuer @ 2005-01-04 14:15 ` David Greaves 2005-01-04 15:20 ` Peter T. Breuer 2005-01-04 16:42 ` Guy 1 sibling, 1 reply; 172+ messages in thread From: David Greaves @ 2005-01-04 14:15 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: >>ext3 journals are much safer on mirrored devices than on non-mirrored >> >> >That's irrelevant - you don't care what's in the journal, because if >your system crashes before committal you WANT the data in the journal >to be lost, rolled back, whatever, and you don't want your machine to >have acked the write until it actually has gone to disk. > >Or at least that's what *I* want. But then everyone has different >wants and needs. What is obvious, however, are the issues involved. > > err, no. If the journal is safely written to the journal device and the machine crashes whilst updating the main filesystem you want the journal to be replayed, not erased. The journal entries are designed to be replayable to a partially updated filesystem. That's the whole point of journalling filesystems, write the deltas to the journal, make the changes to the fs, delete the deltas from the journal. If the machine crashes whilst the deltas are being written then you won't play them back - but your fs will be consistent. Journaled filesystems simply ensure the integrity of the fs metadata - they don't protect against random acts of application/user level vandalism (ie power failure). David ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:15 ` David Greaves @ 2005-01-04 15:20 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 15:20 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > Peter T. Breuer wrote: > > >>ext3 journals are much safer on mirrored devices than on non-mirrored > >That's irrelevant - you don't care what's in the journal, because if > >your system crashes before committal you WANT the data in the journal > >to be lost, rolled back, whatever, and you don't want your machine to > >have acked the write until it actually has gone to disk. > > > >Or at least that's what *I* want. But then everyone has different > >wants and needs. What is obvious, however, are the issues involved. > > If the journal is safely written to the journal device and the machine You don't know it has been. Raid can't tell. > crashes whilst updating the main filesystem you want the journal to be > replayed, not erased. The journal entries are designed to be replayable > to a partially updated filesystem. It doesn't work. You can easily get a block written to the journal on disk A, but not on disk B (supposing raid 1 with disks A and B). According to you "this" should be replayed. Well, which result do you want? Raid has no way of telling. Suppose that A contains the last block to be written to a file, and does not. Yet B is chosen by raid as the "reliable" source. Then what happens? Is the transaction declared "completed" with incomplete data? With incorrect data? Myself I'd hope it were rolled back, whichever of A or B were chosen, because some final annotation was missing from the journal, saying "finished and ready to send" (alternating bit protocol :-). But you can't win ... what if the "final" annotation were written to journal on A but not on B. Then what would happen? Well, then whichever of A or B the raid chose, you'd either get the data rolled forward or backward. Which would you prefer? I'd just prefer that it was all rolled back. > That's the whole point of journalling filesystems, write the deltas to > the journal, make the changes to the fs, delete the deltas from the journal. Consider the above. There is no magic. > If the machine crashes whilst the deltas are being written then you > won't play them back - but your fs will be consistent. What if the delta is written to one journal, but not to the other, when the machine crashes? I outlined the problem above. You can't win this game. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:50 ` Peter T. Breuer 2005-01-04 14:15 ` David Greaves @ 2005-01-04 16:42 ` Guy 2005-01-04 17:46 ` Peter T. Breuer 1 sibling, 1 reply; 172+ messages in thread From: Guy @ 2005-01-04 16:42 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid This may be a stupid question... But it seems obvious to me! If you don't want your journal after a crash, why have a journal? Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Tuesday, January 04, 2005 4:51 AM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, andy@strugglers.net wrote: > > Except that Peter says that the ext3 journals should be on separate > > non-mirrored devices and the reason this is not mentioned in any > > documentation (md / ext3) is that everyone sees it as obvious. > > Whether it is true or not it's clear to me that it's not obvious to > > everyone. > > If Peter says that, then Peter is WRONG. But Peter does NOT say that. > ext3 journals are much safer on mirrored devices than on non-mirrored That's irrelevant - you don't care what's in the journal, because if your system crashes before committal you WANT the data in the journal to be lost, rolled back, whatever, and you don't want your machine to have acked the write until it actually has gone to disk. Or at least that's what *I* want. But then everyone has different wants and needs. What is obvious, however, are the issues involved. Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 16:42 ` Guy @ 2005-01-04 17:46 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 17:46 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > This may be a stupid question... But it seems obvious to me! > If you don't want your journal after a crash, why have a journal? Journalled fs's have the property that their file systems are always coherent (provided other corruption has not occurred). This is often advantageous in terms of providing you with the ability to at least boot. The fs code is organised so that everything is set up for a metadata change, and then a single "final" atomic operation occurs that finalizes the change. It is THAT property that is desirable. It is not intrinsic to journalled file systems, but in practice only journalled file systems have implemented it. In other words, what I'd like here is a journalled file system with a zero size journal. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
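A familiar user-space analogue of the single final atomic operation Peter is after is the stage-then-rename pattern. The sketch below uses invented file names and is only meant to illustrate the commit-point idea, not how ext3 arranges its metadata updates.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Stage the new contents, push them to disk, then switch over with one
 * atomic rename(): readers see either the old file or the new one,
 * never a half-written state. */
static int atomic_replace(const char *path, const char *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.new", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd)) { close(fd); return -1; }
    close(fd);

    return rename(tmp, path);   /* the single "final" commit point */
}

int main(void)
{
    const char msg[] = "hello, atomic world\n";
    return atomic_replace("config.txt", msg, sizeof(msg) - 1) ? 1 : 0;
}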
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:41 ` Andy Smith 2005-01-04 3:42 ` Neil Brown @ 2005-01-04 9:30 ` Maarten 2005-01-04 10:18 ` Peter T. Breuer 2005-01-04 9:46 ` Peter T. Breuer 2 siblings, 1 reply; 172+ messages in thread From: Maarten @ 2005-01-04 9:30 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 03:41, Andy Smith wrote: > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > On Monday January 3, ewan.grantham@gmail.com wrote: > > No need to change anything. > > Except that Peter says that the ext3 journals should be on separate > non-mirrored devices and the reason this is not mentioned in any > documentation (md / ext3) is that everyone sees it as obvious. > Whether it is true or not it's clear to me that it's not obvious to > everyone. Be that as it may, with all that Peter wrote in the last 24 hours I tend to weigh his expertise a bit less than I did before. YMMV, but his descriptions of his data center do not instill a very high confidence, do they ? While it may be true that genius math people may make lousy server admins (and vice versa), when I read someone claiming there are random undetected errors propagating through raid, yet this person cannot even regulate his own "random, undetected" power supply problems, then I start to wonder. Would you believe that at one point, for a minute I wondered whether Peter was actually a troll ? (yeah, sorry for that, but it happened...) So no, he apparently is employed at a Spanish university, and he even has a Freshmeat project entry, something to do with raid... So I'm left with a blank stare, trying to figure out what to make of it. Maarten ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:30 ` Maarten @ 2005-01-04 10:18 ` Peter T. Breuer 2005-01-04 13:36 ` Maarten 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 10:18 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > > No need to change anything. > > > > Except that Peter says that the ext3 journals should be on separate > > non-mirrored devices and the reason this is not mentioned in any > > documentation (md / ext3) is that everyone sees it as obvious. > > Whether it is true or not it's clear to me that it's not obvious to > > everyone. > > Be that as it may, with all that Peter wrote in the last 24 hours I tend to > weigh his expertise a bit less than I did before. YMMV, but his descriptions > of his data center do not instill a very high confidence, do they ? It's not "my" data center. It is what it is. I can only control certain things in it, such as the software on the machines, and which machines are bought. Nor is it a "data center", but a working environment for about 200 scientists and engineers, plus thousands of incompetent monkeys. I.e., a university department. It would be good of you to refrain from justifications based on denigration. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 10:18 ` Peter T. Breuer @ 2005-01-04 13:36 ` Maarten 2005-01-04 14:13 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: Maarten @ 2005-01-04 13:36 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote: > Maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > It's not "my" data center. It is what it is. I can only control > certain things in it, such as the software on the machines, and which > machines are bought. Nor is it a "data center", but a working > environment for about 200 scientists and engineers, plus thousands of > incompetent monkeys. I.e., a university department. > > It would be good of you to refrain from justifications based on > denigration. I seem to recall you starting off boasting about the systems you had in place, with the rsync mirroring and all-servers-bought-in-duplicate. If then later on your whole secure data center turns out to be a school department, undoubtedly with viruses rampant, students hacking at the schools' systems, peer to peer networks installed on the big fileservers unbeknownst to the admins, and only mains power when you're lucky, yes, then I get a completely other picture than you drew at first. You can't blame me for that. This does not mean you're incompetent, it just means you called a univ IT dept something that it is not, and never will be: secure, stable and organized. In other words, if you dislike being put down, you best not boast so much. Now you'll have to excuse me, I have things to get done today. Maarten ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 13:36 ` Maarten @ 2005-01-04 14:13 ` Peter T. Breuer 2005-01-04 19:22 ` maarten 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 14:13 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote: > > Maarten <maarten@ultratux.net> wrote: > > > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > > It's not "my" data center. It is what it is. I can only control > > certain things in it, such as the software on the machines, and which > > machines are bought. Nor is it a "data center", but a working > > environment for about 200 scientists and engineers, plus thousands of > > incompetent monkeys. I.e., a university department. > > > > It would be good of you to refrain from justifications based on > > denigration. > > I seem to recall you starting off boasting about the systems you had in place, I'm not "boasting" about them. They simply ARE. > with the rsync mirroring and all-servers-bought-in-duplicate. If then later That's what there is. Is that supposed to be boasting? The servers are always bought in pairs. They always failover to each other. They contain each others mirrors. Etc. > on your whole secure data center turns out to be a school department, Eh? > undoubtedly with viruses rampant, students hacking at the schools' systems, Sure - that's precisely what there is. > peer to peer networks installed on the big fileservers unbeknownst to the Uh, no. We don't run windos. Well, it is on the clients, but I simply sabaotage them whenever I can :). That saves time. Then they can boot into the right o/s. > admins, and only mains power when you're lucky, yes, then I get a completely > other picture than you drew at first. You can't blame me for that. I don't "draw any picture". I am simply telling you it as it is. > This does not mean you're incompetent, it just means you called a univ IT dept > something that it is not, and never will be: secure, stable and organized. Eh? It's as secure stable and organised as it can be, given that nobody is in charge of anything. > In other words, if you dislike being put down, you best not boast so much. About what! > Now you'll have to excuse me, I have things to get done today. I don't. I just have to go generate some viruses and introduce chaos into some otherwise perfectly stable systems. Ho hum. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:13 ` Peter T. Breuer @ 2005-01-04 19:22 ` maarten 2005-01-04 20:05 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: maarten @ 2005-01-04 19:22 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote: > Maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote: > > > Maarten <maarten@ultratux.net> wrote: > > > > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > > > > It's not "my" data center. It is what it is. I can only control > > > certain things in it, such as the software on the machines, and which > > > machines are bought. Nor is it a "data center", but a working > > > environment for about 200 scientists and engineers, plus thousands of > > > incompetent monkeys. I.e., a university department. > I'm not "boasting" about them. They simply ARE. Are you not boasting about it, simply by providing all the little details no one cares about, except that it makes your story more believable ? If I state my IQ was tested as above 140, am I then boasting, or simply stating a fact ? Stating a fact and boasting are not mutually exclusive. > > on your whole secure data center turns out to be a school department, > > Eh? What, "Eh?" ? Are you taking offense to me calling a "university department" a school ? Is it not what you are, you are an educational institution, ie. a school. > > undoubtedly with viruses rampant, students hacking at the schools' > > systems, > > Sure - that's precisely what there is. Hah. Show me one school where there isn't. > > peer to peer networks installed on the big fileservers unbeknownst to the > > Uh, no. We don't run windos. Well, it is on the clients, but I simply > sabaotage them whenever I can :). That saves time. Then they can boot > into the right o/s. Ehm. p2p exists for linux too. Look into it. Are you so dead certain no student of yours ever found a local root hole ? Then you have more balls than you can carry. > Eh? It's as secure stable and organised as it can be, given that nobody > is in charge of anything. Normal people usually refer to such a state as "an anarchy". Not a perfect example of stability, security or organization by any stretch of the imagination... Maarten -- ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 19:22 ` maarten @ 2005-01-04 20:05 ` Peter T. Breuer 2005-01-04 21:38 ` Guy 2005-01-04 21:48 ` maarten 0 siblings, 2 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 20:05 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote: > > Maarten <maarten@ultratux.net> wrote: > > > On Tuesday 04 January 2005 11:18, Peter T. Breuer wrote: > > > > Maarten <maarten@ultratux.net> wrote: > > > > > On Tuesday 04 January 2005 03:41, Andy Smith wrote: > > > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > > > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > > > > > > It's not "my" data center. It is what it is. I can only control > > > > certain things in it, such as the software on the machines, and which > > > > machines are bought. Nor is it a "data center", but a working > > > > environment for about 200 scientists and engineers, plus thousands of > > > > incompetent monkeys. I.e., a university department. > > > I'm not "boasting" about them. They simply ARE. > > Are you not boasting about it, simply by providing all the little details no > one cares about, except that it makes your story more believable ? What "little details"? Really, this is most aggravating! > If I state my IQ was tested as above 140, am I then boasting, or simply > stating a fact ? You're being an improbability. > Stating a fact and boasting are not mutually exclusive. But about WHAT? I have no idea what you may consider boasting! > > > on your whole secure data center turns out to be a school department, > > > > Eh? > > What, "Eh?" ? What "secure data center"? I have never mentioned such a thing! We have a big floor of a big building, plus a few labs in the basement, and a few labs in another couple of buldings. We used to locate the servers in secure places in the various buildings, but nowadays we tend to dump most of them in a single highly a/c'ed rack room up here. Mind you, I still have six in my office, and I don't use a/c. I guess others are scattered arund the place too. However, all the servers are paired, and fail over to each other, and do each others mirroring. If I recall correctly, even the pairs fail over to backup pairs. Last xmas I distinctly remember holding up the department on a single surviving server because a faulty cable had intermittently taken out one pair, and a faulty router had taken out another. I forget what had happened to the remaining server. Probably the cleaners switched it off! Anyway, one survived and everything failed over to it, in a planned degradation. It would have been amusing, if I hadn't had to deal with a horrible mail loop caused by mail being bounced by he server with intermittent contact through the faulty cable. There was no way of stopping it, since I couldn't open the building till Jan 6! > Are you taking offense to me calling a "university department" a school ? No - it is a school. "La escuela superior de ...". What the french call an "ecole superior". > Is > it not what you are, you are an educational institution, ie. a school. Schools are not generally universityies, except perhaps in the united states! Elsewhere one goes to learn, not to be taught. > > > undoubtedly with viruses rampant, students hacking at the schools' > > > systems, > > > > Sure - that's precisely what there is. > > Hah. Show me one school where there isn't. 
It doesn't matter. There is nothing they can do (provided that is, the comp department manages to learn how to configure ldap so that people don't send their passwords in the clear to their server for confirmation ... however, only you and I know that, eh?). > > Uh, no. We don't run windos. Well, it is on the clients, but I simply > > sabaotage them whenever I can :). That saves time. Then they can boot > > into the right o/s. > > Ehm. p2p exists for linux too. Look into it. Are you so dead certain no Do you mean edonkey and emule by that? "p2p" signifies nothing to me except "peer to peer", which is pretty well everything. For example, samba. There's nothing wrong with using such protocols. If you mean using it to download fillums, that's a personal question - we don't check data contents, and indeed it's not clear that we legally could, since the digital information acts here recognise digital "property rights" and "rights to privacy" that we cannot intrude into. Legally, of course. > student of yours ever found a local root hole ? Absolutely. Besides - it would be trivial to do. I do it all the time. That's really not the point - we would see it at once if they decided to do anything with root - all the alarm systems would trigger if _anyone_ does anything with root. All the machines are alarmed like mines, checked daily, byte by byte, and rootkits are easy to see, whenever they turn up. I have a nice collection. Really, I am surprised at you! Any experienced sysadmin would know that such things are trivialities to spot and remove. It is merely an intelligence test, and the attacker does not have more intelligence or experience than the defenders! Quite the opposite. > Then you have more balls than you can carry. ??? > > Eh? It's as secure stable and organised as it can be, given that nobody > > is in charge of anything. > > Normal people usually refer to such a state as "an anarchy". Good - that's the way I like it. Coping with and managing chaos amounts to giving the maximum freedom to all, and preserving their freedoms. Including the freedom to mess up. To help them, I maintain copies of their work for them, and guard them against each other and outside threats. > Not a perfect example of stability, security or organization by any stretch of > the imagination... Sounds great to me! Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 20:05 ` Peter T. Breuer @ 2005-01-04 21:38 ` Guy 2005-01-04 23:53 ` Peter T. Breuer 2005-01-05 0:58 ` Mikael Abrahamsson 2005-01-04 21:48 ` maarten 1 sibling, 2 replies; 172+ messages in thread From: Guy @ 2005-01-04 21:38 UTC (permalink / raw) To: linux-raid Back to MTBF please..... I agree that 1M hours MTBF is very bogus. I don't really know how they compute MTBF. But I would like to see them compute the MTBF of a birthday candle. A birthday candle lasts about 2 minutes (as a guess). I think they would light 1000 candles at the same time. Then monitor them until the first one fails, say at 2 minutes. I think the MTBF would then be computed as 2000 minutes MTBF! But we can be sure that by 2.5 minutes, at least 90% of them would have failed. Guy ^ permalink raw reply [flat|nested] 172+ messages in thread
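Guy's arithmetic matches the usual vendor estimator, MTBF = (total unit-time on test) / (failures observed), which is how a two-minute candle can acquire a 2000-minute MTBF. A small illustrative calculation (the numbers are Guy's; the code is only a sketch):

#include <stdio.h>

int main(void)
{
    double candles      = 1000.0;  /* units lit at the same time          */
    double test_minutes = 2.0;     /* the test ends at the first failure  */
    double failures     = 1.0;

    /* the usual estimator: total unit-time on test / failures seen */
    double mtbf = candles * test_minutes / failures;

    printf("computed MTBF:      %.0f candle-minutes\n", mtbf);
    printf("actual candle life: about %.0f minutes\n", test_minutes);
    return 0;
}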
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:38 ` Guy @ 2005-01-04 23:53 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 23:53 UTC (permalink / raw) To: linux-raid Guy <bugzilla@watkins-home.com> wrote: > A birthday candle lasts about 2 minutes (as a guess). I think they would > light 1000 candles at the same time. Then monitor them until the first one > fails, say at 2 minutes. I think the MTBF would then be computed as 2000 > minutes MTBF! If the distribution is Poisson (i.e. the probability of dying per moment of time is constant over time) then that is correct. I don't know offhand if that is an unbiassed estimator. I would imagine not. It would be biassed to the short side. > But we can be sure that by 2.5 minutes, at least 90% of them > would have failed. Then you would be sure that the distribution was not Poisson. What is the problem here, exactly? Many different distributions can have the same mean. For example, this one:

deaths per unit time
|
|   /\
|  /  \
| /    \
|/      \
---------->t

and this one

deaths per unit time
|
|\      /
| \    /
|  \  /
|   \/
---------->t

have the same mean. The same mtbf. Is this a surprise? The mean on its own is only one parameter of a distribution - for a Poisson distribution, it is the only parameter, but that is a particular case. For the normal distribution you require both the mean and the standard deviation in order to specify the distribution. You can get very different normal distributions with the same mean! I can't draw a Poisson distribution in ascii, but it has a short sharp rise to the peak, then a long slow decline to infinity. If you were to imagine that half the machines had died by the time the mtbf were reached, you would be very wrong! Many more have died than half. But that long tail of those very few machines that live a LOT longer than the mtbf balances it out. I already did this once for you, but I'll do it again: if the mtbf is ten years, then 10% die every year. Or 90% survive every year. This means that by the time 10 years have passed only 35% have survived (90%^10). So 2/3 of the machines have died by the time the mtbf is reached! If you want to know where the peak of the death rate occurs, well, it looks to me as though it is at the mtbf (but I am calculating mentally, not on paper, so do your own checks). After that deaths become less frequent in the population as a whole. To estimate the mtbf, I would imagine that one averages the proportion of the population that die per month, for several months. But I guess serious applicative statisticians have evolved far more sophisticated and more efficient estimators. And then there is the problem that the distribution is bimodal, not pure Poisson. There will be a subpopulation of faulty disks that die off earlier. So they need to discount early measurements in favour of the later ones (bad luck if you get one of the subpopulation of defectives :) - but that's what their return policy is for). Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
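Peter's figures check out under a constant-hazard (exponential lifetime) model: survival to the MTBF is exp(-1), about 37%, and his year-by-year approximation gives 0.9^10, about 35%, so roughly two thirds of the population is indeed dead by the time the MTBF arrives. (One caveat: in that model the density of deaths is highest at t = 0 and falls monotonically, rather than peaking at the MTBF.) A quick check, assuming nothing beyond the standard C library:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double mtbf_years = 10.0;

    /* Peter's year-by-year approximation: 10% die each year */
    double yearly = pow(1.0 - 1.0 / mtbf_years, mtbf_years);   /* 0.9^10 */

    /* continuous constant-hazard (exponential) model */
    double exact = exp(-1.0);

    printf("fraction surviving to the MTBF, yearly steps: %.3f\n", yearly);
    printf("fraction surviving to the MTBF, exponential:  %.3f\n", exact);
    return 0;
}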
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:38 ` Guy 2005-01-04 23:53 ` Peter T. Breuer @ 2005-01-05 0:58 ` Mikael Abrahamsson 1 sibling, 0 replies; 172+ messages in thread From: Mikael Abrahamsson @ 2005-01-05 0:58 UTC (permalink / raw) To: Guy; +Cc: linux-raid On Tue, 4 Jan 2005, Guy wrote: > light 1000 candles at the same time. Then monitor them until the first one > fails, say at 2 minutes. I think the MTBF would then be computed as 2000 > minutes MTBF! But we can be sure that by 2.5 minutes, at least 90% of them > would have failed. Which is why you, when you purchase a lot of stuff, should ask for an annual return rate value, which probably makes more sense than MTBF, even though these values are related. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 172+ messages in thread
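Under the same constant-failure-rate assumption the two figures Mikael mentions are related by AFR = 1 - exp(-hours_per_year / MTBF), so a quoted 1,000,000-hour MTBF corresponds to roughly a 0.9% annualized failure rate per drive. A small illustrative check (the relation is the standard one; the numbers are only an example):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double mtbf_hours     = 1.0e6;     /* the quoted figure              */
    double hours_per_year = 8766.0;    /* 365.25 days of powered-on time */

    double afr = 1.0 - exp(-hours_per_year / mtbf_hours);
    printf("MTBF %.0f h -> annualized failure rate %.2f%%\n",
           mtbf_hours, afr * 100.0);
    return 0;
}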
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 20:05 ` Peter T. Breuer 2005-01-04 21:38 ` Guy @ 2005-01-04 21:48 ` maarten 2005-01-04 23:14 ` Peter T. Breuer 1 sibling, 1 reply; 172+ messages in thread From: maarten @ 2005-01-04 21:48 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote: > > > Maarten <maarten@ultratux.net> wrote: > > Are you not boasting about it, simply by providing all the little details > > no one cares about, except that it makes your story more believable ? > > What "little details"? Really, this is most aggravating! These little details, as you scribbled, very helpfully I might add, below. ;) | | V > over to backup pairs. Last xmas I distinctly remember holding up the > department on a single surviving server because a faulty cable had > intermittently taken out one pair, and a faulty router had taken out > another. I forget what had happened to the remaining server. Probably > the cleaners switched it off! Anyway, one survived and everything > failed over to it, in a planned degradation. > > It would have been amusing, if I hadn't had to deal with a horrible > mail loop caused by mail being bounced by he server with intermittent > contact through the faulty cable. There was no way of stopping it, > since I couldn't open the building till Jan 6! And another fine example of the various hurdles you encounter ;-) Couldn't you just get the key from someone ? If not, what if you saw something far worse happening, like all servers in one room dying shortly after another, or a full encompassing system compromise going on ?? > > Hah. Show me one school where there isn't. > > It doesn't matter. There is nothing they can do (provided that is, the > comp department manages to learn how to configure ldap so that people > don't send their passwords in the clear to their server for > confirmation ... however, only you and I know that, eh?). Yes. This is not a public mailing list. Ceci n'est pas une pipe. There is nothing they can do... except of course, running p2p nets, spreading viruses, changing their grades, finding out other students' personal info and trying out new ways to collect credit card numbers. Is that what you meant ? > Do you mean edonkey and emule by that? "p2p" signifies nothing to me > except "peer to peer", which is pretty well everything. For example, > samba. There's nothing wrong with using such protocols. If you mean > using it to download fillums, that's a personal question - we don't > check data contents, and indeed it's not clear that we legally could, > since the digital information acts here recognise digital "property > rights" and "rights to privacy" that we cannot intrude into. Legally, > of course. P2p might encompass samba in theory, but the term as used by everybody specifically targets more or less rogue networks that share movies et al. I know of the legal uncertainties associated with it (I'm in the EU too) and I do not condemn the use of them even. It's just that this type of activity can wreak havoc on a network, just from a purely technical standpoint alone. > Absolutely. Besides - it would be trivial to do. I do it all the time. > > That's really not the point - we would see it at once if they decided to > do anything with root - all the alarm systems would trigger if _anyone_ > does anything with root. 
All the machines are alarmed like mines, > checked daily, byte by byte, and rootkits are easy to see, whenever they > turn up. I have a nice collection. Yes, well, someday someone may come up with a way to defeat your alarms and tripwire / AIDE or whatever you have in place... For instance, how do you check for a rogue LKM ? If coded correctly, there is little you can do to find out it is loaded (all the while feeding you the md5 checksums you expect to find, without any of you being the wiser) apart from booting off a set of known good read-only media... AFAIK. > Really, I am surprised at you! Any experienced sysadmin would know that > such things are trivialities to spot and remove. It is merely an > intelligence test, and the attacker does not have more intelligence > or experience than the defenders! Quite the opposite. Uh-huh. Defeating a random worm, yes. Finding a rogue 4777 /tmp/.../bash shell or an extra "..... root /bin/sh" line in inetd.conf is, too. Those things are scriptkiddies at work. But from math students I expect much more, and so should you, I think. You are dealing with highly intelligent people, some of whom already know more about computers than you'll ever know. (the same holds true for me though, as I'm no young student anymore either...) Maarten -- When I answered where I wanted to go today, they just hung up -- Unknown ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:48 ` maarten @ 2005-01-04 23:14 ` Peter T. Breuer 2005-01-05 1:53 ` maarten 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 23:14 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote: > > maarten <maarten@ultratux.net> wrote: > > > On Tuesday 04 January 2005 15:13, Peter T. Breuer wrote: > > > > Maarten <maarten@ultratux.net> wrote: > > > > > Are you not boasting about it, simply by providing all the little details > > > no one cares about, except that it makes your story more believable ? > > > > What "little details"? Really, this is most aggravating! > These little details, as you scribbled, very helpfully I might add, below. ;) > | > | > V > > > over to backup pairs. Last xmas I distinctly remember holding up the > > department on a single surviving server because a faulty cable had > > intermittently taken out one pair, and a faulty router had taken out > > another. I forget what had happened to the remaining server. Probably > > the cleaners switched it off! Anyway, one survived and everything > > failed over to it, in a planned degradation. This is in response to your strange statement that I had a "data center". I hope it gives you a better idea. > > It would have been amusing, if I hadn't had to deal with a horrible > > mail loop caused by mail being bounced by he server with intermittent > > contact through the faulty cable. There was no way of stopping it, > > since I couldn't open the building till Jan 6! > > And another fine example of the various hurdles you encounter ;-) > Couldn't you just get the key from someone ? Xmas holidays are total here, from 24 december to 6 jan. There is nobody in the building. Maybe a security guard doing a round, but they certainly do not have authority to let anyone in. Just the opposite! > If not, what if you saw > something far worse happening, like all servers in one room dying shortly > after another, or a full encompassing system compromise going on ?? Nothing - I could not get in. > > It doesn't matter. There is nothing they can do (provided that is, the > > comp department manages to learn how to configure ldap so that people > > don't send their passwords in the clear to their server for > > confirmation ... however, only you and I know that, eh?). > > Yes. This is not a public mailing list. Ceci n'est pas une pipe. Indeed. Let's keep it to ourselves, then. > There is nothing they can do... except of course, running p2p nets, spreading > viruses, changing their grades, finding out other students' personal info and > trying out new ways to collect credit card numbers. Is that what you meant ? No - they can't do any of those things. P2p nets are not illegal, and we would see the traffic if there were any. They cannot "change their grades" because they do not have access to them - nobody does. They are sent to goodness knows where in a govt bulding somewhere via ssl (an improvement from the times when we had to fill in a computer card marked in ink, for goodness sake, but I haven't done the sending in myself lately, so I don't know the details - I give the list to the secretary rather than suffer). As to reading MY disk, anyone can do that. I don't have secrets, be it marks on anything else. Indeed, my disk will nfs mount on the student machines if they so much as cd to my home directory (but don't tell them that!). 
Of course they'd then have to figure out how to become root in order to change uid so they could read my data, and they can't do that - all the alarms in the building would go off! su isn't even executable, let alone suid, and root login is disabled so many places I forget (heh, .profile in /root ays something awful to you, and then exits), and then there are the reapers, the monitors, oh, everything, waiting for just such an opportunity to ring the alarm bells. As to holes in other protocols, I can't even remenber a daemon that runs as root nowadays without looking! What? And so what? If they got a root shell, everything would start howling. And then if they got a root shell and did something, all the alrms would go off again as the checks swung in on the hour. Why would they risk it? Na .. we only get breakin attempts from script-kiddies outside, not inside. As to credit card numbers - nobody has one. Students don't earn enough to get credit cards. Heck, even the profs don't! As to personal info, they can see whatever anyone can see. There is no special "personal info" anywhere in particular. If somebody wants to keep their password in a file labelled "my passwords" in a world readable directory of their own creation, called "secret", that is their own lookout. If somebody else steals their "digital identity" (if only they knew what it was) they can change their password - heck they have enough trouble remembering the one they have! I'm not paranoid - this is an ordinary place. They either act All I do is provide copies of their accounts, and failover services for them to use, or to hld up other services. And they have plenty of illegal things to do on their own without involving me. > P2p might encompass samba in theory, but the term as used by everybody > specifically targets more or less rogue networks that share movies et al. Not by me - you must be in a particular clique. This is a networking department! It would be strange if anyone were NOT running a peer to peer system! > do not condemn the use of them even. It's just that this type of activity can > wreak havoc on a network, just from a purely technical standpoint alone. Why should it wreak havoc? We have no problem with bandwidth. We have far more prblems when the routing classes deliberately change the network topologies, or some practical implements RIP and one student gets it wrong! There is a time of year when the network bounces like a yo yo because the students are implementing proxy arp and getting it completely wrong! > > That's really not the point - we would see it at once if they decided to > > do anything with root - all the alarm systems would trigger if _anyone_ > > does anything with root. All the machines are alarmed like mines, > > checked daily, byte by byte, and rootkits are easy to see, whenever they > > turn up. I have a nice collection. > > Yes, well, someday someone may come up with a way to defeat your alarms and > tripwire / AIDE or whatever you have in place... For instance, how do you No they won't. And if they do, so what? They will fall over the next one along the line! There is no way they can do it. I couldn't do it if I were trying to avoid me seeing - I'm too experienced as a defender. I can edit a running kernel to reset syscalls that have been altered by adore, and see them. I regularly have I-get-root duels, and I have no problem with getting and keeping root, while letting someone else also stay root. I can get out of a chroot jail, since I have root. I run uml honeypots. 
> check for a rogue LKM ? Easily. Not worth mentioning. Apart from the classic error that hidden processes have a different count in /proc than via other syscalls, one can see other giveaways like directories with the wrong entry count, and one can see from the outside open ports that are not visibly occupied by anything on the inside. > If coded correctly, there is little you can do to > find out it is loaded (all the while feeding you the md5 checksums you expect They can't predict what attack I can use against them to see it! And there is no defense against an unknown attack. > to find, They don't know what I expect to find, and they would have to keep the original data around, something which would show up in the free space count. And anyway I don't have to see the md5sums to know when a computer is acting strangely - it's entire signature would have changed in terms of reactions to stimuli, the apparant load versus the actual, and so on. You are not being very imaginative! > without any of you being the wiser) apart from booting off a set of > known good read-only media... AFAIK. I don't have to - they don't even know what IS the boot media. > > Really, I am surprised at you! Any experienced sysadmin would know that > > such things are trivialities to spot and remove. It is merely an > > intelligence test, and the attacker does not have more intelligence > > or experience than the defenders! Quite the opposite. > > Uh-huh. Defeating a random worm, yes. Finding a rogue 4777 /tmp/.../bash > shell or an extra "..... root /bin/sh" line in inetd.conf is, too. Those > things are scriptkiddies at work. Sure - and that's all they are. > But from math students I expect much more, I don't, but then neither are these math students - they're telecommunications engineers. > and so should you, I think. You are dealing with highly intelligent people, No I'm not! They are mostly idiots with computers. Most of them couldn't tell you how to copy a file from one place to another (I've seen it), or think of that concept to replace "move to where the file is". If any one of them were to develop to the point of being good enough to even know how to run a skript, I would be pleased. I'd even be pleased if the concept of "change your desktop environment to suit yourself" entered their head, along with "indent your code", "keep comments to less than 80 characters", and so on. If someone were to actually be capable of writing something that looked capable, I would be pleased. I've only seen decent code from overseas students - logical concepts don't seem to penetrate the environment here. The first year of the technical school (as distinct to the "superior" school) is spent trying bring some small percentage of the technical students up to the concept of loops in code - which they mostly cannot grasp. (the technical school has three-year courses, the superior school has six-year courses, though it is not unusual to take eight or nine years, or more). And if they were to be good enough to get root even for a moment, I would be plee3ed. But of course they aren't - they have enough problems passing the exams and finding somebody else to copy practicals off (which they can do simply by paying). > some of whom already know more about computers than you'll ever know. > (the same holds true for me though, as I'm no young student anymore either...) If anyone were good enough to notice, I would notice. And what would make me notice would not be good. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 23:14 ` Peter T. Breuer @ 2005-01-05 1:53 ` maarten 0 siblings, 0 replies; 172+ messages in thread From: maarten @ 2005-01-05 1:53 UTC (permalink / raw) To: linux-raid On Wednesday 05 January 2005 00:14, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > On Tuesday 04 January 2005 21:05, Peter T. Breuer wrote: > > > maarten <maarten@ultratux.net> wrote: > > If not, what if you saw > > something far worse happening, like all servers in one room dying shortly > > after another, or a full encompassing system compromise going on ?? > > Nothing - I could not get in. Now that is a sensible solution ! The fans in the server died off, you have 30 minutes before everything overheats and subsequently incinerates the whole building, and you have no way to prevent that. Great ! Well played. > No - they can't do any of those things. P2p nets are not illegal, and > we would see the traffic if there were any. They cannot "change their > grades" because they do not have access to them - nobody does. They are > sent to goodness knows where in a govt bulding somewhere via ssl (an > improvement from the times when we had to fill in a computer card marked > in ink, for goodness sake, but I haven't done the sending in myself > lately, so I don't know the details - I give the list to the secretary > rather than suffer). As to reading MY disk, anyone can do that. I > don't have secrets, be it marks on anything else. Indeed, my disk will > nfs mount on the student machines if they so much as cd to my home > directory (but don't tell them that!). Of course they'd then have to > figure out how to become root in order to change uid so they could read > my data, and they can't do that - all the alarms in the building would > go off! su isn't even executable, let alone suid, and root login is > disabled so many places I forget (heh, .profile in /root ays something > awful to you, and then exits), and then there are the reapers, the > monitors, oh, everything, waiting for just such an opportunity to ring > the alarm bells. As to holes in other protocols, I can't even remenber > a daemon that runs as root nowadays without looking! What? And so > what? If they got a root shell, everything would start howling. And > then if they got a root shell and did something, all the alrms would go > off again as the checks swung in on the hour. Why would they risk it? > Na .. we only get breakin attempts from script-kiddies outside, not > inside. Uh-oh. Where to start. Shall I start by saying that when you exploit a local root hole you _are_ root and there is no need for any su ? Or shall I start by saying that if they can get access to their tests well in advance they need not access to their grades ? Or perhaps... That your alarm bells probably are just as predictable and reliable as your UPS systems ? Let's leave it at that shall we. > > P2p might encompass samba in theory, but the term as used by everybody > > specifically targets more or less rogue networks that share movies et al. > > Not by me - you must be in a particular clique. This is a networking > department! It would be strange if anyone were NOT running a peer to > peer system! Read a newspaper someday, why don't you...? > There is a time of year when the network bounces like a yo yo because > the students are implementing proxy arp and getting it completely > wrong! Yeah. 
So maybe they are proxy-arping that PC you mentioned above that sends the grades over SSL. But nooo, no man in the middle attack there, is there ? > > Yes, well, someday someone may come up with a way to defeat your alarms > > and tripwire / AIDE or whatever you have in place... For instance, how > > do you > > No they won't. And if they do, so what? They will fall over the next > one along the line! There is no way they can do it. I couldn't do it > if I were trying to avoid me seeing - I'm too experienced as a defender. > I can edit a running kernel to reset syscalls that have been altered by > adore, and see them. I regularly have I-get-root duels, and I have no > problem with getting and keeping root, while letting someone else also > stay root. I can get out of a chroot jail, since I have root. I run > uml honeypots. W0w you'r3 5o l33t, P3ter ! But thanks, this solves our mystery here ! If you routinely change syscalls on a running kernel that has already been compromised by a rootkit, then it is no wonder you also flip a bit here and there in random files. So you were the culprit all along !!! > and one can see from the outside open ports that are not visibly > occupied by anything on the inside. Oh suuuuure. Never thought about them NOT opening an extra port to the outside ? By means of trojaning sshd, or login, or whatever. W00t ! Or else by portknocking, which google will yield results for I'm sure. > > If coded correctly, there is little you can do to > > find out it is loaded (all the while feeding you the md5 checksums you > > expect > > They can't predict what attack I can use against them to see it! And > there is no defense against an unknown attack. Nope, rather the reverse is true. YOU don't know how to defend yourself, since you don't know what they'll hit you with, when (I'll bet during the two weeks mandatory absense of christmas!) and where they're coming from. > They don't know what I expect to find, and they would have to keep the > original data around, something which would show up in the free space > count. And anyway I don't have to see the md5sums to know when a > computer is acting strangely - it's entire signature would have changed > in terms of reactions to stimuli, the apparant load versus the actual, > and so on. You are not being very imaginative! They have all the time in the world to research all your procedures, if they even have to. For one, this post is googleable. Next, they can snoop around all year on your system just behaving like the good students they seem, and last but not least you seem pretty vulnerable to a social engineering attack; you tell me -a complete stranger- all about it without the least of effort from my side. A minute longer and you'd have told me how your scripts make the md5 snapshots, what bells you have in place and what time you usually read your logfiles (giving a precise window to work in undetected). Please. Enough with the endless arrogance. You are not invincible. The fact alone that you have a "nice stack of rootkits" already is a clear sign on how well your past endeavours fared stopping intruders... > I don't, but then neither are these math students - they're > telecommunications engineers. Oh, telecom engineers ? Oh, indeed, those guys know nothing about computers whatsoever. Nothing. There isn't a single computer to be found in the telecom industry. > If someone were to actually be capable of writing something that looked > capable, I would be pleased. 
I've only seen decent code from overseas > students - logical concepts don't seem to penetrate the environment > here. The first year of the technical school (as distinct to the > "superior" school) is spent trying bring some small percentage of the > technical students up to the concept of loops in code - which they > mostly cannot grasp. The true blackhat will make an effort NOT to be noticed, so he'll be the last that will try to impress you with an impressive piece of code! It's very strange not to realize even that. I might be paranoid, but you are naive like I've never seen before... > And if they were to be good enough to get root even for a moment, I > would be plee3ed. Of course you would, but then again chances are they will not tell you they got root as that is precisely the point of the whole game. :-) > But of course they aren't - they have enough problems passing the exams > and finding somebody else to copy practicals off (which they can do > simply by paying). Or just copying it off the server directory. > If anyone were good enough to notice, I would notice. And what would > make me notice would not be good. Sure thing, Peter Mitnick... Maarten ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:41 ` Andy Smith 2005-01-04 3:42 ` Neil Brown 2005-01-04 9:30 ` Maarten @ 2005-01-04 9:46 ` Peter T. Breuer 2005-01-04 19:02 ` maarten 2 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 9:46 UTC (permalink / raw) To: linux-raid Andy Smith <andy@strugglers.net> wrote: > [-- text/plain, encoding quoted-printable, charset: us-ascii, 20 lines --] > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > On Monday January 3, ewan.grantham@gmail.com wrote: > > > I've setup a RAID-5 array using two internal 250 Gig HDs and two > > > external 250 Gig HDs through a USB-2 interface. Each of the externals > > > is on it's own card, and the internals are on seperate IDE channels. > > > > > > I "thought" I was doing a good thing by doing all of this and then > > > setting them up using an ext3 filesystem. > > > > Sounds like a perfectly fine setup (providing always that external > > cables are safe from stray feet etc). > > > > No need to change anything. > > Except that Peter says that the ext3 journals should be on separate > non-mirrored devices and the reason this is not mentioned in any > documentation (md / ext3) is that everyone sees it as obvious. No, I don't say the "SHOULD BE" is obvious. I say the issues are obvious. The "should be" is up to you to decide, based on the obvious issues involved :-). > Whether it is true or not it's clear to me that it's not obvious to > everyone. It's not obvious to anyone, where by "it" I mean whether or not you "should" put a journal on the same raid device. There are pros and cons. I would not. My reasoning is that I don't want data in the journal to be subject to the same kinds of creeping invisible corruption on reboot and resync that raid is subject to. But you can achieve that by simply not putting data in the journal at all. But what good does the journal do you then? Well, it helps you avoid an fsck on reboot. But do you want to avoid an fsck? And reason onwards ... I won't rehash the arguments. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:46 ` Peter T. Breuer @ 2005-01-04 19:02 ` maarten 2005-01-04 19:12 ` David Greaves 2005-01-04 21:08 ` Peter T. Breuer 0 siblings, 2 replies; 172+ messages in thread From: maarten @ 2005-01-04 19:02 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 10:46, Peter T. Breuer wrote: > Andy Smith <andy@strugglers.net> wrote: > > [-- text/plain, encoding quoted-printable, charset: us-ascii, 20 lines > > --] > > > > On Tue, Jan 04, 2005 at 01:22:56PM +1100, Neil Brown wrote: > > > On Monday January 3, ewan.grantham@gmail.com wrote: > > Except that Peter says that the ext3 journals should be on separate > > non-mirrored devices and the reason this is not mentioned in any > > documentation (md / ext3) is that everyone sees it as obvious. > > It's not obvious to anyone, where by "it" I mean whether or not you > "should" put a journal on the same raid device. There are pros and > cons. I would not. My reasoning is that I don't want data in the > journal to be subject to the same kinds of creeping invisible corruption > on reboot and resync that raid is subject to. But you can achieve that [ I'll attempt to adress all issues that have come up in this entire thread until now here... please bear with me. ] @Peter: I still need you to clarify what can cause such creeping corruption. There are several possible cases: 1) A bit flipped on the platter or the drive firmware had a 'thinko'. This will be signalled by the CRC / ECC on the drive. You can't flip a bit unnoticed. Or in fact, bits get 'flipped' constantly, therefore the highly sophisticated error correction code in modern drives. If the ECC can't rectify such a read error, it will issue a read error to the OS. Obviously, the raid or FS code handles this error in the usual way; this is what we call a bad sector, and we have routines that handle that perfectly. 2) An incomplete write due to a crash. This can't happen on the drive itself, as the onboard cache will ensure everything that's in there gets written to the platter. I have no reason to doubt what the manufacturer promises here, but it is easy to check if one really wants to; just issue a couple thousand cycles of well timed <write block, kill power to drive> commands, and verify if it all got written. (If not: start a class action suit against the manufacturer) Another possibility is it happening in a higher layer, the raid code or the FS code. Let's examine this further. The raid code does not promise that that can't happen ("MD raid is no substitute for a UPS"). But, the FS helps here. In the case of a journaled FS, the first that must be written is the delta. Then the data, then the delta is removed again. From this we can trivially deduce that indeed a journaled FS will not(*) suffer write reordering; as that is the only way data could get written without there first being a journal delta on disk. So at least that part is correct indeed(!) So in fact, a journaled FS will either have to rely on lower layers *not* reordering writes, or will have to wait for the ACK on the journal delta before issuing the actual_data write command(!). (*) unless it waits for the ACK mentioned above. Further, we thus can split up the write in separate actions: A) the time during which the journal delta gets written B) the time during which the data gets written C) the time during which the journal delta gets removed. Now at what point do or did we crash ? 
If it is at A) the data is consistent, no matter whether the delta got written or not. If it is at B) the data block is in an unknown state and the journal reflects that, so the journal code rolls back. If it is at C) the data is again consistent. Depending on what sense the journal delta makes, there can be a rollback, or not. In either case, the data still remains fully consistent. It's really very simple, no ? Now to get to the real point of the discussion. What changes when we have a mirror ? Well, if you think hard about that: NOTHING. What Peter tends to forget it that there is no magical mixup of drive 1's journal with drive 2's data (yep, THAT would wreak havoc!). At any point in time -whether mirror 1 is chosen as true or mirror 2 gets chosen does not matter as we will see- the metadata+data on _that_ mirror by definition will be one of the cases A through C outlined above. IT DOES NOT MATTER that mirror one might be at stage B and mirror two at stage C. We use but one mirror, and we read from that and the FS rectifies what it needs to rectify. This IS true because the raid code at boot time sees that the shutdown was not clean, and will sync the mirrors. At this point, the FS layer has not even come into play. Only when the resync has finished, the FS gets to examine its journal. -> !! At this point the mirrors are already in sync again !! <- If, for whatever reason, the raid code would NOT have seen the unclean shutdown, _then_ you may have a point, since in that special case it would be possible for the journal entry from mirror one (crashed during stage C) gets used to evaluate the data block on mirror two (being in state B). In those cases, bad things may happen obviously. If I'm not mistaken, this is what happens when one has to assemble --force an array that has had issues. But as far as I can see, that is the only time... Am I making sense so far ? (Peter, this is not adressed to you, as I already know your answer beforehand: I'd be "baby raid tech talk", correct ?) So. What possible scenarios have I overlooked until now...? Oh yeah, the possibility number 3). 3) The inconsistent write comes from a bug in the CPU, RAM, code or such. As Neil already pointed out, you gotta trust your CPU to work right otherwise all bets are off. But even if this could happen, there is no blaming the FS or the raid code, as the faulty request was carried out as directed. The drives may not be in sync, but neither the drive, the raid code nor the FS knows this (and cannot reasonably know!) If a bit in RAM gets flipped in between two writes there is nothing except ECC ram that's going to help you. Last possible theoretical case: the bug is actually IN the raid code. Well, in this case, the error will most certainly be reproduceable. I cannot speak for the code as I haven't written nor reviewed it (nor would I be able to...) but this really seems far-fetched. Lots of people use and test the code, it would have been spotted at some point. Does this make any sense to anybody ? (I sure hope so...) Maarten -- ^ permalink raw reply [flat|nested] 172+ messages in thread
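Maarten's three cases can be restated compactly. The enum names and recovery strings below are invented for illustration; this is a toy restatement of the argument, not the ext3 recovery path. The ordering point from the post stands separately: md's resync runs before the filesystem ever looks at its journal, so whichever mirror wins, the journal and the data it is replayed against come from the same copy.

#include <stdio.h>

enum crash_point { DURING_DELTA_WRITE, DURING_DATA_WRITE, DURING_DELTA_CLEAR };

/* What recovery does for a crash at each of Maarten's stages A, B, C. */
static const char *recovery(enum crash_point c)
{
    switch (c) {
    case DURING_DELTA_WRITE: return "A: delta incomplete -- ignore it, old data is intact";
    case DURING_DATA_WRITE:  return "B: delta complete -- replay it over the half-written data";
    case DURING_DELTA_CLEAR: return "C: data already on disk -- replaying the delta changes nothing";
    }
    return "";
}

int main(void)
{
    for (int c = DURING_DELTA_WRITE; c <= DURING_DELTA_CLEAR; c++)
        printf("crash during stage %d: %s\n", c, recovery(c));
    return 0;
}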
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 19:02 ` maarten @ 2005-01-04 19:12 ` David Greaves 2005-01-04 21:08 ` Peter T. Breuer 1 sibling, 0 replies; 172+ messages in thread From: David Greaves @ 2005-01-04 19:12 UTC (permalink / raw) To: maarten; +Cc: linux-raid maarten wrote: >Does this make any sense to anybody ? (I sure hope so...) > >Maarten > Oh yeah! David ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 19:02 ` maarten 2005-01-04 19:12 ` David Greaves @ 2005-01-04 21:08 ` Peter T. Breuer 2005-01-04 22:02 ` Brad Campbell ` (3 more replies) 1 sibling, 4 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 21:08 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > @Peter: > I still need you to clarify what can cause such creeping corruption. The classical cause in raid systems is 1) that data is only partially written to the array on system crash and on recovery the inappropriate choice of alternate datasets from the redundant possibles is propagated. 2) corruption occurs unnoticed in a part of the redundant data that is not currently in use, but a disk in the array then drops out, bringing the data with the error into use. On recovery of the failed disk, the error data is then propagated over the correct data. Plus the usual causes. And anything else I can't think of just now. > 1) A bit flipped on the platter or the drive firmware had a 'thinko'. > > This will be signalled by the CRC / ECC on the drive. Bits flip on our client disks all the time :(. It would be nice if it were the case that they didn't, but it isn't. Mind you, I don't know precisely HOW. I suppose more bits than the CRC can recover change, or something, and the CRC coincides. Anyway, it happens. Probably cpu -mediated. Sorry but I haven't kept any recent logs of 1-bit errors in files on readonly file systems for you to look at. > You can't flip a bit > unnoticed. Not by me, but then I run md5sum every day. Of course, there is a question if the bit changed on disk, in ram, or in the cpu's fevered miscalculations. I've seen all of those. One can tell which after a bit more detective work. > Or in fact, bits get 'flipped' constantly, therefore the highly > sophisticated error correction code in modern drives. If the ECC can't > rectify such a read error, it will issue a read error to the OS. Nope. Or at least, we see one-bit errors. > Obviously, the raid or FS code handles this error in the usual way; this is This is not an error, it is a "failure"! An error is a wrong result, not a complete failure. > what we call a bad sector, and we have routines that handle that perfectly. Well, as I recall the raid code, it doesn't handle it correctly - it simply faults the disk implicated offline. Mind you, there are indications in the comments (eg for the resync thread) that it was intended that reads (or writes?) be retried there, but I don't recall any actual code for it. > 2) An incomplete write due to a crash. > > This can't happen on the drive itself, as the onboard cache will ensure Of course it can! I thought you were the one that didn't swallow manufacturer's figures! > everything that's in there gets written to the platter. I have no reason to > doubt what the manufacturer promises here, but it is easy to check if one Oh yes you do. > Another possibility is it happening in a higher layer, the raid code or the FS > code. Let's examine this further. The raid code does not promise that that There's no need to. All these modes are possible and very well known. > In the case of a journaled FS, the first that must be written is the delta. > Then the data, then the delta is removed again. From this we can trivially > deduce that indeed a journaled FS will not(*) suffer write reordering; as Eh, we can't. Or do you mean "suffer" as in "withstand"? Yes, of course it's vulnerable to it. 
> So in fact, a journaled FS will either have to rely on lower layers *not* > reordering writes, or will have to wait for the ACK on the journal delta > before issuing the actual_data write command(!). Stephen (not Neil, sorry) says that ext3 requires just acks after write completed. Hans has said that reiserfs required no write reordering (i don't know if that has changed since he said it). (analysis of a putative journal update sequence - depending strongly on ordered writes to the journal area) > A) the time during which the journal delta gets written > B) the time during which the data gets written > C) the time during which the journal delta gets removed. > > Now at what point do or did we crash ? If it is at A) the data is consistent, The FS metadata is ALWAYS consistent. There is no need for this. > no matter whether the delta got written or not. Uh, that's not at issue. The question is whether it is CORRECT, not whether it is consistent. > If it is at B) the data > block is in an unknown state and the journal reflects that, so the journal > code rolls back. Is a rollback correct? I maintain it is always correct. > If it is at C) the data is again consistent. Depending on > what sense the journal delta makes, there can be a rollback, or not. In > either case, the data still remains fully consistent. > It's really very simple, no ? Yes - I don't know why you consistently dive into details and miss the big picture! This is nonsense - the question is not if it is consistent, but if it is CORRECT. Consistency is guaranteed. However, it will likely be incorrect. > Now to get to the real point of the discussion. What changes when we have a > mirror ? Well, if you think hard about that: NOTHING. What Peter tends to > forget it that there is no magical mixup of drive 1's journal with drive 2's > data (yep, THAT would wreak havoc!). There is. Raid knows nothing about journals. The raid read strategy is normally 128 blocks from one disk, then 128 blocks from the next disk - in kernel 2.4 . In kernel 2.6 it seems to me that it reads from the disk that it calculates the heads are best positoned for the read (in itself a bogus calculation). As to what happens on a resync rather than a read, well, it will read from one disk or another - so the journals will not be mixed up - but the result will still likely be incorrect, and always consistent (in that case). There is nothing unusual here. Will you please stop fighting about NOTHING? > At any point in time -whether mirror 1 is chosen as true or mirror 2 gets > chosen does not matter as we will see- the metadata+data on _that_ mirror by And what if there are three mirrors? You don't know either the raid read startegy or the raid resync strategy - that is plain. > definition will be one of the cases A through C outlined above. IT DOES NOT > MATTER that mirror one might be at stage B and mirror two at stage C. We use > but one mirror, and we read from that and the FS rectifies what it needs to > rectify. Unfortunately, EVEN given your unwarranted assumption that things are like that, the result is still likely to be incorrect, but will be consistent! > This IS true because the raid code at boot time sees that the shutdown was not > clean, and will sync the mirrors. But it has no way of knowing which mirror is the correct one. > At this point, the FS layer has not even > come into play. Only when the resync has finished, the FS gets to examine > its journal. -> !! At this point the mirrors are already in sync again !! <- Sure! So? 
> If, for whatever reason, the raid code would NOT have seen the unclean > shutdown, _then_ you may have a point, since in that special case it would be > possible for the journal entry from mirror one (crashed during stage C) gets > used to evaluate the data block on mirror two (being in state B). In those > cases, bad things may happen obviously. And do you know what happens in the case of a three way mirror, with a 2-1 split on what's in the mirrored journals, and the raid resyncs? (I don't!) > If I'm not mistaken, this is what happens when one has to assemble --force an > array that has had issues. But as far as I can see, that is the only time... > > Am I making sense so far ? (Peter, this is not adressed to you, as I already Not very much. As usual you are bogged down in trivialities, and are missing the big picture :(. There is no need for this little baby step analysis! We know perfectly well that crashing can leave the different journals in different states. I even suppose that half a block can e written to one of them (a sector), instead of a whole block. Are journals written to in sectors or blocks? Logic would say that it should be written in sectors, for atomicity, but I haven't checked the ext3fs code. And then you haven't considered the problem of what happens if only some bytes get sent over the BUS before hitting the disk. What happens? I don't know. I suppose bytes are acked only in units of 512. > know your answer beforehand: I'd be "baby raid tech talk", correct ?) More or less - this is horribly low-level, it doesn't get anywhere. > So. What possible scenarios have I overlooked until now...? All of them. > 3) The inconsistent write comes from a bug in the CPU, RAM, code or such. It doesn't matter! You really cannot see the wood for the trees. > As Neil already pointed out, you gotta trust your CPU to work right otherwise > all bets are off. Tough - when it overheats it can and does do anything. Ditto memory. LKML is full of Linus doing Zen debugging of an oops, saying "oooooooom, ooooooom, you have a one bit flip in bit 7 at address 17436987, ooooooom". > But even if this could happen, there is no blaming the FS > or the raid code, as the faulty request was carried out as directed. The Who's blaming! This is most odd! It simply happens, that's all. > Does this make any sense to anybody ? (I sure hope so...) No. It is neither useful nor sensical, the latter largely because of the former. APART from your interesting layout of the sequence of steps in writing the journal. Tell me, what do you mean by "a delta"? (to be able to rollback it is either a xor of the intended block versus the original, or a copy of the original block plus a copy of the intended block). Note that it is not at all necessary that a journal work that way. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:08 ` Peter T. Breuer @ 2005-01-04 22:02 ` Brad Campbell 2005-01-04 23:20 ` Peter T. Breuer 2005-01-04 22:21 ` Neil Brown ` (2 subsequent siblings) 3 siblings, 1 reply; 172+ messages in thread From: Brad Campbell @ 2005-01-04 22:02 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: >>You can't flip a bit >>unnoticed. > > > Not by me, but then I run md5sum every day. Of course, there is a > question if the bit changed on disk, in ram, or in the cpu's fevered > miscalculations. I've seen all of those. One can tell which after a bit > more detective work. > I'm wondering how difficult it may be for you to extend your md5sum script to diff the pair of files and actually determine the extent of the corruption. bit/byte/word/.../sector/.../stripe wise? I have 2 RAID-5 arrays here, a 3x233GiB and a 10x233GiB, and when I install new data on the drives I add the md5sum of that data to an existing database stored on another machine. This gets compared against the data on the arrays weekly and I have yet to see a silent corruption in 18 months. I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet. Honestly, in my years running Linux and multiple drive arrays I have never experienced errors such as you are getting. Oh.. and both my arrays are running ext3 with an internal journal (as are all my other partitions on all my other machines). Perhaps I'm lucky? Brad ^ permalink raw reply [flat|nested] 172+ messages in thread
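A minimal sketch of that kind of checking - build an md5 manifest on a second machine, then periodically re-verify the array against it. This is not Brad's actual script, just one way it could look; paths and arguments are made up.

import hashlib, os, sys

def md5_of(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root, manifest_path):
    """Record one 'digest  path' line per file, md5sum-style."""
    with open(manifest_path, "w") as out:
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p):
                    out.write(f"{md5_of(p)}  {p}\n")

def check_manifest(manifest_path):
    """Re-hash every recorded file and report any mismatch."""
    bad = 0
    with open(manifest_path) as f:
        for line in f:
            digest, path = line.rstrip("\n").split("  ", 1)
            if not os.path.exists(path) or md5_of(path) != digest:
                print("MISMATCH:", path)
                bad += 1
    return bad

if __name__ == "__main__":
    if sys.argv[1] == "build":
        build_manifest(sys.argv[2], sys.argv[3])
    else:
        sys.exit(1 if check_manifest(sys.argv[2]) else 0)

Run "build" when new data is installed and the check mode from cron; keeping the manifest on a different machine means a corrupted array cannot silently rewrite its own reference sums.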
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 22:02 ` Brad Campbell @ 2005-01-04 23:20 ` Peter T. Breuer 2005-01-05 5:44 ` Brad Campbell 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 23:20 UTC (permalink / raw) To: linux-raid Brad Campbell <brad@wasp.net.au> wrote: > I'm wondering how difficult it may be for you to extend your md5sum script to diff the pair of files > and actually determine the extent of the corruption. bit/byte/word/.../sector/.../stripe wise? Not much. But I don't bother. It's a majority vote amongst all the identical machines involved and the loser gets rewritten. The script identifies a majority group and a minority group. If the minority is 1 it rewrites it without question. If the minority group is bigger it refers the notice to me. > I have 2 RAID-5 arrays here. a 3x233GiB and a 10x233GiB and I when I install new data on the drives > I add the md5sum of that data to an existing database stored on another machine. This gets compared > against the data on the arrays weekly and I have yet to see a silent corruption in 18 months. Looking at the lists of pending repairs over xmas, I see a pile that will have to be investigated. I am about to do it, since you reminded me to look at these. > I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and > should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet. No - it should not show it. > Honestly, in my years running Linux and multiple drive arrays I have never experienced errors such > as you are getting. Then you are not trying to manage hundreds of clients at a time. > Oh.. and both my arrays are running ext3 with an internal journal (as are all my other partitions on > all my other machines). > > Perhaps I'm lucky? You're both not looking in the right way and not running the right experiment. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
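The majority-vote step Peter describes could look roughly like this; it is a hypothetical reconstruction, not his script, and the digests are simply the ones quoted later in the thread.

from collections import Counter

def majority_vote(sums_by_host):
    """sums_by_host: dict of hostname -> md5 hex digest for one file."""
    counts = Counter(sums_by_host.values())
    winner, _ = counts.most_common(1)[0]
    majority = [h for h, s in sums_by_host.items() if s == winner]
    minority = [h for h, s in sums_by_host.items() if s != winner]
    return winner, majority, minority

sums = {
    "lm003": "b4262c2eea5fa82d4092f63d6163ead5",
    "lm005": "b4262c2eea5fa82d4092f63d6163ead5",
    "lm011": "36e47f9e6cde8bc120136a06177c2923",
}
digest, majority, minority = majority_vote(sums)
if len(minority) == 1:
    print("rewrite", minority[0], "from a majority host")   # automatic repair
elif minority:
    print("refer to the operator:", minority)               # ambiguous split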
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 23:20 ` Peter T. Breuer @ 2005-01-05 5:44 ` Brad Campbell 2005-01-05 9:00 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: Brad Campbell @ 2005-01-05 5:44 UTC (permalink / raw) To: linux-raid Peter T. Breuer wrote: > Brad Campbell <brad@wasp.net.au> wrote: > >>I do occasionally remove/re-add a drive to each array, which causes a full resync of the array and >>should show up any parity inconsistency by a faulty fsck or md5sum. It has not as yet. > > > No - it should not show it. > If a bit has flipped on a parity stripe and thus the parity is inconsistent, then when I pop out a disk and put it back in, the array is going to be written from parity data that is not quite right. (The problem I believe you were talking about, where you have two identical disks and one is inconsistent - which one do you read from? - is similar.) And thus the reconstructed array is going to have different contents to the array before I failed the disk. Therefore it should show the error. No? Brad ^ permalink raw reply [flat|nested] 172+ messages in thread
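Brad's reconstruction argument is easy to demonstrate with a one-byte XOR-parity toy (illustrative only; this is not the md code):

d0 = bytes([0b10110000])
d1 = bytes([0b01000011])
parity = bytes(a ^ b for a, b in zip(d0, d1))      # RAID-5 style parity

corrupt_parity = bytes([parity[0] ^ 0b00000100])   # one silently flipped bit

# "fail" disk 1 and rebuild it from the remaining data plus parity
rebuilt_d1 = bytes(a ^ b for a, b in zip(d0, corrupt_parity))
print(rebuilt_d1 != d1)   # True: the flip reappears in the rebuilt data,
                          # where a later fsck or md5sum check could see it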
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 5:44 ` Brad Campbell @ 2005-01-05 9:00 ` Peter T. Breuer 2005-01-05 9:14 ` Brad Campbell 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-05 9:00 UTC (permalink / raw) To: linux-raid Brad Campbell <brad@wasp.net.au> wrote: > If a bit has flipped on a parity stripe and thus the parity is inconsistent. When I pop out a disk > and put it back in, the array is going to be written from parity data that is not quite right. (The > problem I believe you were talking about where you have two identical disks and one is inconsistent, > which one do you read from? is similar). And thus the reconstructed array is going to have different > contents to the array before I failed the disk. > > Therefore it should show the error. No? It will not detect it as an error, if that is what you mean. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:00 ` Peter T. Breuer @ 2005-01-05 9:14 ` Brad Campbell 2005-01-05 9:28 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: Brad Campbell @ 2005-01-05 9:14 UTC (permalink / raw) To: RAID Linux Peter T. Breuer wrote: > Brad Campbell <brad@wasp.net.au> wrote: > >>If a bit has flipped on a parity stripe and thus the parity is inconsistent, then when I pop out a disk >>and put it back in, the array is going to be written from parity data that is not quite right. (The >>problem I believe you were talking about, where you have two identical disks and one is inconsistent - >>which one do you read from? - is similar.) And thus the reconstructed array is going to have different >>contents to the array before I failed the disk. >> >>Therefore it should show the error. No? > > > It will not detect it as an error, if that is what you mean. Now here we have a difference of opinion. I'm detecting errors using md5sums and fsck. If the drive checks out clean one minute, but has a bit error in a parity stripe and I remove/re-add a drive, the array is going to rebuild that disk from the remaining data and parity. Therefore the data on that drive is going to differ compared to what it was previously. Next time I do an fsck or md5sum I'm going to notice that something has changed. I'd call that an error. Brad ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:14 ` Brad Campbell @ 2005-01-05 9:28 ` Peter T. Breuer 2005-01-05 9:43 ` Brad Campbell 2005-01-05 10:04 ` Andy Smith 0 siblings, 2 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-05 9:28 UTC (permalink / raw) To: linux-raid Brad Campbell <brad@wasp.net.au> wrote: > I'm detecting errors using md5sums and fsck. > > If the drive checks out clean 1 minute, but has a bit error in a parity stripe and I remove/re-add a > drive the array is going to rebuild that disk from the remaning data and parity. Therefore the data > on that drive is going to differ compared to what it was previously. Indeed. > Next time I do an fsck or md5sum I'm going to notice that something has changed. I'd call that an error. If your check can find that type of error, then it will detect it, but it is intrinsically unlikely that an fsck will see it because the "real estate" argument say that it is 99% likely that the error occurs inside a file or in free space rather than in metadata, so it is 99% likely that fsck will not see anything amiss. If you do an md5sum on file contents and compare with a previous md5sum run, then it will be detected provided that the error occurs in a file, but assuming that your disk is 50% full, that is only 50% likely. I.e. "it depends on your test". Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
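Putting rough, made-up numbers on that "real estate" argument (these fractions are assumptions chosen to match the figures Peter quotes, and the md5 estimate assumes every file is in the manifest):

metadata_fraction = 0.01   # share of the disk that is fs metadata (fsck's view)
file_fraction     = 0.50   # share of the disk holding file contents (a 50%-full disk)

p_fsck_notices = metadata_fraction                        # ~1%
p_md5_notices  = file_fraction                            # ~50%
p_nobody_sees  = 1 - metadata_fraction - file_fraction    # the rest lands in free space
print(p_fsck_notices, p_md5_notices, p_nobody_sees)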
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:28 ` Peter T. Breuer @ 2005-01-05 9:43 ` Brad Campbell 2005-01-05 15:09 ` Guy 2005-01-05 10:04 ` Andy Smith 1 sibling, 1 reply; 172+ messages in thread From: Brad Campbell @ 2005-01-05 9:43 UTC (permalink / raw) To: RAID Linux Sorry, sent this privately by mistake. Peter T. Breuer wrote: > If you do an md5sum on file contents and compare with a previous md5sum > run, then it will be detected provided that the error occurs in a file, > but assuming that your disk is 50% full, that is only 50% likely. > > I.e. "it depends on your test". brad@srv:~$ df -h | grep md0 /dev/md0 2.1T 2.1T 9.2G 100% /raid I'd say likely :p) Regards, Brad ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:43 ` Brad Campbell @ 2005-01-05 15:09 ` Guy 2005-01-05 15:52 ` maarten 0 siblings, 1 reply; 172+ messages in thread From: Guy @ 2005-01-05 15:09 UTC (permalink / raw) To: 'Brad Campbell', 'RAID Linux' Dude! That's a lot of mp3 files! :) -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Brad Campbell Sent: Wednesday, January 05, 2005 4:44 AM To: RAID Linux Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Sorry, sent this privately by mistake. Peter T. Breuer wrote: > If you do an md5sum on file contents and compare with a previous md5sum > run, then it will be detected provided that the error occurs in a file, > but assuming that your disk is 50% full, that is only 50% likely. > > I.e. "it depends on your test". brad@srv:~$ df -h | grep md0 /dev/md0 2.1T 2.1T 9.2G 100% /raid I'd say likely :p) Regards, Brad - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 15:09 ` Guy @ 2005-01-05 15:52 ` maarten 0 siblings, 0 replies; 172+ messages in thread From: maarten @ 2005-01-05 15:52 UTC (permalink / raw) To: linux-raid On Wednesday 05 January 2005 16:09, Guy wrote: > Dude! That's a lot of mp3 files! :) Indeed. I "only" have this now: /dev/md1 590G 590G 187M 100% /disk md1 : active raid5 sdb3[4] sda3[3] hda3[0] hdc3[5] hde3[1] hdg3[2] 618437888 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU] ...but 4 new 250 GB disks are on their way as we speak. :-) P.S.: This is my last post for a while, I have very important work to get done the rest of this week. So see you all next time! Regards, Maarten > -----Original Message----- > From: linux-raid-owner@vger.kernel.org > [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Brad Campbell > Sent: Wednesday, January 05, 2005 4:44 AM > To: RAID Linux > Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 > crashing repeatedly and hard) > Peter T. Breuer wrote: > > If you do an md5sum on file contents and compare with a previous md5sum > > run, then it will be detected provided that the error occurs in a file, > > but assuming that your disk is 50% full, that is only 50% likely. > > > > I.e. "it depends on your test". > > brad@srv:~$ df -h | grep md0 > /dev/md0 2.1T 2.1T 9.2G 100% /raid -- ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 9:28 ` Peter T. Breuer 2005-01-05 9:43 ` Brad Campbell @ 2005-01-05 10:04 ` Andy Smith 1 sibling, 0 replies; 172+ messages in thread From: Andy Smith @ 2005-01-05 10:04 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 612 bytes --] On Wed, Jan 05, 2005 at 10:28:53AM +0100, Peter T. Breuer wrote: > If you do an md5sum on file contents and compare with a previous md5sum > run, then it will be detected provided that the error occurs in a file, > but assuming that your disk is 50% full, that is only 50% likely. "If a bit flips in the unused area of the disk and there is no one there to md5sum it, did it really flip at all?" :) Out of interest Peter could you go into some details about how you automate the md5sum of your filesystems? Obviously I can think of ways I would do it but I'm interested to hear how you have it set up first. [-- Attachment #2: Type: application/pgp-signature, Size: 187 bytes --] ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:08 ` Peter T. Breuer 2005-01-04 22:02 ` Brad Campbell @ 2005-01-04 22:21 ` Neil Brown 2005-01-05 0:08 ` Peter T. Breuer 2005-01-04 22:29 ` Neil Brown 2005-01-05 0:38 ` maarten 3 siblings, 1 reply; 172+ messages in thread From: Neil Brown @ 2005-01-04 22:21 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > Uh, that's not at issue. The question is whether it is CORRECT, not > whether it is consistent. > What exactly do you mean by "correct". If I have a program that writes some data: write(fd, buffer, 8192); and then makes sure the data is on disk: fsync(fd); but the computer crashes sometime between when the write call started and the fsync called ended, then I reboot and read back that block of data from disc, what is the "CORRECT" value that I should read back? The answer is, of course, that there is no one "correct" value. It would be correct to find the data that I had tried to write. It would also be correct to find the data that had been in the file before I started the write. If the size of the write is larger than the blocksize of the filesystem, it would also be correct to find a mixture of the old data and the new data. Exactly the same is true at every level of the storage stack. There is a point in time where a write request starts, and a point in time where the request is known to complete, and between those two times the content of the affected area of storage is undefined, and could have any of several (probably 2) "correct" values. After an unclean shutdown of a raid1 array, every (working) device has correct data on it. They may not all be the same, but they are all correct. md arbitrarily chooses one of these correct values, and replicates it across all drives. While it is replicating, all reads are served by the chosen drive. NeilBrown ^ permalink raw reply [flat|nested] 172+ messages in thread
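For reference, the sequence Neil is describing is just this (a minimal sketch; the file name is arbitrary):

import os

fd = os.open("datafile", os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"x" * 8192)   # the write request
    os.fsync(fd)                # only after this returns is the new content
                                # guaranteed on stable storage; a crash in
                                # between may leave old data, new data, or a
                                # block-granularity mixture of the two
finally:
    os.close(fd)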
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 22:21 ` Neil Brown @ 2005-01-05 0:08 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-05 0:08 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > > > Uh, that's not at issue. The question is whether it is CORRECT, not > > whether it is consistent. > > > > What exactly do you mean by "correct". Whatever you mean by it - I don't have a preference myself, though I might have an opinion in specific situations. It means whatever you consider and it is up to you to make your own definition for yourself, to your own satisfaction in particular circumstances, if you feel you need a constructive definition in other terms (and I don't!). I merely gave the concept a name for you. > If I have a program that writes some data: > write(fd, buffer, 8192); > and then makes sure the data is on disk: > fsync(fd); > > but the computer crashes sometime between when the write call started > and the fsync called ended, then I reboot and read back that block of > data from disc, what is the "CORRECT" value that I should read back? I would say that if nothing on your machine or elsewhere "noticed" you doing the write of any part of the block, then the correct answer is "the block as it was before you wrote any of it". However, if nothing cares at all one way or the other, then it could be anything, what you wrote, what you got, or even any old nonsense. In other words, I would say "anything that conforms with what the universe outside the program has observed of the transaction". If you wish to apply a general one-size-fits-all rule, then I would say "as many blocks as you have written that have been read by other processes which in turn have communicated their state elsewhere should be present on the disk as is necessary to conform with that state". So if you had some other process watching the file grow, you would need to be sure that as much as that other process had seen was actually on the disk. Anyway, I don't care. It's up to you. I merely ask that you assign a probability to it (and don't tell me what it is! Please). > The answer is, of course, that there is no one "correct" value. There is whatever pleases you as correct. Please do not fall into a solipsistic trap! It is up to YOU to decide what is correct and assign probabilities. I only pointed out how those probabilities scale with the size of a RAID array WHATEVER THEY ARE. I don't care what they are to you. It's absurd to ask me. I only tell you how they grow with the size of the array. > After an unclean shutdown of a raid1 array, every (working) device > has correct data on it. They may not all be the same, but they are > all correct. No they are not. They are consistent. That is different! Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:08 ` Peter T. Breuer 2005-01-04 22:02 ` Brad Campbell 2005-01-04 22:21 ` Neil Brown @ 2005-01-04 22:29 ` Neil Brown 2005-01-05 0:19 ` Peter T. Breuer 2005-01-05 0:38 ` maarten 3 siblings, 1 reply; 172+ messages in thread From: Neil Brown @ 2005-01-04 22:29 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > Bits flip on our client disks all the time :(. You seem to be alone in reporting this. I certainly have never experienced anything quite like what you seem to be reporting. Certainly there are reports of flipped bits in memory. If you have non-ecc memory, then this is a real risk and when it happens you replace the memory. Usually it happens with a sufficiently high frequency that the computer is effectively unusable. But bits being flipped on disk, without the drive reporting an error, and without the filesystem very quickly becoming unusable, is (except for your report) unheard of. md/raid would definitely not help that sort of situation at all. NeilBrown ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 22:29 ` Neil Brown @ 2005-01-05 0:19 ` Peter T. Breuer 2005-01-05 1:19 ` Jure Pe_ar 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-05 0:19 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > Bits flip on our client disks all the time :(. > > You seem to be alone in reporting this. I certainly have never > experienced anything quite like what you seem to be reporting. I don't feel the need to prove it to you via actual evidence. You already know of mechanisms which produce such an effect: > Certainly there are reports of flipped bits in memory. .. and that is all the same to your code when it comes to resyncing. You don't care whether the change is real or produced in the cpu, on the bus, or wherever. It still is what you will observe and copy. > If you have > non-ecc memory, then this is a real risk and when it happens you > replace the memory. Sure. > Usually it happens with a sufficiently high > frequency that the computer is effectively unusable. Well, there are many computers that remain usable. When I see bit flips the first thing I request the techs to do is check the memory and keep on checking it until they find a fault. I also ask them to check the fans, clean out dust and so on. In a relatively small percentage of cases, it turns out that the changes are real, on the disk, and persist from reboot to reboot, and move with the disk when one moves it from place to place. I don't know where these come from - perhaps from the drive electronics, perhaps from the disk. > But bits being flipped on disk, without the drive reporting an error, > and without the filesystem very quickly becoming unusable, is (except > for your report) unheard of. As far as I recall it is usually a bit flipped througout a range of consecutive addresses on disk, when it happens. I haven't been monitoring this daily check for about a year now, however, so I don't have any data to show you. > md/raid would definitely not help that sort of situation at all. Nor is there any reason to suggest it should - it just doesn't check. It could. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 0:19 ` Peter T. Breuer @ 2005-01-05 1:19 ` Jure Pe_ar 2005-01-05 2:29 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: Jure Pe_ar @ 2005-01-05 1:19 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid On Wed, 5 Jan 2005 01:19:34 +0100 ptb@lab.it.uc3m.es (Peter T. Breuer) wrote: > Neil Brown <neilb@cse.unsw.edu.au> wrote: > > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > > Bits flip on our client disks all the time :(. > > > > You seem to be alone in reporting this. I certainly have never > > experienced anything quite like what you seem to be reporting. > > I don't feel the need to prove it to you via actual evidence. You > already know of mechanisms which produce such an effect: > > > Certainly there are reports of flipped bits in memory. > > .. and that is all the same to your code when it comes to resyncing. > You don't care whether the change is real or produced in the cpu, on the > bus, or wherever. It still is what you will observe and copy. You work with PC servers, so live with it. If you want to have the right to complain about bits being flipped in hardware randomly, go get a job with IBM mainframes or something. And since you like theoretic approach to problems, I might have a suggestion for you: pick a linux kernel subsystem of your choice, think of it as a state machine, roll out all the states and then check which states are not covered by the code. I think that will keep you busy and the result might have some value for the community. -- Jure Pečar http://jure.pecar.org/ - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-05 1:19 ` Jure Pe_ar @ 2005-01-05 2:29 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-05 2:29 UTC (permalink / raw) To: linux-raid Jure Pe_ar <pegasus@nerv.eu.org> wrote: > And since you like theoretic approach to problems, I might have a suggestion > for you: pick a linux kernel subsystem of your choice, think of it as a > state machine, roll out all the states and then check which states are not > covered by the code. I have no idea what you mean (I suspect you are asking about reachable states). If you want a static analyzer for the linux kernel written by me, you can try ftp://oboe.it.uc3m.es/pub/Programs/c-1.2.2.tgz > I think that will keep you busy and the result might have some value for the > community. If you wish to sneer about something, please try and put some technical expertise and effort into it. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:08 ` Peter T. Breuer ` (2 preceding siblings ...) 2005-01-04 22:29 ` Neil Brown @ 2005-01-05 0:38 ` maarten 3 siblings, 0 replies; 172+ messages in thread From: maarten @ 2005-01-05 0:38 UTC (permalink / raw) To: linux-raid [ Spoiler: this text may or may not contain harsh language and/or insulting ] [ remarks, specifically in the middle part. The reader is advised to exert ] [ some mild caution here and there. Sorry for that but my patience can ] [ and does really reach its limits, too. - Maarten ] On Tuesday 04 January 2005 22:08, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > @Peter: > > I still need you to clarify what can cause such creeping corruption. > > 1) that data is only partially written to the array on system crash > and on recovery the inappropriate choice of alternate datasets > from the redundant possibles is propagated. > > 2) corruption occurs unnoticed in a part of the redundant data that > is not currently in use, but a disk in the array then drops out, > bringing the data with the error into use. On recovery of the > failed disk, the error data is then propagated over the correct > data. Congrats, you just described the _symptoms_. We all know the alledged symptoms, if only for you repeating them over and over and over... My question was HOW they [can] occur. Disks don't go around randomly changing bits just because they dislike you, you know. > > 1) A bit flipped on the platter or the drive firmware had a 'thinko'. > > > > This will be signalled by the CRC / ECC on the drive. > > Bits flip on our client disks all the time :(. It would be nice if it > were the case that they didn't, but it isn't. Mind you, I don't know > precisely HOW. I suppose more bits than the CRC can recover change, or > something, and the CRC coincides. Anyway, it happens. Probably cpu > -mediated. Sorry but I haven't kept any recent logs of 1-bit errors in > files on readonly file systems for you to look at. Well I don't think we would want to crosspost this flamefest ^W discussion to a mailinglist where our resident linux blockdevice people hang out, but I'm reasonably certain that the ECC correction in drives is very solid, and very good at correcting multiple bit errors, or at least signalling them so an official read error can be issued. If you experience bit errors that often I'd wager it is in another layer of your setup, be it CPU, network layer, or rogue scriptkiddie admins changing your files on disk. I dunno. What I do know is that nowadays the bits-per-square inch on media (CD, DVD and harddisks alike) is SO high that even during ideal circumstances the head will not read all the low-level bits correctly. It has a host of tricks to compensate for that, first and foremost error correction. If that doesn't help it can retry the read, and if that still doesn't help it can / will adjust the head very slightly in- or outwards to see if that gives a better result. (in all fairness, this happens earlier, during the read of the servo tracks, but it may still adjust slightly). If even after all that the read still fails, it issues a read error to the I/O subsystem, ie. the OS. Now it may be conceivable that a bit gets flipped by a cosmic ray, but the error correction would notice that and correct it. If too many bits got flipped, there comes a point that it will give up and give an error. 
What it will NOT do at this point, AFAIK, is return the entire sector with some bit errors in them. It will either return a good block, or no block at all accompanied by a "bad sector" error. This is logical, as most of the time you're more interested in knowing the data in unretrievable than getting it back fubar'ed. (your undetectable vs detectable, in fact) The points where there is no ECC protection against cosmic rays are in your RAM. I believe the data path between disk and controller has error checks, so do the other electrical paths. So if you see random bit errors, suspect your memory above all and not your I/O layer. Go out and buy some ECC ram, and don't forget to actually enable it in the BIOS. But you may want to change data-cables to your drives nevertheless, just to be safe. > > You can't flip a bit > > unnoticed. > > Not by me, but then I run md5sum every day. Of course, there is a > question if the bit changed on disk, in ram, or in the cpu's fevered > miscalculations. I've seen all of those. One can tell which after a bit > more detective work. Hehe. Oh yeah, sure you can. Would you please elaborate to the group here how in the hell you can distinguish a bit being flipped by the CPU and one being flipped while in RAM ? Cause I'd sure like to see you try...! I suppose lots of terms like axiom, poisson and binomial etc. will be used in your explanation ? Otherwise we might not believe it, you know... ;-) Luckily we don't yet use quantum computers, otherwise just you observing the bit would make it vanish, hehehe. Back to seriousness, tho. > Nope. Or at least, we see one-bit errors. Yep, I'm sure you do. I'm just not sure they originate on the I/O layer. > > Obviously, the raid or FS code handles this error in the usual way; this > > is > > This is not an error, it is a "failure"! An error is a wrong result, not > a complete failure. Be that as it may (it's just language definitions) you perfectly understand what I meant: a "bad sector"-error is issued to the nearest OS layer. > > what we call a bad sector, and we have routines that handle that > > perfectly. > > Well, as I recall the raid code, it doesn't handle it correctly - it > simply faults the disk implicated offline. True, but that is NOT the point. The point is, the error IS detectable; the disk just said as much. We're hunting for your improbable UNdetectable errors, and how they can technically occur. Because you say you see them, but you have not shown us HOW they could even originate. Basically, *I* am doing your research now! > > 2) An incomplete write due to a crash. > > > > This can't happen on the drive itself, as the onboard cache will ensure > > Of course it can! I thought you were the one that didn't swallow > manufacturer's figures! MTBF, no. Because that is purely marketspeak. Technical and _verifiable_ specs I can believe, if only for the fact that I can verify them to be true. I outlined already how you can do that yourself too...: Look, it isn't rocket science. All you'd need is a computer-controlled relay that switches off the drive. Trivially made off the parallel port. Then you write some short code that issues write requests and sends block to the drive and then shuts the drive down with varying timings in between to cover all possibilities. All that in a loop which sends different data to different offsets each time. Then you leave that running for a night or so. The next morning you check all the offsets for your written data and compare. 
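A sketch of what one cycle of such a test harness could look like; cut_drive_power() and restore_drive_power() stand in for the hypothetical parallel-port relay and are not real APIs, and the device path, offset and timing are placeholders.

import os, time

def cut_drive_power():       # hypothetical hook: toggle the relay on the parallel port
    raise NotImplementedError

def restore_drive_power():   # hypothetical hook
    raise NotImplementedError

def one_cycle(dev_path, offset, payload, delay_s):
    """Write a block synchronously, cut power after delay_s, then verify."""
    fd = os.open(dev_path, os.O_WRONLY | os.O_DSYNC)
    try:
        os.pwrite(fd, payload, offset)   # payload should be a multiple of 512 bytes
        time.sleep(delay_s)              # vary this across cycles to cover all timings
        cut_drive_power()
    finally:
        os.close(fd)
    restore_drive_power()                # wait for the drive to spin back up here
    fd = os.open(dev_path, os.O_RDONLY)
    try:
        return os.pread(fd, len(payload), offset) == payload
    finally:
        os.close(fd)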
Without being overly paranoid, I think someone has already conducted such tests. Ask around on the various ATA mailinglists (if you care enough). But honestly, deploying a working UPS is both more elegant, less expensive and more logical. Who cares if a write gets botched during a power cut to the drive, you just make triple sure that that can never happen: Simple. OS crashes do not cut power to the drive, only PSUs and UPSes can. So cover those two bases and you're set. Child's play. > > Another possibility is it happening in a higher layer, the raid code or > > the FS code. Let's examine this further. The raid code does not promise > > that that > > There's no need to. All these modes are possible and very well known. You're like a stuck record aren't you ? We're searching for the real truth here, and all you say in your defense is "It's the truth!" "It is the truth!" "It IS the truth!" like a small child repeating over and over. You've never attempted to prove any of your wild statements, yet demand from us that we take your word for granted. Not so. Either you prove your point, or at the very least you try not to sabotage people who try to find proof. As I am attempting now. You're harassing me. Go away until you have a meaningful input ! dickwad ! > > In the case of a journaled FS, the first that must be written is the > > delta. Then the data, then the delta is removed again. From this we can > > trivially deduce that indeed a journaled FS will not(*) suffer write > > reordering; as > > Eh, we can't. Or do you mean "suffer" as in "withstand"? Yes, of > course it's vulnerable to it. suffer as in withstand, yes. > > So in fact, a journaled FS will either have to rely on lower layers *not* > > reordering writes, or will have to wait for the ACK on the journal delta > > before issuing the actual_data write command(!). > > A) the time during which the journal delta gets written > > B) the time during which the data gets written > > C) the time during which the journal delta gets removed. > > > > Now at what point do or did we crash ? If it is at A) the data is > > consistent, > > The FS metadata is ALWAYS consistent. There is no need for this. Well, either you agree that an error cannot originate here, or you don't. There is no middle way stating things like the data getting corrupt yet the metadata not showing that. The write gets verified, bit by bit, so I don't see where you're going with this...? > > no matter whether the delta got written or not. > > Uh, that's not at issue. The question is whether it is CORRECT, not > whether it is consistent. Of course it is correct. You want to know how the bit errors originate during crashes. Thus the bit errors are obviously not written _before_ the crash. Because IF they did, your only recourse is to go for one of the options in 3) further below. For now we're only describing how the data that the OS hands the FS lands on disk. Whether the data given to us by the OS is good or not is irrelevant (now). So, get with the program, please. The delta is written first and that's a fact. Next step. > > If it is at B) the data > > block is in an unknown state and the journal reflects that, so the > > journal code rolls back. > > Is a rollback correct? I maintain it is always correct. That is not an issue, you can safely leave that to the FS to figure out. It will most certainly make a more logical decision than you at this point. 
In any case, since the block is not completely written yet, the FS probably has no other choice than to roll back, since it probably misses data... This is a question left for the relevant coders, though. It still is irrelevant to this discussion however. > > If it is at C) the data is again consistent. Depending on > > what sense the journal delta makes, there can be a rollback, or not. In > > either case, the data still remains fully consistent. > > It's really very simple, no ? > > Yes - I don't know why you consistently dive into details and miss the > big picture! This is nonsense - the question is not if it is > consistent, but if it is CORRECT. Consistency is guaranteed. However, > it will likely be incorrect. NO. For crying out loud !! We're NOT EVEN talking about a mirror set here! That comes later on. This is a SINGLE disk, very simple, the FS gets handed data by the OS, the FS directs it to the MD code, the MD code hands it on down. Nothing in here except for a code bug can flip your friggin' bits !! If you indeed think it is a code bug, skip all this chapter and go to 3). Otherwise, just shut the hell up !! > > Now to get to the real point of the discussion. What changes when we > > have a mirror ? Well, if you think hard about that: NOTHING. What Peter > > tends to forget it that there is no magical mixup of drive 1's journal > > with drive 2's data (yep, THAT would wreak havoc!). > > There is. Raid knows nothing about journals. The raid read strategy > is normally 128 blocks from one disk, then 128 blocks from the next > disk - in kernel 2.4 . In kernel 2.6 it seems to me that it reads from > the disk that it calculates the heads are best positoned for the read > (in itself a bogus calculation). As to what happens on a resync rather > than a read, well, it will read from one disk or another - so the > journals will not be mixed up - but the result will still likely > be incorrect, and always consistent (in that case). Irrelevant. You should read on before you open your mouth and start blabbing. > > At any point in time -whether mirror 1 is chosen as true or mirror 2 gets > > chosen does not matter as we will see- the metadata+data on _that_ mirror > > by > > And what if there are three mirrors? You don't know either the raid read > startegy or the raid resync strategy - that is plain. Wanna stick with the program here ? What do you do if your students interrupt you and start about "But what if the theorem is incorrect and we actually have three possible outcomes?" Again: shut up and read on. > > definition will be one of the cases A through C outlined above. IT DOES > > NOT MATTER that mirror one might be at stage B and mirror two at stage C. > > We use but one mirror, and we read from that and the FS rectifies what it > > needs to rectify. > > Unfortunately, EVEN given your unwarranted assumption that things are > like that, the result is still likely to be incorrect, but will be > consistent! Unwarranted...! I took you by the hand and led you all the way here. All the while you whined and whined that that was unneccessary, and now that we got here you say I did not explain nuthin' along the way ?!? You have some nerve, mister. For the <incredibly thick> over here: IS there, yes or no, any other possible state for a disk than either state A, B or C above, at any particular time?? If the answer is YES, fully describe that imaginary state for us. If the answer is NO, shut up and listen. I mean read. Oh hell... 
> > This IS true because the raid code at boot time sees that the shutdown > > was not clean, and will sync the mirrors. > > But it has no way of knowing which mirror is the correct one. Djeez. Are you thick or what? I say it chooses any one, at random, BECAUSE _after_ the rollback of the jounaled FS code it will ALWAYS be correct(YES!) AND consistent. You just don't get the concept, do you ? There IS no INcorrect mirror, neither is there a correct mirror. They're both just mirrors in various, as yet undetermined, states of completing a write. Since the journal delta is consistent, it WILL be able to roll back (or through, or forward, or on, or whatever) to a clean state. And it will. (fer cryin' out loud...!!) > > At this point, the FS layer has not even > > come into play. Only when the resync has finished, the FS gets to > > examine its journal. -> !! At this point the mirrors are already in sync > > again !! <- > > Sure! So? So the FS code will find an array in either state A, B or C and take it from there. Just as with any normal single, non-raided disk. Get it now? > > If, for whatever reason, the raid code would NOT have seen the unclean > > shutdown, _then_ you may have a point, since in that special case it > > would be possible for the journal entry from mirror one (crashed during > > stage C) gets used to evaluate the data block on mirror two (being in > > state B). In those cases, bad things may happen obviously. > > And do you know what happens in the case of a three way mirror, with a > 2-1 split on what's in the mirrored journals, and the raid resyncs? Yes. Either at random or intelligently, one is chosen(which one is entirely irrelevant!). Then the raid resync follows, then the FS code finds an array in (hey! again!) either state A, B or C. And it will roll back or roll on to reinstate the clean state. (again: do you get it now????) > (I don't!) Well, sure. That goes without saying. > > If I'm not mistaken, this is what happens when one has to assemble > > --force an array that has had issues. But as far as I can see, that is > > the only time... > > > > Am I making sense so far ? (Peter, this is not adressed to you, as I > > already > > Not very much. As usual you are bogged down in trivialities, and are > missing the big picture :(. There is no need for this little baby step > analysis! We know perfectly well that crashing can leave the different > journals in different states. I even suppose that half a block can e > written to one of them (a sector), instead of a whole block. Are > journals written to in sectors or blocks? Logic would say that it > should be written in sectors, for atomicity, but I haven't checked the > ext3fs code. Man oh man you are pityful. The elephant is right in front of you, if you'd stick out your arm you would touch it, but you keep repeating there is no elephant in sight. I give up. > And then you haven't considered the problem of what happens if only > some bytes get sent over the BUS before hitting the disk. What happens? > I don't know. I suppose bytes are acked only in units of 512. No shit...! Would that be why they call disks "block devices" ?? Your levels of comprehension amaze me more every time. No of course you can't send half block or bytes or bits to a drive. Else they would be called serial devices, not block devices, now wouldn't they ? A drive will not ACK anything unless it is received completely (how obvious is that?) > > know your answer beforehand: I'd be "baby raid tech talk", correct ?) 
> > More or less - this is horribly low-level, it doesn't get anywhere. Some people seem to disagree with you. Let's just leave it at that shall we ? > > So. What possible scenarios have I overlooked until now...? > > All of them. Oh really. (God. Is there no end to this.) > > 3) The inconsistent write comes from a bug in the CPU, RAM, code or such. > > It doesn't matter! You really cannot see the wood for the trees. I see only Peters right now, and I know I will have nightmares over you. > > As Neil already pointed out, you gotta trust your CPU to work right > > otherwise all bets are off. > > Tough - when it overheats it can and does do anything. Ditto memory. > LKML is full of Linus doing Zen debugging of an oops, saying "oooooooom, > ooooooom, you have a one bit flip in bit 7 at address 17436987, > ooooooom". How this even remotely relates to MD raid, or even I/O in general, completely eludes me. And everyone else, I suppose. But for academic purposes, I'd like to see you discuss something with Linus. He is way more short-tempered than I am, if you read LKML you'd know that. But hey, it's 2005, maybe it's time to add a chapter to the infamous Linus vs AST archives. You might qualify. Oh well, never mind... > > But even if this could happen, there is no blaming the FS > > or the raid code, as the faulty request was carried out as directed. The > > Who's blaming! This is most odd! It simply happens, that's all. Yeah... That is the nature of computers innit ? Unpredictable bastards is what they are. Math is also soooooo unpredictable, I really hate that. (do I really need to place a smiley here?) > > Does this make any sense to anybody ? (I sure hope so...) > > No. It is neither useful nor sensical, the latter largely because of > the former. APART from your interesting layout of the sequence of > steps in writing the journal. Tell me, what do you mean by "a delta"? The entry in the journal that contains the info on what a next data-write will be, where it will take place, and how to reconstruct the data in case of <problem>. (As if that wasn't obvious by now.) > (to be able to rollback it is either a xor of the intended block versus > the original, or a copy of the original block plus a copy of the > intended block). I have no deep knowledge of how the intricacies of a journaled FS work. If I would have, we would not have had this discussion in the first place since I would have said yesterday "Peter you're wrong" and that would've ended all of this right then and there. (oh yes!) If you care to know, go pester other lists about it, or read some reiserfs or ext3 code and find out for yourself. > Note that it is not at all necessary that a journal work that way. To me the sole thing I care about is that it can repair the missing block and how it manages that is of no great concern to me. I do not have to know how a pentium is made in order to use it and program for it. Maarten -- ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 2:07 ` Neil Brown 2005-01-04 2:16 ` Ewan Grantham @ 2005-01-04 9:40 ` Peter T. Breuer 2005-01-04 14:03 ` David Greaves 1 sibling, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 9:40 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@cse.unsw.edu.au> wrote: > On Tuesday January 4, ptb@lab.it.uc3m.es wrote: > > > > Then the probability of an error occuring UNdetected on a n-disk raid > > > > array is > > > > > > > > (n-1)p + np' > > > > > > > > > The probability of an event occurring lies between 0 and 1 inclusive. > > > You have given a formula for a probability which could clearly evaluate > > > to a number greater than 1. So it must be wrong. > > > > The hypothesis here is that p is vanishingly small. I.e. this is a Poisson > > distribution - the analysis assumes that only one event can occcur per > > unit time. Take the unit too be one second if you like. Does that make > > it true enough for you? > > Sorry, I didn't see any such hypothesis stated and I don't like to > assUme. You don't have to. It is conventional. It doesn't need saying. > So what you are really saying is that: > for sufficiently small p and p' (i.e. p-squared terms can be ignored) > the probability of an error occurring undetected approximates > (n-1)p + np' > > this may be true, but I'm still having trouble understanding what your > p and p' really mean. Examine your conscience. They're dependent on you. All I say is that they exist. They represent two different classes of error, one detectible by whatever thing like fsck you run as an "experiment", and one not. But you are right in that I have been sloppy about defining what I mean. For one thing I have mixed probailities "per unit time" and multiplied them by probabilities associated with a single observation (your experiment with fsck or whatever) made at a certain moment. I do that because I know that it would make no difference if I integrated up the the instantaneous probabilities and then multiplied. Thus if you want to be more formal, you want to stick some integral signs in and get (n-1) /p dt + n /p' dt. Or if you wanted to calculate in terms of mean times to a detected event, well, you'd modify that again. But the principle remains the same: the probability of a single undetectible error rises in proportion to the number of disks n, and the probability of a detectible error going undetected rises in proportion to n-1, because your experiment to detect the error will only test one of the possible disks at the crucial point. > > I mean an error occurs that can be detected (by the experiment you run, > > which is prsumably an fsck, but I don't presume to dictate to you). > > > > The whole point of RAID is that fsck should NEVER see any error caused > by drive failure. Then I guess you have helped clarify to yourself what type of errors falls in which class! Apparently errors caused by drive failure fall in the class of "indetectible error" for you! But in any case, you are wrong, because it is quite possible for an error to spontaneously arise on a disk which WOULD be detected by fsck. What does fsck detect normally if it is not that! > I think we have a major communication failure here, because I have no > idea what sort of failure scenario you are imagining. I am not imagining. It is up to you. > > Likewise, I don't know. It's whatever error your experiment > > (presumably an fsck) will miss. 
> > But 'fsck's primary purpose is not to detect errors on the disk. Of course it is (it does not mix and make cakes - it precisely and exactly detects errors on the disk it is run on, and repairs the filesystem to either work around those errors, or repairs the errors themselves). > It is > to repair a filesystem after an unclean shutdown. Those are "errors on the disk". It is of no interest to fsck how they are caused. Fsck simply has a certain capacity for detecting anomalies (and fixing them). If you have a better test than fsck, by all means run it! > It can help out a > bit after disk corruption, but usually disk corruption (apart from > very minimal problems) causes fsck to fail to do anything useful. I would have naively said you were right simply by the real estate argument - fsck checks only metadata, and metadata occupies abut 1% of the disk real estate only. Nevertheless experience suggests that it is very good at detecting when strange _physical_ things have happened on the disk - I presume that is because physical strangenesses affect a block or two at a time, and are much more likely than a bit error to hit some metadata amongst that. Certainly single bit errors occur relatively undetected by fsck (in conformity with the real estate argument), as I know because I check the md5sums of all files on all machines daily, and they change spontaneously without human intervention :). In readonly areas! (the rate is probably about 1 bit per disk per three months, on average, but I'd have to check that to see if my estimate from memory is accurate). Fsck never finds those. But I do. Shrug - so our definitions of detectible and undetectible error are different. > > They happen all the time - just write a 1 to disk A and a zero to disk > > B in the middle of the data in some file, and you will have an > > undetectible error (vis a vis your experimental observation, which is > > presumably an fsck). > > But this doesn't happen. You *don't* write 1 to disk A and 0 to disk > B. Then write a 1 to disk A and DON'T write a 1 to disk B, but do it over a patch where there is a 0 already. There is no need for you to make such hard going of this! Invent your own examples, please. > I admit that this can actually happen occasionally (but certainly not It happens EVERY time I choose to do it. Or a software agent of my choice decides to do it :). I decide to do it with probability p' (;-). Call me Murphy. Or Maxwell. > "all the time"). But when it does, there will be subsequent writes to > both A and B with new, correct, data. During the intervening time There may or there may not - but if I wish it there will not. I don't see why you have such trouble! > that block will not be read from A or B. You are imagining some particular mechanism that I, and I presume the rest of us, are not. I think you are thinking of raid and how it works. Please clean your thoughts of it .. this part of the argument has nothing particularly to do with raid or any implementation of it. It is more generic than that. It is simply the probability of something going "wrong" on n disks and the question of whether you can detect that wrongness with some particular test of yours (and HERE is where raid is slightly involved) that only reads from one of the n disks for each block that it does read. > If there is a system crash before correct, consistent data is written, Exactly. > then on restart, disk B will not be read at all until disk A as been Why do you think so? 
I know of no mechanism in RAID that records to which of the two disks paired data has been written and to which it has not! Please clarify - this is important. If you are thinking of the "event count" that is stamped on the superblocks, that is only updated from time to time as far as I know! Can you please specify (for my curiousity) exactly when it is updated? That would be useful to know. > completely copied on it. > > So again, I fail to see your failure scenario. Try harder! Neil, there is no need for you to make such hard going of it! If you like, pay a co-worker to put a 1 on one disk and a 0 on another, and see if you can detect it! Errors arise spontaneously on disks, and and then there are errors caused by being written by overheated cpus which write a 1 where they meant a 0, just before dying, and then there are errors caused by stuck bits in RAM, and so on. And THEN there are errors caused by wrting ONE of a pair of paired writes to a mirror pair, just before the system crashes. It is not hard to think of such things. > > > or high level software error (i.e. the wrong data was written - and > > > that doesn't really count). > > > > It counts just fine, since it's what does happen :- consider a system > > crash that happens AFTER one of a pair of writes to the two disk > > components has completed, but BEFORE the second has completed. Then on > > reboot your experiment (an fsck) has the task of finding the error > > (which exists at least as a discrepency between the two disks), if it > > can, and shouting at you about it. > > No. RAID will not let you see that discrepancy Of course it won't - that's the point. Raid won't even know it's there! > and will not let the > discrepancy last any longer that it takes to read on drive and write > the other. WHICH drive does it read and which does it write? It ha no way of knowing which, does it? > Maybe I'm beginning to understand your failure scenario. > It involves different data being written to the drives. Correct? That is one possible way, sure. But the error on the drive can also change spontaneously! Look, here are some outputs from the daily md5sum run on a group of identical machines: /etc/X11/fvwm2/menudefs.hook: (7) b4262c2eea5fa82d4092f63d6163ead5 : lm003 lm005 lm006 lm007 lm008 lm009 lm010 /etc/X11/fvwm2/menudefs.hook: (1) 36e47f9e6cde8bc120136a06177c2923 : lm011 That file on one of them mutated overnight. > That only happens if: > 1/ there is a software error > 2/ there is an admin error And if there is a hardware error. Hardware can do what it likes. Anyway, I don't care HOW. > You seem to be saying that if this happens, then raid is less reliable > than non-raid. No, I am saying nothing of the kind. I am simply pointing at the probabilities. > There may be some truth in this, but it is irrelevant. > The likelyhood of such a software error or admin error happening on a > well-managed machine is substantially less than the likelyhood of a > drive media error, and raid will protect from drive media errors. No it won't! I don't know why you say this either - oh, your definition of "error" must be "when the drive returns a failure for a sector or block read". Sorry, I don't mean anything so specific. I mean anything at all that might be considered an error, such as the mutating bits in the daily check shown above. > So using raid might reduce reliability in a tiny number of cases, but > will increase it substantially in a vastly greater number of cases. Look at the probabilities, nothing else. 
Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
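Reading the ASCII "/p dt" above as integral signs, the first-order approximation being appealed to can be written out explicitly. The observation window T and the integral form are my reconstruction of the argument; it only holds while both integrals stay much smaller than 1, so that product terms can be dropped.

% p  : per-unit-time probability of a detectable error on one member disk
% p' : per-unit-time probability of an undetectable error on one member disk
% n  : number of member disks, T : length of the observation window
P_{\text{detectable but missed}} \;\approx\; (n-1)\int_0^T p\,\mathrm{d}t ,
\qquad
P_{\text{undetectable}} \;\approx\; n\int_0^T p'\,\mathrm{d}t .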
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 9:40 ` Peter T. Breuer @ 2005-01-04 14:03 ` David Greaves 2005-01-04 14:07 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: David Greaves @ 2005-01-04 14:03 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: >Then I guess you have helped clarify to yourself what type of errors >falls in which class! Apparently errors caused by drive failure fall in >the class of "indetectible error" for you! > >But in any case, you are wrong, because it is quite possible for an >error to spontaneously arise on a disk which WOULD be detected by fsck. >What does fsck detect normally if it is not that! > > It checks the filesystem metadata - not the data held in the filesystem. David ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:03 ` David Greaves @ 2005-01-04 14:07 ` Peter T. Breuer 2005-01-04 14:43 ` David Greaves 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 14:07 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > Peter T. Breuer wrote: > > >Then I guess you have helped clarify to yourself what type of errors > >falls in which class! Apparently errors caused by drive failure fall in > >the class of "indetectible error" for you! > > > >But in any case, you are wrong, because it is quite possible for an > >error to spontaneously arise on a disk which WOULD be detected by fsck. > >What does fsck detect normally if it is not that! > > > It checks the filesystem metadata - not the data held in the filesystem. So you should deduce that your test (if fsck be it) won't detect errors in the files' data, but only errors in the filesystem metadata. So? Is there some problem here? (yes, and one could add an md5sum per block to a fs, but I don't know a fs that does). Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
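The per-block md5sum Peter mentions in passing is easy to prototype outside the filesystem; a sketch follows. The block size and file path are arbitrary choices for the illustration, and nothing here hooks into ext3 itself.

import hashlib

BLOCK_SIZE = 4096  # arbitrary block size for the illustration

def block_md5s(path):
    """Return one md5 hex digest per BLOCK_SIZE chunk of the file."""
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            sums.append(hashlib.md5(block).hexdigest())
    return sums

# Usage sketch: record the digests once, recompute them later, and diff the
# two lists to see which blocks changed without any human intervention.
# baseline = block_md5s("/srv/data/somefile")   # hypothetical path
# later    = block_md5s("/srv/data/somefile")
# changed  = [i for i, (a, b) in enumerate(zip(baseline, later)) if a != b]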
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:07 ` Peter T. Breuer @ 2005-01-04 14:43 ` David Greaves 2005-01-04 15:12 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: David Greaves @ 2005-01-04 14:43 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter, Can I make a serious attempt to sum up your argument as: Disks suffer from random *detectable* corruption events on (or after) write (eg media or transient cache being hit by a cosmic ray, cpu fluctuations during write, e/m or thermal variations). Disks suffer from random *undetectable* corruption events on (or after) write (eg media or transient cache being hit by a cosmic ray, cpu fluctuations during write, e/m or thermal variations) Raid disks have more 'corruption-susceptible' data capacity per useable data capacity and so the probability of a corruption event is higher. Since a detectable error is detected it can be retried and dealt with. This leaves the fact that essentially, raid disks are less reliable than non-raid disks wrt undetectable corruption events. However, we need to carry out risk analysis to decide if the increase in susceptibility to certain kinds of corruption (cosmic rays) is acceptable given the reduction in susceptibility to other kinds (bearing or head failure). David tentative definitions: detectable = noticed by normal OS I/O. ie CRC sector failure etc undetectable = noticed by special analysis (fsck, md5sum verification etc) ^ permalink raw reply [flat|nested] 172+ messages in thread
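The summary above is essentially a risk-analysis question, and a toy simulation makes the quantities concrete. The model is an assumption of mine, not a description of md: each member independently holds a corrupt copy of a block with probability p, and a read is served from one member chosen at random.

import random

def toy_rates(n_disks, p, trials=200_000, seed=1):
    """Toy mirror model (assumed, not measured): returns
    (P[a read returns corrupt data], P[some member corrupt but the read looks clean])."""
    rng = random.Random(seed)
    read_bad = missed = 0
    for _ in range(trials):
        corrupt = [rng.random() < p for _ in range(n_disks)]
        served = rng.randrange(n_disks)
        if corrupt[served]:
            read_bad += 1
        elif any(corrupt):
            missed += 1
    return read_bad / trials, missed / trials

p = 0.01
print("1 disk :", toy_rates(1, p))   # (~p, 0)
print("2 disks:", toy_rates(2, p))   # (~p, ~p)
print("3 disks:", toy_rates(3, p))   # (~p, ~2p)  -- the "missed" term grows as (n-1)p

In this toy model the chance that a single read hands back bad data stays near p whatever the disk count, while the chance that some member is silently inconsistent grows roughly as (n-1)p; that is roughly the distinction between the p and 2p figures argued over elsewhere in the thread.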
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:43 ` David Greaves @ 2005-01-04 15:12 ` Peter T. Breuer 2005-01-04 16:54 ` David Greaves 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 15:12 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > Disks suffer from random *detectable* corruption events on (or after) > write (eg media or transient cache being hit by a cosmic ray, cpu > fluctuations during write, e/m or thermal variations). Well, and also people hitting the off switch (or the power going off) during a write sequence to a mirror, but after one of a pair of mirror writes has gone to disk, but before the other of the pair has. (If you want to say "but the fs is journalled", then consider what if the write is to the journal ...). > Disks suffer from random *undetectable* corruption events on (or after) > write (eg media or transient cache being hit by a cosmic ray, cpu > fluctuations during write, e/m or thermal variations) Yes. This is not different from what I have said. I didn't have any particular scenario in mind. But I see that you are correct in pointing out that some error posibilities arer _created_ by the presence of raid that would not ordinarily be present. So there is some scaling with the number of disks that needs clarification. > Raid disks have more 'corruption-susceptible' data capacity per useable > data capacity and so the probability of a corruption event is higher. Well, the probability is larger no matter what the nature of the event. In principle, and vry apprximately, there are simply more places (and times!) for it to happen TO. Yes, you may say but those errors that are produced by the cpu don't scale, nor do those that are produced by software. I'd demur. If you think about each kind you have in mind you'll see that they do scale: for example, the cpu has to work twice as often to write to two raid disks as it does to have to write to one disk, so the opportunities for IT to get something wrong are doubled. Ditto software. And of course, since it is writing twice as often , the chance of being interrupted at an inopportune time by a power failure are also doubled. See? > Since a detectable error is detected it can be retried and dealt with. No. I made no such assumption. I don't know or care what you do with a detectable error. I only say that whatever your test is, it detects it! IF it looks at the right spot, of course. And on raid the chances of doing that are halved, because it has to choose which disk to read. > This leaves the fact that essentially, raid disks are less reliable than > non-raid disks wrt undetectable corruption events. Well, that too. There is more real estate. But this "corruption" word seems to me to imply that you think I was imagining errors produced by cosmic rays. I made no such restriction. > However, we need to carry out risk analysis to decide if the increase in > susceptibility to certain kinds of corruption (cosmic rays) is Ahh. Yes you do. No I don't! This is your own invention, and I said no such thing. By "errors", I meant anything at all that you consider to be an error. It's up to you. And I see no reason to restrict the term to what is produced by something like "cosmic rays". "People hitting the off switch at the wrong time" counts just as much, as far as I know. I would guess that you are trying to classify errors by the way their probabilities scale with number of disks. 
I made no such distinction, in principle. I simply classified errors according to whether you could (in principle, also) detect them or not, whatever your test is. > acceptable given the reduction in susceptibility to other kinds (bearing > or head failure). Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
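The n-1 scaling asserted above follows from one extra assumption stated earlier in the thread: that the check reads each block from only one of the n members. To first order in p:

% Expected detectable errors on the array per unit time: n p.
% A test that reads each block from a single member sees only the fraction 1/n
% of them that happen to sit on the member it reads, so
P_{\text{missed}} \;\approx\; n p \cdot \frac{n-1}{n} \;=\; (n-1)\,p ,
\qquad \text{e.g. } n = 2 \Rightarrow p , \quad n = 3 \Rightarrow 2p .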
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 15:12 ` Peter T. Breuer @ 2005-01-04 16:54 ` David Greaves 2005-01-04 17:42 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: David Greaves @ 2005-01-04 16:54 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: >David Greaves <david@dgreaves.com> wrote: > > >>Disks suffer from random *detectable* corruption events on (or after) >>write (eg media or transient cache being hit by a cosmic ray, cpu >>fluctuations during write, e/m or thermal variations). >> >> > >Well, and also people hitting the off switch (or the power going off) >during a write sequence to a mirror, but after one of a pair of mirror >writes has gone to disk, but before the other of the pair has. > >(If you want to say "but the fs is journalled", then consider what if >the write is to the journal ...). > > Hmm. In neither case would a journalling filesystem be corrupted. The md driver (somehow) gets to decide which half of the mirror is 'best'. If the journal uses the fully written half of the mirror then it's replayed. If the journal uses the partially written half of the mirror then it's not replayed. It's just the same as powering off a normal non-resilient device. (Is your point here back to the failure to guarantee write ordering? I thought Neil answered that?) but lets carry on... >>Disks suffer from random *undetectable* corruption events on (or after) >>write (eg media or transient cache being hit by a cosmic ray, cpu >>fluctuations during write, e/m or thermal variations) >> >> > >Yes. This is not different from what I have said. I didn't have any >particular scenario in mind. > >But I see that you are correct in pointing out that some error >posibilities arer _created_ by the presence of raid that would not >ordinarily be present. So there is some scaling with the >number of disks that needs clarification. > > > >>Raid disks have more 'corruption-susceptible' data capacity per useable >>data capacity and so the probability of a corruption event is higher. >> >> > >Well, the probability is larger no matter what the nature of the event. >In principle, and vry apprximately, there are simply more places (and >times!) for it to happen TO. > > exactly what I meant. >Yes, you may say but those errors that are produced by the cpu don't >scale, nor do those that are produced by software. > No, I don't say that. > I'd demur. If you >think about each kind you have in mind you'll see that they do scale: >for example, the cpu has to work twice as often to write to two raid >disks as it does to have to write to one disk, so the opportunities for >IT to get something wrong are doubled. Ditto software. And of course, >since it is writing twice as often , the chance of being interrupted at >an inopportune time by a power failure are also doubled. > > I agree - obvious really. >See? > > yes > > > >>Since a detectable error is detected it can be retried and dealt with. >> >> > >No. I made no such assumption. I don't know or care what you do with a >detectable error. I only say that whatever your test is, it detects it! >IF it looks at the right spot, of course. And on raid the chances of >doing that are halved, because it has to choose which disk to read. > > I did when I defined detectable.... tentative definitions: detectable = noticed by normal OS I/O. 
ie CRC sector failure etc undetectable = noticed by special analysis (fsck, md5sum verification etc) And a detectable error occurs on the underlying non-raid device - so the chances are not halved since we're talking about write errors which go to both disks. Detectable read errors are retried until they succeed - if they fail then I submit that a "write (or after)" corruption occured. Hmm. It also occurs to me that undetectable errors are likely to be temporary - nothing's broken but a bit flipped during the write/store process (or the power went before it hit the media). Detectable errors are more likely to be permanent (since most detection algorithms probably have a retry). >>This leaves the fact that essentially, raid disks are less reliable than >>non-raid disks wrt undetectable corruption events. >> >> > >Well, that too. There is more real estate. > >But this "corruption" word seems to me to imply that you think I was >imagining errors produced by cosmic rays. I made no such restriction. > > No, I was attempting to convey "random, undetectable, small, non systematic" (ie I can't spot cosmic rays hitting the disk - and even if I could, only a very few would cause damage) vs significant physical failure "drive smoking and horrid graunching noise" (smoke and noise being valid detection methods!). They're only the same if you have a no process for dealing with errors. >>However, we need to carry out risk analysis to decide if the increase in >>susceptibility to certain kinds of corruption (cosmic rays) is >> >> > >Ahh. Yes you do. No I don't! This is your own invention, and I said no >such thing. By "errors", I meant anything at all that you consider to be >an error. It's up to you. And I see no reason to restrict the term to >what is produced by something like "cosmic rays". "People hitting the >off switch at the wrong time" counts just as much, as far as I know. > > You're talking about causes - I'm talking about classes of error. (I live in telco-land so most datacentres I know have more chance of suffering cosmic ray damage than Joe Random user pulling the plug - but conceptually these events are the same). Hitting the power off switch doesn't cause a physical failure - it causes inconsistency in the data. I introduce risk analysis to justify accepting the 'real estate undetectable corruption vulnerability' risk increase of raid versus the ability to cope with detectable errors. >I would guess that you are trying to classify errors by the way their >probabilities scale with number of disks. > Nope - detectable vs undetectable. > I made no such distinction, >in principle. I simply classified errors according to whether you could >(in principle, also) detect them or not, whatever your test is. > > Also, it strikes me that raid can actually find undetectable errors by doing a bit-comparison scan. Non-resilient devices with only one copy of each bit can't do that. raid 6 could even fix undetectable errors. A detectable error on a non-resilient media means you have no faith in the (possibly corrupt) data. An undetectable error on a non-resilient media means you have faith in the (possibly corrupt) data. Raid ultimately uses non-resilient media and propagates and uses this faith to deliver data to you. David ^ permalink raw reply [flat|nested] 172+ messages in thread
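The bit-comparison scan mentioned above is straightforward to prototype against the component devices or against images of them. This is a generic sketch, not md's own resync or check machinery; the paths and block size are placeholders.

def compare_mirror_halves(path_a, path_b, block_size=65536):
    """Yield byte offsets of blocks that differ between two mirror components."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break
            if a != b:
                yield offset
            offset += block_size

# Usage sketch (device paths are placeholders; run against quiesced components
# or against dd images of them):
# for off in compare_mirror_halves("/dev/sda1", "/dev/sdb1"):
#     print("mirror halves differ in the block at offset", off)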
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 16:54 ` David Greaves @ 2005-01-04 17:42 ` Peter T. Breuer 2005-01-04 19:12 ` David Greaves 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 17:42 UTC (permalink / raw) To: linux-raid David Greaves <david@dgreaves.com> wrote: > >(If you want to say "but the fs is journalled", then consider what if > >the write is to the journal ...). > Hmm. > In neither case would a journalling filesystem be corrupted. A joournalled file system is always _consistent_. That does no mean it is correct! > The md driver (somehow) gets to decide which half of the mirror is 'best'. Yep - and which is correct? > If the journal uses the fully written half of the mirror then it's replayed. > If the journal uses the partially written half of the mirror then it's > not replayed. Which is correct? > It's just the same as powering off a normal non-resilient device. Well, I see what you mean - yes, it is the same in terms of the total event space. It's just that with a single disk, the possible outcomes are randomized only over time, as you repeat the experiment. Here you have randomization of outcomes over space as well, depending on which disk you test (or how you interleave the test across the disks). And the question remains - which outcome is correct? Well, I'll answer that. Assuming that the fs layer is only notified when BOTH journal writes have happened, and tcp signals can be sent off-machine or something like that, then the correct result is the rollback, not the completion, as the world does not expect there to have been a completion given the data it has got. It's as I said. One always wants to rollback. So one doesn't want the journal to bother with data at all. > (Is your point here back to the failure to guarantee write ordering? I > thought Neil answered that?) I don't see what that has to do with anything (Neil said that write ordering is not preserved, but that writes are not acked until they have occurred - which would allow write order to be preserved if you were interested in doing so; you simply have to choose "synchronous write"). > >No. I made no such assumption. I don't know or care what you do with a > >detectable error. I only say that whatever your test is, it detects it! > >IF it looks at the right spot, of course. And on raid the chances of > >doing that are halved, because it has to choose which disk to read. > I did when I defined detectable.... tentative definitions: > detectable = noticed by normal OS I/O. ie CRC sector failure etc > undetectable = noticed by special analysis (fsck, md5sum verification etc) A detectable error is one you detect with whatever your test is. If your test is fsck, then that's the kind of error that is detected by the detection that you do ... the only condition I imposed for the analysis was that the test be conducted on the raid array, not on its underlying components. > And a detectable error occurs on the underlying non-raid device - so the > chances are not halved since we're talking about write errors which go > to both disks. Detectable read errors are retried until they succeed - > if they fail then I submit that a "write (or after)" corruption occured. I don't understand you here - you seem to be confusing hardware mechanisms with ACTUAL errors/outcomes. It is the business of your hardware to do something for you: how and what it does is immaterial to the analysis. 
The question is whether that something ends up being CORRECT or INCORRECT, in terms of YOUR wishes. Whether the hardware consisders something an error or not and what it does about it is immaterial here. It may go back in time and ask your grandmother what is your favorite colour, as far as I care - all that is important is what ENDS UP on the disk, and whether YOU consider that an error or not. So you are on some wild goose chase of your own here, I am afraid! > It also occurs to me that undetectable errors are likely to be temporary You are again on a trip of your own :( undetectable errors are errors you cannot detect with your test, and that is all! There is no implication. > - nothing's broken but a bit flipped during the write/store process (or > the power went before it hit the media). Detectable errors are more > likely to be permanent (since most detection algorithms probably have a > retry). I think that for some reason you are considering that a test (a detection test) is carried out at every moment of time. No. Only ONE test is ever carried out. It is the test you apply when you do the observation: the experiment you run decides at that single point wether the disk (the raid array) has errors or not. In practical terms, you do it usualy when you boot the raid array, and run fsck on its file system. OK? You simply leave an experiment running for a while (leave the array up, let monkeys play on it, etc.) and then you test it. That test detects some errors. However, there are two types of errors - those you can detect with your test, and those you cannot detect. My analysis simply gave the probabilities for those on the array, in terms of basic parameters for the probabilities per an individual disk. I really do not see why people make such a fuss about this! > >>However, we need to carry out risk analysis to decide if the increase in > >>susceptibility to certain kinds of corruption (cosmic rays) is > >> > > > >Ahh. Yes you do. No I don't! This is your own invention, and I said no > >such thing. By "errors", I meant anything at all that you consider to be > >an error. It's up to you. And I see no reason to restrict the term to > >what is produced by something like "cosmic rays". "People hitting the > >off switch at the wrong time" counts just as much, as far as I know. > > > > > You're talking about causes - I'm talking about classes of error. No, I'm talking about classes of error! You're talking about causes. :) > > Hitting the power off switch doesn't cause a physical failure - it > causes inconsistency in the data. I don't understand you - it causes errors just like cosmic rays do (and we can even set out and describe the mechanisms involved). The word "failure" is meaningless to me here. > >I would guess that you are trying to classify errors by the way their > >probabilities scale with number of disks. > > > Nope - detectable vs undetectable. Then what's the problem? An undetectable error is one you cannot detect via your test. Those scale with real estate. A detectible error is one you can spot with your test (on the array, not its components). The missed detectible errors scale as n-1, where n is the number of disks in the array. Thus a single disk suffers from no missed detectible errors, and a 2-disk raid array does. That's all. No fuss, no muss! > Also, it strikes me that raid can actually find undetectable errors by > doing a bit-comparison scan. No, it can't, by definition. Undetectible errors are undetectible. 
If you change your test, you change the class of errors that are undetectible. That's all. > Non-resilient devices with only one copy of each bit can't do that. > raid 6 could even fix undetectable errors. Then they are not "undetectible". The analysis is not affected by your changing the definition of what is in the undetectible class of error and what is not. It stands. I have made no assumption at all on what they are. I simply pointed out how the probabilities scale for a raid array. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
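The "synchronous write" remark above amounts to not issuing the second write until the first is known to be durable. A minimal user-space sketch of that ordering discipline, with invented file names; it says nothing about what the block layers underneath do with the guarantee.

import os

def ordered_writes(journal_path, data_path, record, data):
    """Write a journal record, force it to stable storage, then write the data.

    The fsync() between the two writes is the ordering point: the data write
    is not issued until the journal record has been acknowledged as durable.
    """
    jfd = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(jfd, record)
        os.fsync(jfd)        # barrier: journal record reaches stable storage first
    finally:
        os.close(jfd)

    dfd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(dfd, data)
        os.fsync(dfd)
    finally:
        os.close(dfd)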
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 17:42 ` Peter T. Breuer @ 2005-01-04 19:12 ` David Greaves 0 siblings, 0 replies; 172+ messages in thread From: David Greaves @ 2005-01-04 19:12 UTC (permalink / raw) To: Peter T. Breuer; +Cc: linux-raid Peter T. Breuer wrote: >A joournalled file system is always _consistent_. That does no mean it >is correct! > > To my knowledge no computers have the philosophical wherewithall to provide that service ;) If one is rude enough to stab a journalling filesystem in the back as it tries to save your data it promises only to be consistent when it is revived - it won't provide application correctness.. I think we agree on that. >>The md driver (somehow) gets to decide which half of the mirror is 'best'. >> >> >Yep - and which is correct? > > Both are 'correct' - they simply represent different points in the series of system calls made before the power went. >Which is correct? > > <grumble> ditto >And the question remains - which outcome is correct? > > same answer I'm afraid. >Well, I'll answer that. Assuming that the fs layer is only notified >when BOTH journal writes have happened, and tcp signals can be sent >off-machine or something like that, then the correct result is the >rollback, not the completion, as the world does not expect there to >have been a completion given the data it has got. > >It's as I said. One always wants to rollback. So one doesn't want the >journal to bother with data at all. > <cough>bullshit</cough> ;) I write a,b,c and d to the filesystem we begin our story when a,b and c all live on the fs device (raid or not), all synced up and consistent. I start to write d it hits journal mirror A it hits journal mirror B it finalises on journal mirror B I yank the plug The mirrors are inconsistent The filesystem is consistent I reboot scenario 1) the md device comes back using A the journal isn't finalised - it's ignored the filesystem contains a,b and c Is that correct? scenario 2) the md device comes back using B the journal is finalised - it's rolled forward the filesystem contains a,b,c and d Is that correct? Both are correct. So, I think that deals with correctness and journalling - now on to errors... >>>No. I made no such assumption. I don't know or care what you do with a >>>detectable error. I only say that whatever your test is, it detects it! >>>IF it looks at the right spot, of course. And on raid the chances of >>>doing that are halved, because it has to choose which disk to read. >>> >>> >>I did when I defined detectable.... tentative definitions: >>detectable = noticed by normal OS I/O. ie CRC sector failure etc >>undetectable = noticed by special analysis (fsck, md5sum verification etc) >> >> > >A detectable error is one you detect with whatever your test is. If >your test is fsck, then that's the kind of error that is detected by the >detection that you do ... the only condition I imposed for the analysis >was that the test be conducted on the raid array, not on its underlying >components. > > well, if we're going to get anywhere here we need to be clear about things. There are all kinds of errors - raid and redundancy will help with some and not others. An md device does have underlying components and to refuse to allow tests to compare them you remove one of the benefits of raid - redundancy. It may make it easier to model mathmatically - but then the model is wrong. 
We need to make sure we're talking about bits on a device md reads devices and it writes them. We need to understand what an error is - stop talking bollocks about "whatever the test is". This is *not* a math problem - it's simply not well enough defined yet. Lets get back to reality to decide what to model. I proposed definitions and tests (the ones used in the real world where we don't run fsck) and you've ignored them. I'll repeat them: detectable = noticed by normal OS I/O. ie CRC sector failure etc undetectable = noticed by special analysis (fsck, md5sum verification etc) I'll add 'component device comparison' to the special analysis list. No error is truly undetectable - if it were then it wouldn't matter would it? >>- nothing's broken but a bit flipped during the write/store process (or >>the power went before it hit the media). Detectable errors are more >>likely to be permanent (since most detection algorithms probably have a >>retry). >> >> > >I think that for some reason you are considering that a test (a >detection test) is carried out at every moment of time. No. Only ONE >test is ever carried out. It is the test you apply when you do the >observation: the experiment you run decides at that single point wether >the disk (the raid array) has errors or not. In practical terms, you do >it usualy when you boot the raid array, and run fsck on its file system. > >OK? >You simply leave an experiment running for a while (leave the array up, >let monkeys play on it, etc.) and then you test it. That test detects >some errors. However, there are two types of errors - those you can >detect with your test, and those you cannot detect. My analysis simply >gave the probabilities for those on the array, in terms of basic >parameters for the probabilities per an individual disk. > >I really do not see why people make such a fuss about this! > > We care about our data and raid has some vulnerabilites to corruption. We need to understand these to fix them - your analysis is woolly and unhelpful and, although it may have certain elements that are mathmatically correct - your model has flaws that mean that the conclusions are not applicable. >>>>However, we need to carry out risk analysis to decide if the increase in >>>>susceptibility to certain kinds of corruption (cosmic rays) is >>>> >>>> >>>> >>>Ahh. Yes you do. No I don't! This is your own invention, and I said no >>>such thing. By "errors", I meant anything at all that you consider to be >>>an error. It's up to you. And I see no reason to restrict the term to >>>what is produced by something like "cosmic rays". "People hitting the >>>off switch at the wrong time" counts just as much, as far as I know. >>> >>> >>> >>> >>You're talking about causes - I'm talking about classes of error. >> >> > >No, I'm talking about classes of error! You're talking about causes. :) > > No, by comparing the risk between classes of error (detectable and not) I'm talking about classes of errror - by arguing about cosmic rays and power switches you _are_ talking about causes. Personally I think there is a massive difference between the risk of detectable errors and undetectable ones. Many orders of magnitude. >>Hitting the power off switch doesn't cause a physical failure - it >>causes inconsistency in the data. >> >> >I don't understand you - it causes errors just like cosmic rays do (and >we can even set out and describe the mechanisms involved). The word >"failure" is meaningless to me here. 
> > yes, you appear to have selectively quoted and ignored what I said a line earlier: > (I live in telco-land so most datacentres I know have more chance of suffering cosmic ray damage than Joe Random user pulling the plug - but conceptually these events are the same). When that happens I begin to think that further discussion is meaningless. >>>I would guess that you are trying to classify errors by the way their >>>probabilities scale with number of disks. >>> >>> >>> >>Nope - detectable vs undetectable. >> >> > >Then what's the problem? An undetectable error is one you cannot detect >via your test. Those scale with real estate. A detectible error is one >you can spot with your test (on the array, not its components). The >missed detectible errors scale as n-1, where n is the number of disks in >the array. > >Thus a single disk suffers from no missed detectible errors, and a >2-disk raid array does. > >That's all. > >No fuss, no muss! > > and so obviously wrong! An md device does have underlying components and to refuse to allow tests to compare them you remove one of the benefits of raid - redundancy. >>Also, it strikes me that raid can actually find undetectable errors by >>doing a bit-comparison scan. >> >> > >No, it can't, by definition. Undetectible errors are undetectible. If >you change your test, you change the class of errors that are >undetectible. > >That's all. > > > >>Non-resilient devices with only one copy of each bit can't do that. >>raid 6 could even fix undetectable errors. >> >> > >Then they are not "undetectible". > > They are. Read my definition. They are not detected in normal operation with some kind of event notification/error return code; hence undetectable. However bit comparison with known good or md5 sums or with a mirror can spot such bit flips. They are still 'undetectable' in normal operation. Be consistent in your terminology. >The analisis in not affected by your changing the definition of what is >in the undetectible class of error and what is not. It stands. I have >made no assumption at all on what they are. I simply pointed out how >the probabilities scale for a raid array. > > What analysis - you are waving vague and changing definitions about and talk about grandma's favourite colour David PS any dangling sentences are because I just found so many inconsistencies that I gave up. ^ permalink raw reply [flat|nested] 172+ messages in thread
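David's a/b/c/d story can be written down as a tiny state model. This is only a schematic of his two scenarios, not a real journal format, but it shows why both end states count as consistent.

def replay(on_disk, journal_record, finalised):
    """Toy journal replay: apply the record only if it was finalised."""
    state = list(on_disk)
    if finalised:
        state.append(journal_record)   # roll the committed write forward
    return state                       # otherwise the record is ignored

committed = ["a", "b", "c"]            # already on the filesystem proper

# Scenario 1: the array comes back using the half where "d" never finalised.
print(replay(committed, "d", finalised=False))   # ['a', 'b', 'c']

# Scenario 2: it comes back using the half where "d" was finalised.
print(replay(committed, "d", finalised=True))    # ['a', 'b', 'c', 'd']

# Both results are internally consistent; they correspond to different points
# in the sequence of writes that were in flight when the power died.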
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 20:41 ` Peter T. Breuer 2005-01-03 23:19 ` Peter T. Breuer @ 2005-01-04 0:45 ` maarten 2005-01-04 10:14 ` Peter T. Breuer 1 sibling, 1 reply; 172+ messages in thread From: maarten @ 2005-01-04 0:45 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > Just for laughs, I calculated this chance also for a three-way raid-1 > > setup > > Let us (randomly) assume there is a 10% chance of a disk failure. > > No, call it "p". That is the correct name. And I presume you mean "an > error", not "a failure". You presume correctly. > > We therefore have eight possible scenarios: > > Oh, puhleeeeze. Infantile arithmetic instead of elementary probabilistic > algebra is not something I wish to suffer through ... Maybe not. Your way of explaining may make sense to a math expert, I tried to explain it in a form other humans might comprehend, and that was on purpose. Your way may be correct, or it may not be, I'll leave that up to other people. To me, it looks like you complicate it and obfuscate it, like someone can code a one-liner in perl which is completely correct yet cannot be read by anyone but the author... In other words, you try to impress me with your leet math skills but my explanation was both easier to read and potentially reached a far bigger audience. Now excuse me if my omitting "p" in my calculation made you lose your concentration... or something. Further comments to be found below. > There is no need for you to consider these scenarios. The probability > is 3p^2, which is tiny. Forget it. (actually 3p^2(1-p), but forget the > cube term). If you're going to prove something in calculations, you DO NOT 'forget' a tiny probability. This is not science, it's math. Who is to say p will always be 0.1 ? In another scenario in another calculation p might be as high as 0.9 ! > > Scenarios G and H are special, the chances > > of that occurring are calculated seperately. > > No, they are NOT special. one of them is the chance that everything is > OK, which is (1-p)^3, or approx 1-3p (surprise surprise). The other is > the completely forgetable probaility p^3 that all three are bad at that > spot. Again, you cannot go around setting (1-p)^3 to 1-3p. P is a variable which is not known to you (therefore it is a variable) thus might as well be 0.9. Is 0.1^3 the same to you as 1-2.7 ? Not really huh, is it ? > This is excruciatingly poor baby math! Oh, well then my math seems on par with your admin skills... :-p > Or approx 1-p. Which is approx the same number as what I said. > It should be p! It is one minus your previous result. > > SIgh ... 0 (1-3p) + 1/3 3p = p > > > Which, again, is exactly the same chance a single disk will get > > corrupted, as we assumed above in line one is 10%. Ergo, using raid-1 > > does not make the risks of bad data creeping in any worse. Nor does it > > make it better either. > > All false. And baby false at that. Annoying! Are your reading skills lacking ? I stated that the chance of reading bad data was 0.1, which is equal to p, so we're in agreement it is p (0.1). > Look, the chance of an undetected detectable failure occuring is > 0 (1-3p) + 2/3 3p > > = 2p > > and it grows with the number n of disks, as you may expect, being > proportional to n-1. I see no proof whatsoever of that. Is that your proof, that single line ? Do you comment your code as badly as you do your math ? 
Maarten ^ permalink raw reply [flat|nested] 172+ messages in thread
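Since part of the disagreement is plain arithmetic, it is easy to let the machine enumerate Maarten's eight scenarios exactly and set the exact figures beside the first-order shortcuts. The read model (one member chosen at random) is the same toy assumption used earlier, not a statement about md's read balancing.

from itertools import product

def three_way_mirror_figures(p):
    """Exact block-level probabilities for a 3-way mirror where each member is
    independently corrupt at a given spot with probability p and a read is
    served from one member chosen uniformly at random."""
    p_read_bad = 0.0   # a single read returns the corrupt copy
    p_missed = 0.0     # some member is corrupt but the read looks clean
    p_all_bad = 0.0    # every member is corrupt at that spot
    for outcome in product([False, True], repeat=3):      # the 8 scenarios
        prob = 1.0
        for bad in outcome:
            prob *= p if bad else (1 - p)
        n_bad = sum(outcome)
        p_read_bad += prob * n_bad / 3
        if n_bad:
            p_missed += prob * (3 - n_bad) / 3
        if n_bad == 3:
            p_all_bad += prob
    return p_read_bad, p_missed, p_all_bad

for p in (0.1, 0.001):
    read_bad, missed, all_bad = three_way_mirror_figures(p)
    print(f"p={p}: read-bad={read_bad:.6f} (exactly p), "
          f"missed={missed:.6f} (about 2p when p is small), all-bad={all_bad:.2e}")

For p = 0.1 the 2p shortcut is visibly rough (0.171 versus 0.2); for p = 0.001 it is already good to within a fraction of a percent, which is the substance of the "vanishingly small p" dispute.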
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 0:45 ` maarten @ 2005-01-04 10:14 ` Peter T. Breuer 2005-01-04 13:24 ` Maarten 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 10:14 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > > maarten <maarten@ultratux.net> wrote: > > > Just for laughs, I calculated this chance also for a three-way raid-1 > > > setup > > > > Let us (randomly) assume there is a 10% chance of a disk failure. > > > > No, call it "p". That is the correct name. And I presume you mean "an > > error", not "a failure". > > You presume correctly. > > > > We therefore have eight possible scenarios: > > > > Oh, puhleeeeze. Infantile arithmetic instead of elementary probabilistic > > algebra is not something I wish to suffer through ... > > Maybe not. Your way of explaining may make sense to a math expert, I tried to It would make sense to a 16 year old, since that's about where you get to be certified as competent in differential calculus and probability theory, if my memory of my high school math courses is correct. This is pre-university stuff by a looooooong way. The problem is that I never have a 9-year old child available when I need one ... > explain it in a form other humans might comprehend, and that was on purpose. If that's your definition of a human, I'm not sure I want to see them! > Your way may be correct, or it may not be, I'll leave that up to other people. What do you see as incorrect? > To me, it looks like you complicate it and obfuscate it, like someone can No, I simplify and make it clear. > Now excuse me if my omitting "p" in my calculation made you lose your > concentration... or something. Further comments to be found below. It does, because what you provide is a sort of line-noise instead of just "p". Not abstracting away from the detail to the information content behind it is perhaps a useful trait in a sysadmin. > > There is no need for you to consider these scenarios. The probability > > is 3p^2, which is tiny. Forget it. (actually 3p^2(1-p), but forget the > > cube term). > > If you're going to prove something in calculations, you DO NOT 'forget' a tiny You forget it because it is tiny. As tiny as you or I could wish to make it. Puhleeze. This is just Poisson distributions. > probability. This is not science, it's math. Therefore you forget it. All of differential calculus works like that. Forget the square term - it vanishes. All terms of the series beyond the first can be ignored as you go to the limiting situation. > Who is to say p will always be 0.1 ? Me. Or you. But it will always be far less. Say in the 1/10^40 range for a time interval of one second. You can look up such numbers for yourself at manufacturers sites - I vaguely recall they appear on their spec sheets. > In another scenario in another calculation p might be as high as 0.9 ! This is a probabilty PER UNIT TIME. Choose the time interval to make it as small as you like. > > > Scenarios G and H are special, the chances > > > of that occurring are calculated seperately. > > > > No, they are NOT special. one of them is the chance that everything is > > OK, which is (1-p)^3, or approx 1-3p (surprise surprise). The other is > > the completely forgetable probaility p^3 that all three are bad at that > > spot. > > Again, you cannot go around setting (1-p)^3 to 1-3p. Of course I can. 
I think you must have failed differential calculus! The derivative of (1-p)^3 near p=0 is -3. That is to say that the approximation series for (1-p)^3 is 1 - 3p + o(p). And by o(p) I mean a term that when divided by p tends to zero as p tends to 0. In other words, something that you can forget. > P is a variable which is > not known to you (therefore it is a variable) It's a value. That I call it "p" does not make it variable. > thus might as well be 0.9. Is Your logic fails here - it is exactly as small as I wish it to be, because I get to choose the interval of time (the "scale") involved. > 0.1^3 the same to you as 1-2.7 ? Not really huh, is it ? If your jaw were to drop any lower it would drag on the ground :(. This really demonstrates amazing ignorance of very elementary high school math. Perhaps you've forgotten it all! Then how do you move your hand from point A to point B? How do you deal with the various moments of inertia involved, and the feedback control, all under the affluence of incohol and gravity too? Maybe it's a Kalman filter. You try it with the other hand first, and follow that with the hand you want, compensating for the differences you see. > > This is excruciatingly poor baby math! > > Oh, well then my math seems on par with your admin skills... :-p Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
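The point in dispute here is just a binomial expansion, so it may help to write it out with concrete numbers; p = 0.1 is Maarten's own example figure from earlier in the thread.

(1-p)^3 \;=\; 1 - 3p + 3p^2 - p^3 \;\approx\; 1 - 3p \quad (p \to 0),
\qquad \text{error of the approximation} = 3p^2 - p^3 = O(p^2).
\text{e.g. } p = 0.1:\ (0.9)^3 = 0.729 \ \text{vs.}\ 1 - 0.3 = 0.7 ;
\qquad p = 10^{-6}:\ \text{difference} \approx 3\times 10^{-12}.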
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 10:14 ` Peter T. Breuer @ 2005-01-04 13:24 ` Maarten 2005-01-04 14:05 ` Peter T. Breuer 0 siblings, 1 reply; 172+ messages in thread From: Maarten @ 2005-01-04 13:24 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > > > maarten <maarten@ultratux.net> wrote: > It would make sense to a 16 year old, since that's about where you get > to be certified as competent in differential calculus and probability > theory, if my memory of my high school math courses is correct. This is > pre-university stuff by a looooooong way. Oh wow. So you deduced I did not study math at university ? Well, that IS an eye-opener for me. I was unaware studying math was a requirement to engage in conversation on the linux-raid mailinglist ? Or is this not the list I think it is ? > The problem is that I never have a 9-year old child available when I > need one ... Um, check again... he's sitting right there with you I think. > You forget it because it is tiny. As tiny as you or I could wish to > make it. Puhleeze. This is just Poisson distributions. > > Therefore you forget it. All of differential calculus works like that. > Forget the square term - it vanishes. All terms of the series beyond > the first can be ignored as you go to the limiting situation. And that is precisely what false assumption you're making ! We ARE not going to the limiting situation. We are discussing the probabilities in failures of media. You cannot assume we will be talking about harddisks, and neither is the failure rate in harddisk anywhere near limit zero. Drive manufacturers might want you to believe that through publishing highly theoretical MTBFs, but that doesn't make it so that any harddrive has a life expectancy of 20+ years, as the daily facts prove all the time. You cannot assume p to be vanishingly small. Maybe p really is the failure rate in 20 year old DAT tapes that were stored at 40 degrees C. Maybe it is the failure rate of wet floppydisks. You cannot make assumptions about p. The nice thing in math is that you can make great calculations when you assume a variable is limit zero or limit infinite. The bad thing is, you cannot assume that things in real life act like predefined math variables. > > Who is to say p will always be 0.1 ? > > Me. Or you. But it will always be far less. Say in the 1/10^40 range > for a time interval of one second. You can look up such numbers for > yourself at manufacturers sites - I vaguely recall they appear on their > spec sheets. Yes, and p will be in the range of 1.0 for time intervals 10^40 seconds. Got another wisecrack ? Of course p will approach zero when you make time interval t approach zero !! And yes, judging a time of 1 second as a realistic time interval to measure a disk drives' failure rate over time certainly qualifies as making t limit zero. Goddamnit, why am I even discussing this with you. Are you a troll ?? > Your logic fails here - it is exactly as small as I wish it to be, > because I get to choose the interval of time (the "scale") involved. No, you don't. You're as smart with numbers as drive manufactures are, letting people believe it is okay to sell a drive with a MTBF of 300000 hours, yet with a one year warrantee. 
I say if you trust your own MTBF put your money where your mouth is and extend the warrantee to something believable. You do the same thing here. I can also make such calculations: I can safely say that at this precise second (note I say second) you are not thinking clearly. I now can prove that you never are thinking clearly, simply by making time interval t of one second limit zero, and hey, whaddayaknow, your intellect goes to zero too. Neat trick huh ? > If your jaw were to drop any lower it would drag on the ground :(. This > really demonstrates amazing ignorance of very elementary high school > math. That is because I came here as a linux admin, not a math whiz. I think we have already established that you do not surpass my admin skills, in another branch of this thread, yes ? (boy is that ever an understatement !) A branch which you, wisely, left unanswered after at least two people besides myself pointed out to you how fscked up (pun intended ;) your server rooms and / or procedures are. So now you concentrate on the math angle, where you can shine your cambridge medals (whatever that is still worth, in light of this) and outshine "baby math" people all you want. I got news for you: I may not be fluent anymore in math terminology, but I certainly have the intellect and intelligence to detect and expose a bullshitter. > > Perhaps you've forgotten it all! Then how do you move your hand from > point A to point B? How do you deal with the various moments of inertia > involved, and the feedback control, all under the affluence of incohol > and gravity too? Well that's simple. I actually obey the law. Such as the law of gravity, and the law that says any outcome of a probability calculation cannot be other than between zero and one, inclusive. You clearly do not. (though I still think you obey gravity law, since obviously you're still here) Maarten ^ permalink raw reply [flat|nested] 172+ messages in thread
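The MTBF-versus-warranty comparison can be made concrete with the figures just quoted, under the usual constant-failure-rate reading of an MTBF number:

\frac{300\,000\ \text{h}}{8760\ \text{h/yr}} \;\approx\; 34\ \text{yr (MTBF)},\qquad
\text{AFR} \;\approx\; \frac{8760}{300\,000} \;\approx\; 2.9\%\ \text{per year},\qquad
P(\text{drive survives the 1-yr warranty}) \;\approx\; e^{-8760/300\,000} \;\approx\; 97\% .
% a population failure-rate figure, not a claim that a typical drive runs for
% 34 years of service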
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 13:24 ` Maarten @ 2005-01-04 14:05 ` Peter T. Breuer 2005-01-04 15:31 ` Maarten 0 siblings, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 14:05 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote: > > maarten <maarten@ultratux.net> wrote: > > > On Monday 03 January 2005 21:41, Peter T. Breuer wrote: > > > > maarten <maarten@ultratux.net> wrote: > > > It would make sense to a 16 year old, since that's about where you get > > to be certified as competent in differential calculus and probability > > theory, if my memory of my high school math courses is correct. This is > > pre-university stuff by a looooooong way. > > Oh wow. So you deduced I did not study math at university ? Well, I deduced that you did not get to the level expected of a 16 year old. > Well, that IS an eye-opener for me. I was unaware studying math was a One doesn't "study" math, one _does_ math, just as one _does_ walking down the street, talking, and opening fridge doors. Your competency at it gets certified in school and uni, that's all. > requirement to engage in conversation on the linux-raid mailinglist ? Looks like it, or else one gets bogged down in inane conversations at about the level of "what is an editor". Look - a certain level of mathematical competence is required in the technical world. You cannot get away without it. Being able to do math to the level expected of an ordinary 16 year old is certainly expected by me in order to be able to talk, walk, and eat pizza with you as an ordinary human being. As to the Poisson distribution, I know it forms part of the university syllabus even for laboratory techs first degrees, because the head tech here, who has been doing his degree in tech stuff here while he admins, used to keep coming and asking me things like "how do I calculate the expected time between two collisions given that packets are emitted from both ends according to a Poisson distribution with mean mu". And that's ordinary 16 year-old math stuff. Stands to reason they would teach it in a university at a crummy place like this. > Or is this not the list I think it is ? It's not the list you think it is, I think. > > The problem is that I never have a 9-year old child available when I > > need one ... > > Um, check again... he's sitting right there with you I think. OK! Now he can explain this man page to me, which they say a 9-year old child can understand (must be the solaris man page for "passwd" again). > > Therefore you forget it. All of differential calculus works like that. > > Forget the square term - it vanishes. All terms of the series beyond > > the first can be ignored as you go to the limiting situation. > > And that is precisely what false assumption you're making ! We ARE not going There is no false assumption. This is "precisely" what you are getting wrong. I do not want to argue with you, I just want you to GET IT. > to the limiting situation. Yes we are. > We are discussing the probabilities in failures of > media. PER UNIT TIME, for gawd's sake. Choose a small unit. One small enough to please you. Then make it "vanishingly small", slowly, please. Then we can all take a rest while you get it. > You cannot assume we will be talking about harddisks, and neither is I don't. > the failure rate in harddisk anywhere near limit zero. Eh? It's tiny per tiny unit of time. 
As you would expect, naively. > Drive manufacturers > might want you to believe that through publishing highly theoretical MTBFs, If the MTBF is one year (and we are not discussing failure, but error), then the probability of a failure PER DAY is 1/365, or about 0.003. That would make the square of that probability PER DAY about 0.00001, or negligible in the scale of the linear term. This _is_ mathematics. And if you want to consider the probability PER HOUR, then the probability of failure is about 0.0001. Per minute it is about 0.000002. Per second it is about 0.00000003. The square term is 0.000000000000001, or completely negligible. And that is failure, not error. But we don't care. Just take a time unit that is pleasingly small, and consider the probability per THAT unit. And please keep the unit to yourself. > but that doesn't make it so that any harddrive has a life expectancy of 20+ > years, as the daily facts prove all the time. It does mean it. It means precisely that (given certain experimental conditions). If you want to calculate the MTBF in a real dusty noisy environment, I would say it is about ten years. That is, 10% chance of failure per year. If they say it is 20 years and not 10 years, well I believe that too, but they must be keeping the monkeys out of the room. > You cannot assume p to be vanishingly small. I don't have to - make it so. Then yes, I can. > Maybe p really is the failure > rate in 20 year old DAT tapes that were stored at 40 degrees C. Maybe it is You simply are spouting nonsense. Please cease. It is _painful_. Like hearing somebody trying to sing when they cannot sing, or having to admire amateur holiday movies. P is vanishingly small when you make it so. Do so! It doesn't require anything but a choice on your part to choose a sufficiently small unit of time to scale it to. > the failure rate of wet floppydisks. You cannot make assumptions about p. I don't. You on the other hand ... > The nice thing in math is that you can make great calculations when you > assume a variable is limit zero or limit infinite. The bad thing is, you Complete and utter bullshit. You ought to be ashamed. > cannot assume that things in real life act like predefined math variables. Rest of crank math removed. One can't reason with people who simply don't have the wherewithal to recognise that the problem is inside them. Peter ^ permalink raw reply [flat|nested] 172+ messages in thread
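The per-interval figures above follow from the usual exponential (constant-rate) assumption; writing it out for a one-year MTBF:

P(\text{error in }\Delta t) \;=\; 1 - e^{-\Delta t/T} \;\approx\; \frac{\Delta t}{T}
\qquad (\Delta t \ll T,\ T = \text{MTBF}).
T = 1\ \text{yr}: \quad
\tfrac{1}{365} \approx 2.7\times10^{-3}\ \text{per day},\quad
\tfrac{1}{8760} \approx 1.1\times10^{-4}\ \text{per hour},\quad
\tfrac{1}{525\,600} \approx 1.9\times10^{-6}\ \text{per minute},\quad
\tfrac{1}{31\,536\,000} \approx 3.2\times10^{-8}\ \text{per second}.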
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 14:05 ` Peter T. Breuer @ 2005-01-04 15:31 ` Maarten 2005-01-04 16:21 ` Peter T. Breuer 2005-01-04 19:57 ` Mikael Abrahamsson 0 siblings, 2 replies; 172+ messages in thread
From: Maarten @ 2005-01-04 15:31 UTC (permalink / raw)
To: linux-raid

On Tuesday 04 January 2005 15:05, Peter T. Breuer wrote:
> Maarten <maarten@ultratux.net> wrote:
> > On Tuesday 04 January 2005 11:14, Peter T. Breuer wrote:
> > > maarten <maarten@ultratux.net> wrote:
> > > > On Monday 03 January 2005 21:41, Peter T. Breuer wrote:
> > > > > maarten <maarten@ultratux.net> wrote:

> > Well, that IS an eye-opener for me. I was unaware studying math was a
>
> One doesn't "study" math, one _does_ math, just as one _does_ walking
> down the street, talking, and opening fridge doors. Your competency
> at it gets certified in school and uni, that's all.

I know a whole mass of people who can't calculate what chance the toss of a
coin has. Or who don't know how to verify their money change is correct. So
it seems math is not an essential skill, like walking and talking are.
I'll not even go into gambling, which is immensely popular. I'm sure there
are even mathematicians who gamble. How do you figure that ??

> > but that doesn't make it so that any harddrive has a life expectancy of
> > 20+ years, as the daily facts prove all the time.
>
> It does mean it. It means precisely that (given certain experimental
> conditions). If you want to calculate the MTBF in a real dusty noisy
> environment, I would say it is about ten years. That is, 10% chance of
> failure per year.
>
> If they say it is 20 years and not 10 years, well I believe that too,
> but they must be keeping the monkeys out of the room.

Nope, not 10 years, not 20 years, not even 40 years. See this Seagate sheet
below where they go on record with a whopping 1,200,000 hours MTBF. That
translates to 137 years. Now can you please state here and now that you
actually believe that figure ? Cause it would show that you have indeed
fully and utterly lost touch with reality. No sane human being would take
Seagate at their word, seeing as we all experience many many more drive
failures within the first 10 years, let alone 20, to even remotely support
that outrageous MTBF claim.

All this goes to show -again- that you can easily make statistics which do
not resemble anything remotely possible in real life. Seagate determines
MTBF by setting up 1,200,000 disks, running them for one hour, applying
some magic extrapolation wizardry which should (but clearly doesn't)
properly account for aging, and hey presto, we've designed a drive with a
statistical average life expectancy of 137 years. Hurray.
Any reasonable person will ignore that MTBF as gibberish, and many people
would probably even go so far as to state that NONE of those drives will
still work after 137 years. (too bad there's no-one to collect the prize
money)

So, the trick Seagate does is akin to your trick of defining t as small as
you like and then proving that p goes to zero. Well, newsflash: you can't
determine anything useful from running 1000 drives for one hour, and
probably even less from running 3,600,000 drives for one second. The idea
alone is preposterous.

http://www.seagate.com/cda/products/discsales/marketing/detail/0,1081,551,00.html

> > Maybe p really is the failure
> > rate in 20 year old DAT tapes that were stored at 40 degrees C. Maybe it
> > is
>
> You simply are spouting nonsense. Please cease.
> It is _painful_. Like hearing somebody trying to sing when they cannot
> sing, or having to admire amateur holiday movies.

Nope. I want you to provide a formula which shows how likely a failure is.
It is entirely my prerogative to test that formula with media with a massive
failure rate. I want to build a raid-1 array out of 40 pieces of 5.25"
25-year old floppy drives, and who's stopping me.
What is my expected failure rate ?

> Rest of crank math removed. One can't reason with people who simply
> don't have the wherewithal to recognise that the problem is inside
> them.

This sentence could theoretically equally well apply to you, couldn't it ?

Maarten

^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 15:31 ` Maarten @ 2005-01-04 16:21 ` Peter T. Breuer 2005-01-04 20:55 ` maarten 2005-01-04 19:57 ` Mikael Abrahamsson 1 sibling, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 16:21 UTC (permalink / raw) To: linux-raid Maarten <maarten@ultratux.net> wrote: > I'll not even go into gambling, which is immensely popular. I'm sure there > are even mathematicians who gamble. How do you figure that ?? I know plenty who do. They win. A friend of mine made his living at the institute of advanced studies at princeton for two years after his grant ran out by winning at blackjack in casinos all over the states. (never play him at poker! I used to lose all my matchsticks ..) > > If they say it is 20 years and not 10 years, well I believe that too, > > but they must be keeping the monkeys out of the room. > > Nope, not 10 years, not 20 years, not even 40 years. See this Seagate sheet > below where they go on record with a whopping 1200.000 hours MTBF. That > translates to 137 years. I believe that too. They REALLY have kept the monkeys well away. They're only a factor of ten out from what I think it is, so I certainly believe them. And they probably discarded the ones that failed burn-in too. > Now can you please state here and now that you > actually believe that figure ? Of course. Why wouldn't I? They are stating something like 1% lossage per year under perfect ideal conditions, no dust, no power spikes, no a/c overloads, etc. I'd easily belueve that. > Cause it would show that you have indeed > fully and utterly lost touch with reality. No sane human being would take > seagate for their word seen as we all experience many many more drive > failures within the first 10 years, Of course we do. Why wouldn't we? That doesn't make their figures wrong! > let alone 20, to even remotely support > that outrageous MTBF claim. The number looks believable to me. Do they reboot every day? I doubt it. It's not outrageous. Just optimistic for real-world conditions. (And yes, I have ten year old disks, or getting on for it, and they still work). > All this goes to show -again- that you can easily make statistics which do not No, it means that statistics say what they say, and I understand them fine, thanks. > resemble anything remotely possible in real life. Seagate determines MTBF by > setting up 1.200.000 disks, running them for one hour, applying some magic > extrapolation wizardry which should (but clearly doesn't) properly account > for aging, and hey presto, we've designed a drive with a statistical average > life expectancy of 137 years. Hurray. That's a fine technique. It's perfectly OK. I suppose they did state the standard deviation of their estimator? > Any reasonable person will ignore that MTBF as gibberish, No they wouldn't - it looks a perfectly reasonable figure to me, just impossibly optimisitic for the real world, which contains dust, water vapour, mains spikes, reboots every day, static electrickery, and a whole load of other gubbins that doesn't figure in their tests at all. > and many people > would probably even state as much as that NONE of those drives will still > work after 137 years. (too bad there's no-one to collect the prize money) They wouldn't expect them to. If the mtbf is 137 years, then of a batch of 1000, approx 0.6 and a bit PERCENT would die per year. Now you get to multiply. 99.3^n % is ... 
well, anyway, it isn't linear, but they would all be expected to die out by
137y. Anyone got some logarithms?

> So, the trick seagate does is akin to your trick of defining t as small as you

Nonsense. Please stop this bizarre crackpottism of yours. I don't have
any numerical disabilities, and if you do, that's your problem, and it
should give you a guide where you need to work to improve.

> Nope. I want you to provide a formula which shows how likely a failure is.

That's your business. But it doesn't seem likely that you'll manage it.

> It is entirely my prerogative to test that formula with media with a massive
> failure rate. I want to build a raid-1 array out of 40 pieces of 5.25"
> 25-year old floppy drives, and who's stopping me.
> What is my expected failure rate ?

Oh, about 20 times the failure rate with one floppy. If the mtbf for one
floppy is x (so the probability of failure is p = 1/x per unit time), then
the raid will fail after two floppies die, which is expected to be at
APPROX 1/(40p) + 1/(39p) = x(1/40 + 1/39) or approximately x/20 units of
time from now (I should really tell you the expected time to the second
event in a poisson distro, but you can do that for me ..., I simply point
you to the crude calculation above as being roughly good enough). It will
last one twentieth as long as a single floppy (thanks to the redundancy).

> > Rest of crank math removed. One can't reason with people who simply
> > don't have the wherewithal to recognise that the problem is inside
> > them.
>
> This sentence could theoretically equally well apply to you, couldn't it ?

!!

Peter

^ permalink raw reply [flat|nested] 172+ messages in thread
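The crude calculation above can be checked by simulation. The sketch below (Python) assumes independent exponential drive lifetimes with mean x, and uses the reading given in the post, namely that the array is lost at the second failure; x, n and the trial count are arbitrary illustrative values.

    # Expected time to the second failure among n drives with independent
    # exponential lifetimes of mean x.  By memorylessness this is
    # x/n + x/(n-1), i.e. roughly x/20 for n = 40, as stated above.
    import random

    x = 1.0            # single-drive MTBF, arbitrary units (assumed)
    n = 40             # drives in the array (figure from the example above)
    trials = 100000

    total = 0.0
    for _ in range(trials):
        lifetimes = sorted(random.expovariate(1.0 / x) for _ in range(n))
        total += lifetimes[1]            # moment the second drive dies

    print("simulated E[T2] = %.4f x" % (total / trials))
    print("analytic  E[T2] = %.4f x" % (x / n + x / (n - 1)))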
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 16:21 ` Peter T. Breuer @ 2005-01-04 20:55 ` maarten 2005-01-04 21:11 ` Peter T. Breuer 2005-01-04 21:38 ` Peter T. Breuer 0 siblings, 2 replies; 172+ messages in thread From: maarten @ 2005-01-04 20:55 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 17:21, Peter T. Breuer wrote: > Maarten <maarten@ultratux.net> wrote: > > Nope, not 10 years, not 20 years, not even 40 years. See this Seagate > > sheet below where they go on record with a whopping 1200.000 hours MTBF. > > That translates to 137 years. > > I believe that too. They REALLY have kept the monkeys well away. > They're only a factor of ten out from what I think it is, so I certainly > believe them. And they probably discarded the ones that failed burn-in > too. > > > Now can you please state here and now that you > > actually believe that figure ? > > Of course. Why wouldn't I? They are stating something like 1% lossage > per year under perfect ideal conditions, no dust, no power spikes, no > a/c overloads, etc. I'd easily belueve that. No spindle will take 137 years of abuse at the incredibly high speed of 10000 rpm and not show enough wear so that the heads will either collide with the platters or read on adjacent tracks. Any mechanic can tell you this. I don't care what kind of special diamond bearings you use, it's just not feasible. We could even start a debate of how much decay we would see in the silicon junctions in the chips, but that is not useful nor on-topic. Let's just say that the transistor barely exists 50 years and it is utter nonsense to try to say anything meaningful about what 137 years will do to semiconductors and their molecular structures over that vast a timespan. Remember, it was not too long ago they said CDs were indestructible (by time elapsed, not by force, obviously). And look what they say now. I don't see where you come up with 1% per year. Remember that MTBF means MEAN time between failures, so for every single drive that dies in year one, one other drive has to double its life expectancy to twice 137, which is 274 years. If your reasoning is correct with one drive dying per year, the remaining bunch after 50 years will have to survive another 250(!) years, on average. ...But wait, you're still not convinced, eh ? Also, I'm not used to big data centers buying disks by the container, but from what I've heard no-one can actually say that they lose as little as 1 drive a year for any hundred drives bought. Those figures are (much) higher. You yourself said in a previous post you expected 10% per year, and that is WAY off the 1% mark you now state 'believeable'. How come ? > > Cause it would show that you have indeed > > fully and utterly lost touch with reality. No sane human being would > > take seagate for their word seen as we all experience many many more > > drive failures within the first 10 years, > > Of course we do. Why wouldn't we? That doesn't make their figures > wrong! Yes it does. By _definition_ even. It clearly shows that one cannot account for tens, nay hundreds, of years wear and tear by just taking a very small sample of drives and having them tested for a very small amount of time. Look, _everybody_ knows this. No serious admin will not change their drives after five years as a rule, or 10 years at the most. And that is not simply due to Moore's law. The failure rate just gets too high, and economics dictate that they must be decommissioned. 
After "only" 10 years...! > > let alone 20, to even remotely support > > that outrageous MTBF claim. > > The number looks believable to me. Do they reboot every day? I doubt Of course they don't. They never reboot. MTBF is not measured in adverse conditions. Even so, neither do disks in a data centre... > it. It's not outrageous. Just optimistic for real-world conditions. > (And yes, I have ten year old disks, or getting on for it, and they > still work). Some of em do, yes. Not all of them. (to be fair, the MTBF in those days was much lower than now (purportedly)). > > All this goes to show -again- that you can easily make statistics which > > do not > > No, it means that statistics say what they say, and I understand them > fine, thanks. Uh-huh. So explain to me why drive manufacturers do not give a 10 year warrantee. I say because they know full well that they would go bankrupt if they did since not 8% but rather 50% or more would return in that time. > > resemble anything remotely possible in real life. Seagate determines > > MTBF by setting up 1.200.000 disks, running them for one hour, applying > > some magic extrapolation wizardry which should (but clearly doesn't) > > properly account for aging, and hey presto, we've designed a drive with a > > statistical average life expectancy of 137 years. Hurray. > > That's a fine technique. It's perfectly OK. I suppose they did state > the standard deviation of their estimator? Call them and find out; you're the math whiz. And I'll say it again: if some statistical technique yields wildly different results than the observable, verifiable real world does, then there is something wrong with said technique, not with the real world. The real world is our frame of reference, not some dreamed-up math model which attempts to describe the world. And if they do collide, a math theory gets thrown out, not the real world observations instead...! > > Any reasonable person will ignore that MTBF as gibberish, > > No they wouldn't - it looks a perfectly reasonable figure to me, just > impossibly optimisitic for the real world, which contains dust, water > vapour, mains spikes, reboots every day, static electrickery, and a > whole load of other gubbins that doesn't figure in their tests at all. Test labs have dust, water vapours and mains spikes too, albeit as little as possible. They're testing on earth, not on some utopian other parallel world. Good colo's do a good job to eliminate most adverse effects. In any case, dust is not a great danger to disks (but it is to fans), heat is. Especially quick heat buildup, hence powercycles are amongst the worst. Drives don't really like the expansion of materials that occurs when temperatures rise, nor the extra friction that a higher temperature entails. > > and many people > > would probably even state as much as that NONE of those drives will still > > work after 137 years. (too bad there's no-one to collect the prize money) > > They wouldn't expect them to. If the mtbf is 137 years, then of a batch > of 1000, approx 0.6 and a bit PERCENT would die per year. Now you get > to multiply. 99.3^n % is ... well, anyway, it isn't linear, but they > would all be expected to die out by 137y. Anyone got some logarithms? Look up what the word "mean" from mtbf means, and recompute. Maarten -- When I answered where I wanted to go today, they just hung up -- Unknown ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 20:55 ` maarten @ 2005-01-04 21:11 ` Peter T. Breuer 0 siblings, 0 replies; 172+ messages in thread
From: Peter T. Breuer @ 2005-01-04 21:11 UTC (permalink / raw)
To: linux-raid

maarten <maarten@ultratux.net> wrote:
> On Tuesday 04 January 2005 17:21, Peter T. Breuer wrote:
> > Maarten <maarten@ultratux.net> wrote:
> >
> > > Nope, not 10 years, not 20 years, not even 40 years. See this Seagate
> > > sheet below where they go on record with a whopping 1200.000 hours MTBF.
> > > That translates to 137 years.
> >
> > I believe that too. They REALLY have kept the monkeys well away.
> > They're only a factor of ten out from what I think it is, so I certainly
> > believe them. And they probably discarded the ones that failed burn-in
> > too.
> >
> > > Now can you please state here and now that you
> > > actually believe that figure ?
> >
> > Of course. Why wouldn't I? They are stating something like 1% lossage
> > per year under perfect ideal conditions, no dust, no power spikes, no
> > a/c overloads, etc. I'd easily belueve that.
>
> No spindle will take 137 years of abuse at the incredibly high speed of 10000
> rpm and not show enough wear so that the heads will either collide with the

Nor does anyone say it will! That's the mtbf, that's all. It's a
parameter in a statistical distribution. The inverse of the probability
of failure per unit time.

Peter

^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 20:55 ` maarten 2005-01-04 21:11 ` Peter T. Breuer @ 2005-01-04 21:38 ` Peter T. Breuer 2005-01-04 23:29 ` Guy 1 sibling, 1 reply; 172+ messages in thread From: Peter T. Breuer @ 2005-01-04 21:38 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > I don't see where you come up with 1% per year. Because that is 1/137 approx (hey, is that planks constant or something...) > Remember that MTBF means MEAN > time between failures, I.e. it's the inverse of the probability of failure per unit time, in a Poisson distribution. A Poisson distribution only has one parameter and that's it! The standard deviation is that too. No, I don't recall the third moment offhand. > so for every single drive that dies in year one, one > other drive has to double its life expectancy to twice 137, which is 274 Complete nonsense. Please go back to remedial statistics. > years. If your reasoning is correct with one drive dying per year, the Who said that? I said the probability of failure is 1% per year. Not one drive per year! If you have a hundred drives, you expect about one death in the first year. > remaining bunch after 50 years will have to survive another 250(!) years, on > average. ...But wait, you're still not convinced, eh ? Complete and utter disgraceful nonsense! Did you even get as far as the 11-year old standard in your math? > Also, I'm not used to big data centers buying disks by the container, but from > what I've heard no-one can actually say that they lose as little as 1 drive a > year for any hundred drives bought. Those figures are (much) higher. Of course not - I would say 10% myself. A few years ago it was 20%, but I believe that recently the figure may have fallen as low as 5%. That's perfectly consistent with their spec. > You yourself said in a previous post you expected 10% per year, and that is > WAY off the 1% mark you now state 'believeable'. How come ? Because it is NOT "way off the 1% mark". It is close to it! Especially when you bear in mind that real disks are exposed to a much more stressful environment than the manufacturers testing labs. Heck, we can't even get FANs that last longer than 3 to 6 months in the atmosphere here (dry, thin, dusty, heat reaching 46C in summer, dropping below zero in wnter). Is the problem simply "numbers" with you? > > Of course we do. Why wouldn't we? That doesn't make their figures > > wrong! > > Yes it does. By _definition_ even. No it doesn't. > It clearly shows that one cannot account > for tens, nay hundreds, of years wear and tear by just taking a very small > sample of drives and having them tested for a very small amount of time. Nor does anyone suggest that one should! Where do you get this from? Of course their figures don't reflect your environment, or mine. If you want to duplicte their figures, you have to duplicate their environment! Ask them how, if you're interested. > Look, _everybody_ knows this. No serious admin will not change their drives > after five years as a rule, Well, three years is when we change, but that's because everything is changed every three years, since it depreciates to zero in that time, im accounting terms. But I have ten year old disks working fine (says he, wincing at the seagate fireballs and barracudas screaming ..). > or 10 years at the most. And that is not simply > due to Moore's law. 
The failure rate just gets too high, and economics > dictate that they must be decommissioned. After "only" 10 years...! Of course! So? I really don't see why you think that is anything to do with the mtbf, whhich is only the single parameter telling you what the scale of the poisson distribution for moment-to-moment failure is! I really don't get why you don't get this! Don't you know what the words mean? Then it's no wonder that whatever you say around the area makes very little sense, and why you have the feeling that THEY are saying nonsense, rather than that you are UNDERSTANDING nonsense, which is the case! Please go and learn some stats! > > No, it means that statistics say what they say, and I understand them > > fine, thanks. > > Uh-huh. So explain to me why drive manufacturers do not give a 10 year > warrantee. Because if they did they would have to replace 100% of their disks. If there is a 10y mtbf in the real world (as I estimate), then very few of them would make it to ten years. > I say because they know full well that they would go bankrupt if > they did since not 8% but rather 50% or more would return in that time. No, the mtbf in our conditions is somewhere like 10y. That means that almost none would make it to ten years. 10% would die each year. 90% would remain. After 5 years 59% would remain. After 10 years 35% would remain. > > That's a fine technique. It's perfectly OK. I suppose they did state > > the standard deviation of their estimator? > > Call them and find out; you're the math whiz. It doesn't matter. It's good enough as a guide. > And I'll say it again: if some statistical technique yields wildly different > results than the observable, verifiable real world does, then there is But it doesn't! > something wrong with said technique, not with the real world. They are not trying to estimate the mtbf in YOUR world, but in THEIRS. Those are different. If you want to emulate them, so be it! I don't. > The real world is our frame of reference, not some dreamed-up math model which There is nothing wrong with their model. It doesn't reflect your world. > attempts to describe the world. And if they do collide, a math theory gets > thrown out, not the real world observations instead...! You are horribly confused! Please do not try and tell real statisticians that YOU do not understand the model, and that therefore THEY should change them. You can simply understand them. They only are applying accelerating techniques. They take 1000 disks and run them for a year - if 10 die, then they know the mtbf is about 100y. That does not mean that disks will last a 100y! It means that 10 in every thousand will die within one year. That seems to be your major confusion. And that's only for starters. They then have to figure out how the mtbf changes with time! But really, you don't care about that, since it's only the mtbf during the first five years that you care about, as you said. So what are you on about? > > They wouldn't expect them to. If the mtbf is 137 years, then of a batch > > of 1000, approx 0.6 and a bit PERCENT would die per year. Now you get > > to multiply. 99.3^n % is ... well, anyway, it isn't linear, but they > > would all be expected to die out by 137y. Anyone got some logarithms? > > Look up what the word "mean" from mtbf means, and recompute. I know what it means - you don't. It is the inverse of the probability of failure in any moment of time. 
A strange way of stating that parameter, but then I guess it's just that
people are more used to seeing it expressed in ohms than mho.

Peter

^ permalink raw reply [flat|nested] 172+ messages in thread
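The survival figures quoted above (90% remaining after one year, 59% after 5 years, 35% after 10 years at a 10%-per-year failure rate) follow from the same constant-rate model. A small Python sketch, with the 10%-per-year and 137-year-MTBF numbers taken from the posts and everything else assumed for illustration:

    # Fraction of a batch still alive after a number of years, assuming a
    # constant annual failure probability (the model used in the thread).
    import math

    def survivors(annual_failure_prob, years):
        return (1.0 - annual_failure_prob) ** years

    cases = [("10% per year (real-world estimate)", 0.10),
             ("137-year MTBF", 1.0 - math.exp(-1.0 / 137.0))]

    for label, p in cases:
        print(label)
        for years in (1, 5, 10, 20):
            print("  after %2d years: %5.1f%% remain"
                  % (years, 100.0 * survivors(p, years)))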
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:38 ` Peter T. Breuer @ 2005-01-04 23:29 ` Guy 0 siblings, 0 replies; 172+ messages in thread From: Guy @ 2005-01-04 23:29 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid I think in this example, if you had 137 disks, you should expect an average of 1 failed drive per year. But, I would bet after 5 years you would have much more than 5 failed disks! Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. Breuer Sent: Tuesday, January 04, 2005 4:38 PM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) maarten <maarten@ultratux.net> wrote: > I don't see where you come up with 1% per year. Because that is 1/137 approx (hey, is that planks constant or something...) > Remember that MTBF means MEAN > time between failures, I.e. it's the inverse of the probability of failure per unit time, in a Poisson distribution. A Poisson distribution only has one parameter and that's it! The standard deviation is that too. No, I don't recall the third moment offhand. > so for every single drive that dies in year one, one > other drive has to double its life expectancy to twice 137, which is 274 Complete nonsense. Please go back to remedial statistics. > years. If your reasoning is correct with one drive dying per year, the Who said that? I said the probability of failure is 1% per year. Not one drive per year! If you have a hundred drives, you expect about one death in the first year. > remaining bunch after 50 years will have to survive another 250(!) years, on > average. ...But wait, you're still not convinced, eh ? Complete and utter disgraceful nonsense! Did you even get as far as the 11-year old standard in your math? > Also, I'm not used to big data centers buying disks by the container, but from > what I've heard no-one can actually say that they lose as little as 1 drive a > year for any hundred drives bought. Those figures are (much) higher. Of course not - I would say 10% myself. A few years ago it was 20%, but I believe that recently the figure may have fallen as low as 5%. That's perfectly consistent with their spec. > You yourself said in a previous post you expected 10% per year, and that is > WAY off the 1% mark you now state 'believeable'. How come ? Because it is NOT "way off the 1% mark". It is close to it! Especially when you bear in mind that real disks are exposed to a much more stressful environment than the manufacturers testing labs. Heck, we can't even get FANs that last longer than 3 to 6 months in the atmosphere here (dry, thin, dusty, heat reaching 46C in summer, dropping below zero in wnter). Is the problem simply "numbers" with you? > > Of course we do. Why wouldn't we? That doesn't make their figures > > wrong! > > Yes it does. By _definition_ even. No it doesn't. > It clearly shows that one cannot account > for tens, nay hundreds, of years wear and tear by just taking a very small > sample of drives and having them tested for a very small amount of time. Nor does anyone suggest that one should! Where do you get this from? Of course their figures don't reflect your environment, or mine. If you want to duplicte their figures, you have to duplicate their environment! Ask them how, if you're interested. > Look, _everybody_ knows this. 
No serious admin will not change their drives > after five years as a rule, Well, three years is when we change, but that's because everything is changed every three years, since it depreciates to zero in that time, im accounting terms. But I have ten year old disks working fine (says he, wincing at the seagate fireballs and barracudas screaming ..). > or 10 years at the most. And that is not simply > due to Moore's law. The failure rate just gets too high, and economics > dictate that they must be decommissioned. After "only" 10 years...! Of course! So? I really don't see why you think that is anything to do with the mtbf, whhich is only the single parameter telling you what the scale of the poisson distribution for moment-to-moment failure is! I really don't get why you don't get this! Don't you know what the words mean? Then it's no wonder that whatever you say around the area makes very little sense, and why you have the feeling that THEY are saying nonsense, rather than that you are UNDERSTANDING nonsense, which is the case! Please go and learn some stats! > > No, it means that statistics say what they say, and I understand them > > fine, thanks. > > Uh-huh. So explain to me why drive manufacturers do not give a 10 year > warrantee. Because if they did they would have to replace 100% of their disks. If there is a 10y mtbf in the real world (as I estimate), then very few of them would make it to ten years. > I say because they know full well that they would go bankrupt if > they did since not 8% but rather 50% or more would return in that time. No, the mtbf in our conditions is somewhere like 10y. That means that almost none would make it to ten years. 10% would die each year. 90% would remain. After 5 years 59% would remain. After 10 years 35% would remain. > > That's a fine technique. It's perfectly OK. I suppose they did state > > the standard deviation of their estimator? > > Call them and find out; you're the math whiz. It doesn't matter. It's good enough as a guide. > And I'll say it again: if some statistical technique yields wildly different > results than the observable, verifiable real world does, then there is But it doesn't! > something wrong with said technique, not with the real world. They are not trying to estimate the mtbf in YOUR world, but in THEIRS. Those are different. If you want to emulate them, so be it! I don't. > The real world is our frame of reference, not some dreamed-up math model which There is nothing wrong with their model. It doesn't reflect your world. > attempts to describe the world. And if they do collide, a math theory gets > thrown out, not the real world observations instead...! You are horribly confused! Please do not try and tell real statisticians that YOU do not understand the model, and that therefore THEY should change them. You can simply understand them. They only are applying accelerating techniques. They take 1000 disks and run them for a year - if 10 die, then they know the mtbf is about 100y. That does not mean that disks will last a 100y! It means that 10 in every thousand will die within one year. That seems to be your major confusion. And that's only for starters. They then have to figure out how the mtbf changes with time! But really, you don't care about that, since it's only the mtbf during the first five years that you care about, as you said. So what are you on about? > > They wouldn't expect them to. If the mtbf is 137 years, then of a batch > > of 1000, approx 0.6 and a bit PERCENT would die per year. 
Now you get > > to multiply. 99.3^n % is ... well, anyway, it isn't linear, but they > > would all be expected to die out by 137y. Anyone got some logarithms? > > Look up what the word "mean" from mtbf means, and recompute. I know what it means - you don't. It is the inverse of the probability of failure in any moment of time. A strange way of stating that parameter, but then I guess it's just that people are more used to seeing it expresed in ohms than mho. Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 15:31 ` Maarten 2005-01-04 16:21 ` Peter T. Breuer @ 2005-01-04 19:57 ` Mikael Abrahamsson 2005-01-04 21:05 ` maarten 1 sibling, 1 reply; 172+ messages in thread From: Mikael Abrahamsson @ 2005-01-04 19:57 UTC (permalink / raw) To: linux-raid On Tue, 4 Jan 2005, Maarten wrote: > failures within the first 10 years, let alone 20, to even remotely support > that outrageous MTBF claim. One should note that environment seriously affects MTBF, even on non-movable parts, and probably even more on movable parts. I've talked to people in the reliability business, and they use models that say that MTBF for a part at 20 C as opposed to 40 C can differ by a factor of 3 or 4, or even more. A lot of people skimp on cooling and then get upset when their drives fail. I'd venture to guess that a drive that has an MTBF of 1.2M at 25C will have less than 1/10th of that at 55-60C. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 172+ messages in thread
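Mikael's rule of thumb (roughly a factor of 3 to 4 in MTBF for every 20 C of extra heat) can be written down as a one-line derating function. In the sketch below the 3.5x-per-20C factor, the 25 C reference point and the function name are illustrative assumptions, not manufacturer data:

    # Rough MTBF derating with temperature, using the factor-per-20C rule
    # of thumb quoted above.  Purely illustrative.
    def derated_mtbf(mtbf_hours_at_ref, temp_c, ref_temp_c=25.0,
                     factor_per_20c=3.5):
        return mtbf_hours_at_ref / factor_per_20c ** ((temp_c - ref_temp_c) / 20.0)

    for t in (25, 40, 55, 60):
        print("%2d C: MTBF ~ %.0f hours" % (t, derated_mtbf(1.2e6, t)))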
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 19:57 ` Mikael Abrahamsson @ 2005-01-04 21:05 ` maarten 2005-01-04 21:26 ` Alvin Oga 2005-01-04 21:46 ` Guy 0 siblings, 2 replies; 172+ messages in thread From: maarten @ 2005-01-04 21:05 UTC (permalink / raw) To: linux-raid On Tuesday 04 January 2005 20:57, Mikael Abrahamsson wrote: > On Tue, 4 Jan 2005, Maarten wrote: > > failures within the first 10 years, let alone 20, to even remotely > > support that outrageous MTBF claim. > > One should note that environment seriously affects MTBF, even on > non-movable parts, and probably even more on movable parts. Yes. Heat especially above all else. > I've talked to people in the reliability business, and they use models > that say that MTBF for a part at 20 C as opposed to 40 C can differ by a > factor of 3 or 4, or even more. A lot of people skimp on cooling and then > get upset when their drives fail. > > I'd venture to guess that a drive that has an MTBF of 1.2M at 25C will > have less than 1/10th of that at 55-60C. Yes. I know that full well. Therefore my server drives are mounted directly behind two monstrous 12cm fans... I don't take no risks. :-) Still, two western digitals have died within the first or second year in that enclosure. So much for MTBF vs. real world expectancy I guess. It should be public knowledge by now that heat is the number 1 killer for harddisks. However, you still see PC cases everywhere where disks are sandwiched together and with no possible airflow at all. Go figure... Maarten -- ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:05 ` maarten @ 2005-01-04 21:26 ` Alvin Oga 2005-01-04 21:46 ` Guy 1 sibling, 0 replies; 172+ messages in thread From: Alvin Oga @ 2005-01-04 21:26 UTC (permalink / raw) To: maarten; +Cc: linux-raid On Tue, 4 Jan 2005, maarten wrote: > Yes. I know that full well. Therefore my server drives are mounted directly > behind two monstrous 12cm fans... I don't take no risks. :-) exactly... lots of air for the drives ( treat it like a cpu ) that it should be kept cool as possible > Still, two western digitals have died within the first or second year in that > enclosure. So much for MTBF vs. real world expectancy I guess. wd is famous for various reasons .. > It should be public knowledge by now that heat is the number 1 killer for > harddisks. However, you still see PC cases everywhere where disks are > sandwiched together and with no possible airflow at all. Go figure... its a conspiracy, to get you/us to buy new disks when the old one dies but if we all kept a 3" fan cooling each disk ... inside the pcs, there'd be less disk failures - and equal amounts of fresh cooler air coming in as hot air going out c ya alvin ^ permalink raw reply [flat|nested] 172+ messages in thread
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-04 21:05 ` maarten 2005-01-04 21:26 ` Alvin Oga @ 2005-01-04 21:46 ` Guy 1 sibling, 0 replies; 172+ messages in thread From: Guy @ 2005-01-04 21:46 UTC (permalink / raw) To: 'maarten', linux-raid I have a PC with 2 disks, these disks are much too hot to touch for more than a second or less. The system has been like that for 3-4 years. I have no idea how they lasted so long! 1 is an IBM the other is Seagate. Both are 18 Gig SCSI disks. The Seagate is 10,000 RPM. As you said: "Go figure..."! :) Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Tuesday, January 04, 2005 4:05 PM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) On Tuesday 04 January 2005 20:57, Mikael Abrahamsson wrote: > On Tue, 4 Jan 2005, Maarten wrote: > > failures within the first 10 years, let alone 20, to even remotely > > support that outrageous MTBF claim. > > One should note that environment seriously affects MTBF, even on > non-movable parts, and probably even more on movable parts. Yes. Heat especially above all else. > I've talked to people in the reliability business, and they use models > that say that MTBF for a part at 20 C as opposed to 40 C can differ by a > factor of 3 or 4, or even more. A lot of people skimp on cooling and then > get upset when their drives fail. > > I'd venture to guess that a drive that has an MTBF of 1.2M at 25C will > have less than 1/10th of that at 55-60C. Yes. I know that full well. Therefore my server drives are mounted directly behind two monstrous 12cm fans... I don't take no risks. :-) Still, two western digitals have died within the first or second year in that enclosure. So much for MTBF vs. real world expectancy I guess. It should be public knowledge by now that heat is the number 1 killer for harddisks. However, you still see PC cases everywhere where disks are sandwiched together and with no possible airflow at all. Go figure... Maarten -- - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 17:46 ` maarten 2005-01-03 19:52 ` maarten @ 2005-01-03 20:22 ` Peter T. Breuer 2005-01-03 23:05 ` Guy 2005-01-04 0:08 ` maarten 2005-01-03 21:36 ` Guy 2 siblings, 2 replies; 172+ messages in thread From: Peter T. Breuer @ 2005-01-03 20:22 UTC (permalink / raw) To: linux-raid maarten <maarten@ultratux.net> wrote: > The chance of a PSU blowing up or lightning striking is, reasonably, much less > than an isolated disk failure. If this simple fact is not true for you Oh? We have about 20 a year. Maybe three of them are planned. But those are the worst ones! - the electrical department's method of "testing" the lines is to switch off the rails then pulse them up and down. Surge tests or something. When we can we switch everything off beforehand. But then we also get to deal with the amateur contributions from the city power people. Yes, my PhD is in electrical engineering. Have I sent them sarcastic letters explaining how to test lines using a dummy load? Yes. Does the physics department also want to place them in a vat of slowly reheating liquid nitrogen? Yes. Does it make any difference? No. I should have kept the letter I got back when I asked them exactly WHAT it was they thought they had been doing when they sent round a pompous letter explaining how they had been up all night "helping" the town power people to get back on line, after an outage took out the half-million or so people round here. Waiting for the phonecall saying "you can turn it back on now", I think. That letter was a riot. I plug my stuff into the ordinary mains myself. It fails less often than the "secure circuit" plugs we have that are meant to be wired to their smoking giant UPS that apparently takes half the city output to power up. > personally, you really ought to reevaluate the quality of your PSU (et al) > and / or the buildings' defenses against a lightning strike... I don't think so. You can argue with the guys with the digger tool and a weekend free. > > However, I don't see how you can expect to replace a failed disk > > without taking down the system. For that reason you are expected to be > > running "spare disks" that you can virtually insert hot into the array > > (caveat, it is possible with scsi, but you will need to rescan the bus, > > which will take it out of commission for some seconds, which may > > require you to take the bus offline first, and it MAY be possible with > > recent IDE buses that purport to support hotswap - I don't know). > > I think the point is not what actions one has to take at time T+1 to replace > the disk, but rather whether at time T, when the failure first occurs, the > system survives the failure or not. > > > (1) how likely is it that a disk will fail without taking down the system > > (2) how likely is it that a disk will fail > > (3) how likely is it that a whole system will fail > > > > I would say that (2) is about 10% per year. I would say that (3) is > > about 1200% per year. It is therefore difficult to calculate (1), which > > is your protection scenario, since it doesn't show up very often in the > > stats! > > I don't understand your math. For one, percentage is measured from 0 to 100, No, it's measured from 0 to infinity. Occasionally from negative infinity to positive infinity. Did I mention that I have two degrees in pure mathematics? We can discuss nonstandard interpretations of Peano's axioms then. > not from 0 to 1200. 
What is that, 12 twelve times 'absolute certainty' that > something will occur ? Yep. Approximately. Otherwise known as the expectaion that twelve events will occur per year. One a month. I would have said "one a month" if I had not been being precise. > But besides that, I'd wager that from your list number (3) has, by far, the > smallest chance of occurring. Except of course, that you would lose, since not only did I SAY that it had the highest chance, but I gave a numerical estimate for it that is 120 times as high as that I gave for (1). > Choosing between (1) and (2) is more difficult, Well, I said it doesn't matter, because everything is swamped by (3). > my experiences with IDE disks are definitely that it will take the system > down, but that is very biased since I always used non-mirrored swap. It's the same principle. There exists a common mode for failure. Bayesian calculations then tell you that there is a strong liklihood of the whole system coming down in conjunction with the disk coming down. > I sure can understand a system dying if it loses part of its memory... > > > > ** A disk failing is the most common failure a system can have (IMO). > > I fully agree. > > > Not in my experience. See above. I'd say each disk has about a 10% > > failure expectation per year. Whereas I can guarantee that an > > unexpected system failure will occur about once a month, on every > > important system. There you are. I said it again. > Whoa ! What are you running, windows perhaps ?!? ;-) No. Ordinary hardware. > No but seriously, joking aside, you have 12 system failures per year ? At a very minimum. Almost none of them caused by hardware. Hey, I even took down my own home server by accident over new year! Spoiled its 222 day uptime. > I would not be alone in thinking that figure is VERY high. My uptimes It isn't. A random look at servers tells me: bajo up 77+00:23, 1 user, load 0.28, 0.39, 0.48 balafon up 25+08:30, 0 users, load 0.47, 0.14, 0.05 dino up 77+01:15, 0 users, load 0.00, 0.00, 0.00 guitarra up 19+02:15, 0 users, load 0.20, 0.07, 0.04 itserv up 77+11:31, 0 users, load 0.01, 0.02, 0.01 itserv2 up 20+00:40, 1 user, load 0.05, 0.13, 0.16 lmserv up 77+11:32, 0 users, load 0.34, 0.13, 0.08 lmserv2 up 20+00:49, 1 user, load 0.14, 0.20, 0.23 nbd up 24+04:12, 0 users, load 0.08, 0.08, 0.02 oboe up 77+02:39, 3 users, load 0.00, 0.00, 0.00 piano up 77+11:55, 0 users, load 0.00, 0.00, 0.00 trombon up 24+08:14, 2 users, load 0.00, 0.00, 0.00 violin up 77+12:00, 4 users, load 0.00, 0.00, 0.00 xilofon up 73+01:08, 0 users, load 0.00, 0.00, 0.00 xml up 33+02:29, 5 users, load 0.60, 0.64, 0.67 (one net). Looks like a major power outage 77 days ago, and a smaller event 24 and 20 days ago. The event at 20 days ago looks like sysadmins. Both Trombon and Nbd survived it and tey're on separate (different) UPSs. The servers which are up 77 days are on a huge UPS that Lmserv2 and Itserv2 should also be on, as far as I know. So somebody took them off the UPS wihin ten minutes of each other. Looks like maintenance moving racks. OK, not once every month, more like between onece every 20 days and once every 77 days, say once every 45 days. > generally are in the three-digit range, and most *certainly* not in the low > 2-digit range. Well, they have no chance to be here. There are several planned power outs a year for the electrical department to do their silly tricks with. When that happens they take the weekend over it. 
> > If you think about it that is quite likely, since a system is by > > definition a complicated thing. And then it is subject to all kinds of > > horrible outside influences, like people rewiring the server room in > > order to reroute cables under the floor instead of through he ceiling, > > and the maintenance people spraying the building with insecticide, > > everywhere, or just "turning off the electricity in order to test it" > > (that happens about four times a year here - hey, I remember when they > > tested the giant UPS by turning off the electricity! Wrong switch. > > Bummer). > > If you have building maintenance people and other random staff that can access > your server room unattended and unmonitored, you have far worse problems than > making decicions about raid lavels. IMNSHO. Oh, they most certainly can't access the server rooms. The techs would have done that on their own, but they would (obviously) have needed to move the machines for that, and turn them off. Ah . But yes, the guy with the insecticide has the key to everywhere, and is probably a gardener. I've seen him at it. He sprays all the corners of the corridors, along the edge of the wall and floor, then does the same inside the rooms. The point is that most foul-ups are created by the humans, whether technoid or gardenoid, or hole-diggeroid. > By your description you could almost be the guy the joke with the recurring 7 > o'clock system crash is about (where the cleaning lady unplugs the server > every morning in order to plug in her vacuum cleaner) ;-) Oh, the cleaning ladies do their share of damage. They are required BY LAW to clean the keyboards. They do so by picking them up in their left hand at the lower left corner, and rubbing a rag over them. Their left hand is where the ctl and alt keys are. Solution is not to leave keyboard in the room. Use a whaddyamacallit switch and attach one keyboard to that whenever one needs to access anything.. Also use thwapping great power cables one inch thck that they cannot move. And I won't mention the learning episodes with the linux debugger monitor activated by pressing "pause". Once I watched the lady cleaning my office. She SPRAYED the back of the monitor! I YELPED! I tried to explain to her about voltages, and said that she would't clean her tv at home that way - oh yes she did! > > Yes, you can try and keep these systems out of harms way on a > > colocation site, or something, but by then you are at professional > > level paranoia. For "home systems", whole system failures are far more > > common than disk failures. > > Don't agree. You may not agree, but you would be rather wrong in persisting in that idea in face of evidence that you can easily accumulate yourself, like the figures I randomly checked above. > Not only do disk failures occur more often than full system > failures, No they don't - by about 12 to 1. > disk failures are also much more time-consuming to recover from. No they aren't - we just put in another one, and copy the standard image over it (or in the case of a server, copy its twin, but then servers don't blow disks all that often, but when they do they blow ALL of them as well, as whatever blew one will blow the others in due course - likely heat). > Compare changing a system board or PSU with changing a drive and finding, > copying and verifying a backup (if you even have one that's 100% up to date) We have. For one thing we have identical pairs of servers, abslutely equal, md5summed and checked. 
The idenity-dependent scripts on them check who they are on and do the approprate thing depending on who they find they are on. And all the clients are the same, as clients. Checked daily. > > > ** In a computer room with about 20 Unix systems, in 1 year I have seen > > > 10 or so disk failures and no other failures. > > > > Well, let's see. If each system has 2 disks, then that would be 25% per > > disk per year, which I would say indicates low quality IDE disks, but > > is about the level I would agree with as experiential. > > The point here was, disk failures being more common than other failures... But they aren't. If you have only 25% chance of failure per disk per year, then that makes system outages much more likely, since they happen at about one per month (here!). If it isn't faulty scsi cables, it will be overheating cpus. Dust in the very dry air here kills all fan bearings within 6 months to one year. My defense against that is to heavily underclock all machines. > > > No way! I hate tapes. I backup to other disks. > > Then for your sake, I hope they're kept offline, in a safe. No, they're kept online. Why? What would be the point of having them in a safe? Then they'd be unavailable! The general scheme is that sites cross-backup each other. > > > > ** My computer room is for development and testing, no customer access. > > > > Unfortunately, the admins do most of the sabotage. > > Change admins. Can't. They're as good as they get. Hey, *I* even do the sabotage sometimes. I'm probably only abut 99% accurate, and I can certainly write a hundred commands in a day. > I could understand an admin making typing errors and such, but then again that > would not usually lead to a total system failure. Of course it would. You try working remotely to upgrade the sshd and finally killing off the old one, only to discover that you kill the wrong one and lock yourself out, while the deadman script on the server tries fruitlessly to restart a misconfigured server, and then finally decides after an hour to give up and reboot as a last resort, then can't bring itself back up because of something else you did that you were intending to finish but didn't get the opportunity to. > Some daemon not working, > sure. Good admins review or test their changes, And sometimes miss the problem. > for one thing, and in most > cases any such mistake is rectified much simpler and faster than a failed > disk anyway. Except maybe for lilo errors with no boot media available. ;-\ Well, you can go out to the site in the middle of the night to reboot! Changes are made out of working hours so as not to disturb the users. > > Yes you did. You can see from the quoting that you did. > > Or the quoting got messed up. That is known to happen in threads. Shrug. > > > but it may be more current than 1 or > > > more of the other disks. But this would be similar to what would happen > > > to a non-RAID disk (some data not written). > > > > No, it would not be similar. You don't seem to understand the > > mechanism. The mechanism for corruption is that there are two different > > versions of the data available when the system comes back up, and you > > and the raid system don't know which is more correct. Or even what it > > means to be "correct". Maybe the earlier written data is "correct"! > > That is not the whole truth. To be fair, the mechanism works like this: > With raid, you have a 50% chance the wrong, corrupted, data is used. 
> Without raid, thus only having a single disk, the chance of using the
> corrupted data is 100% (obviously, since there is only one source)

That is one particular spin on it.

> Or, much more elaborate:
>
> Let's assume the chance of a disk corruption occurring is 50%, ie. 0.5

There's no need to. Call it "p".

> With raid, you always have a 50% chance of reading faultty data IF one of the
> drives holds faulty data.

That is, the probability of corruption occurring and it THEN being
detected is 0.5. However, the probability that it occurred is 2p, not p,
since there are two disks (forget the tiny p^2 possibility). So we have
p = probability of corruption occurring AND it being detected.

> For the drives itself, the chance of both disks
> being wrong is 0.5x0.5=0.25 (scenario A). Similarly, 25 % chance both disks
> are good (scenario B). The chance of one of the disks being wrong is 50%
> (scenarios C & D together). In scenarios A & B the outcome is certain. In
> scenarios C & D the chance of the raid choosing the false mirror is 50%.
> Accumulating those chances one can say that the chance of reading false data
> is:
> in scenario A: 100%

p^2

> in scenario B: 0%

0

> scenario C: 50%

0.5p

> scenario D: 50%

0.5p

> Doing the math, the outcome is still (200% divided by four) = 50%.

Well, it's p + p^2. But I said to neglect the square term.

> Ergo: the same as with a single disk. No change.

Except that it is not the case. With a single disk you are CERTAIN to
detect the problem (if it is detectable) when you run the fsck at
reboot. With a RAID1 mirror you are only 50% likely to detect the
detectable problem, because you may choose to read the "wrong"
(correct :) disk at the crucial point in the fsck. Then you have to
hope that the right disk fails next, when it fails, or else you will be
left holding the detectably wrong, unchecked data.

So in the scenario of a single detectable corruption:

A: probability of a detectable error occurring and NOT being detected
   on a single disk system is zero

B: probability of a detectable error occurring and NOT being detected
   on a two disk system is p

Cute, no? You could have deduced that from your figures too, but you
were all fired up about the question of a detectable error occurring
AND being detected to think about it occurring AND NOT being detected.
Even though that is what interests us! "silent corruption".

> > > In contrast, on a single disk they have a 100% chance of detection (if
> > > you look!) and a 100% chance of occuring, wrt normal rate.
> > > ** Are you talking about the disk drive detecting the error?
>
> No, you have a zero chance of detection, since there is nothing to compare TO.

That is not the case. You have every chance in the world of detecting
it - you know what fsck does. If you like we can consider detectable
and undetectable errors separately.

> Raid-1 at least gives you a 50/50 chance to choose the right data. With a
> single disk, the chance of reusing the corrupted data is 100% and there is no
> mechanism to detect the odd 'tumbled bit' at all.

False.

> > You wouldn't necesarily know which of the two data sources was
> > "correct".
>
> No, but you have a theoretical choice, and a 50% chance of being right.
> Not so without raid, where you get no choice, and a 100% chance of getting the
> wrong data, in the case of a corruption.

Review the calculation.

Peter

^ permalink raw reply [flat|nested] 172+ messages in thread
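The "present but undetected" argument above is simple enough to check with a toy Monte Carlo. The Python sketch below models only the assumptions used in the exchange, not md's real behaviour: each copy is independently corrupted with probability p, an fsck on a mirror effectively reads one copy chosen at random, and a detectable error on the copy actually read is always caught; p and the trial count are arbitrary illustrative values.

    # Toy model: probability that a detectable corruption exists but goes
    # unnoticed, single disk versus two-way mirror, per the argument above.
    import random

    p = 0.01           # per-copy corruption probability (illustrative)
    trials = 200000

    mirror_missed = 0
    for _ in range(trials):
        # Single disk: whatever error exists is on the copy that gets read,
        # so a detectable error is never "present but missed".

        # Two-way mirror:
        bad = [random.random() < p, random.random() < p]
        if any(bad):
            read = random.randrange(2)       # the copy fsck happens to read
            if not bad[read]:                # the error sits on the other copy
                mirror_missed += 1

    print("single disk: present-but-missed ~ 0")
    print("mirror     : present-but-missed ~ %.5f  (argument says ~p = %.5f)"
          % (mirror_missed / trials, p))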
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 20:22 ` Peter T. Breuer @ 2005-01-03 23:05 ` Guy 2005-01-04 0:08 ` maarten 1 sibling, 0 replies; 172+ messages in thread From: Guy @ 2005-01-03 23:05 UTC (permalink / raw) To: 'Peter T. Breuer', linux-raid Peter said: "Except that it is not the case. With a single disk you are CERTAIN to detect the problem (if it is detectable) when you run the fsck at reboot." Guy says: As a guess, fsck checks less than 1% of the disk. No user data is checked. So, virtually all errors would go un-detected. But a RAID system could detect the errors. Any yes, RAID6 could correct a single disk error. Even multi disk errors, as long as only 1 error per stripe/block. Your data center has problems well beyond the worse stories I have ever heard! My home systems tend to have much better uptime than any of your systems. 5:05pm up 33 days, 16:31, 1 user, load average: 0.12, 0.03, 0.01 17:05:20 up 28 days, 15:06, 1 user, load average: 0.03, 0.04, 0.00 Both were re-booted by me, not some strange failures. When I re-booted the first one, it had over 7 months of uptime. At work I had 2 systems with over 2 years uptime, and one of them made it to 3 years (yoda). The 3 year system was connected to the Internet and was used for customer demos. So, very low usage, but the 2 year system (lager) was an internal email server, test server, router, had Informix, ... I must admit, I rebooted the 2 year system by accident! But it was a proper reboot, not a crash. The Y2K patches did not require a re-boot. This is from an email I sent 10/26/2000: "Subject: Happy birthday Yoda! 6:13pm up 730 days, 23:59, 1 user, load average: 0.10, 0.12, 0.12 6:14pm up 731 days, 1 user, load average: 0.08, 0.12, 0.12" It was a leap year! So 1 extra day. Sent 7/5/2001: "Kathy, Yoda will be 1000 days up Jul 22 @ 18:14, thats a Sunday. Guy" We did give yoda a 3 year birthday party. It was not my idea. I told my wife about the party, my wife baked cupcakes. Again, not my idea! We moved to a different building, everything had to be shutdown. I sent this email 6/20/2002: "The final up-time report! Yoda wins with 3.64 years! Yoda will be shut down in a few minutes! comet 4:29pm up 6 days, 49 mins, 15 users, load average: 1.31, 0.95, 0.53 trex 4:31pm up 38 days, 16:34, 13 users, load average: 0.04, 0.02, 0.02 yoda 4:31pm up 1332 days, 22:17, 0 users, load average: 0.17, 0.13, 0.12 falcon 4:31pm up 38 days, 16:29, 14 users, load average: 0.00, 0.00, 0.01 right 4:31pm up 38 days, 15:42, 7 users, load average: 2.28, 1.86, 1.68 saturn 4:31pm up 16 days, 6:14, 1 user, load average: 0.01, 0.02, 0.02 citi1 4:31pm up 63 days, 23:20, 7 users, load average: 0.27, 0.23, 0.21 lager 4:46pm up 606 days, 1:47, 5 users, load average: 0.00, 0.00, 0.00 Guy" As you can see, some of us don't have major problem as you describe. Oh, somewhere you said you have on-line disk backups. This is very bad. If you had a lighting strike, or fire, it could takeout all of your disks at the same time. Your backup copies should be off-site, tape or disk, it does not matter. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Peter T. 
Breuer Sent: Monday, January 03, 2005 3:22 PM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) maarten <maarten@ultratux.net> wrote: > The chance of a PSU blowing up or lightning striking is, reasonably, much less > than an isolated disk failure. If this simple fact is not true for you Oh? We have about 20 a year. Maybe three of them are planned. But those are the worst ones! - the electrical department's method of "testing" the lines is to switch off the rails then pulse them up and down. Surge tests or something. When we can we switch everything off beforehand. But then we also get to deal with the amateur contributions from the city power people. Yes, my PhD is in electrical engineering. Have I sent them sarcastic letters explaining how to test lines using a dummy load? Yes. Does the physics department also want to place them in a vat of slowly reheating liquid nitrogen? Yes. Does it make any difference? No. I should have kept the letter I got back when I asked them exactly WHAT it was they thought they had been doing when they sent round a pompous letter explaining how they had been up all night "helping" the town power people to get back on line, after an outage took out the half-million or so people round here. Waiting for the phonecall saying "you can turn it back on now", I think. That letter was a riot. I plug my stuff into the ordinary mains myself. It fails less often than the "secure circuit" plugs we have that are meant to be wired to their smoking giant UPS that apparently takes half the city output to power up. > personally, you really ought to reevaluate the quality of your PSU (et al) > and / or the buildings' defenses against a lightning strike... I don't think so. You can argue with the guys with the digger tool and a weekend free. > > However, I don't see how you can expect to replace a failed disk > > without taking down the system. For that reason you are expected to be > > running "spare disks" that you can virtually insert hot into the array > > (caveat, it is possible with scsi, but you will need to rescan the bus, > > which will take it out of commission for some seconds, which may > > require you to take the bus offline first, and it MAY be possible with > > recent IDE buses that purport to support hotswap - I don't know). > > I think the point is not what actions one has to take at time T+1 to replace > the disk, but rather whether at time T, when the failure first occurs, the > system survives the failure or not. > > > (1) how likely is it that a disk will fail without taking down the system > > (2) how likely is it that a disk will fail > > (3) how likely is it that a whole system will fail > > > > I would say that (2) is about 10% per year. I would say that (3) is > > about 1200% per year. It is therefore difficult to calculate (1), which > > is your protection scenario, since it doesn't show up very often in the > > stats! > > I don't understand your math. For one, percentage is measured from 0 to 100, No, it's measured from 0 to infinity. Occasionally from negative infinity to positive infinity. Did I mention that I have two degrees in pure mathematics? We can discuss nonstandard interpretations of Peano's axioms then. > not from 0 to 1200. What is that, 12 twelve times 'absolute certainty' that > something will occur ? Yep. Approximately. Otherwise known as the expectaion that twelve events will occur per year. One a month. 
I would have said "one a month" if I had not been being precise. > But besides that, I'd wager that from your list number (3) has, by far, the > smallest chance of occurring. Except of course, that you would lose, since not only did I SAY that it had the highest chance, but I gave a numerical estimate for it that is 120 times as high as that I gave for (1). > Choosing between (1) and (2) is more difficult, Well, I said it doesn't matter, because everything is swamped by (3). > my experiences with IDE disks are definitely that it will take the system > down, but that is very biased since I always used non-mirrored swap. It's the same principle. There exists a common mode for failure. Bayesian calculations then tell you that there is a strong liklihood of the whole system coming down in conjunction with the disk coming down. > I sure can understand a system dying if it loses part of its memory... > > > > ** A disk failing is the most common failure a system can have (IMO). > > I fully agree. > > > Not in my experience. See above. I'd say each disk has about a 10% > > failure expectation per year. Whereas I can guarantee that an > > unexpected system failure will occur about once a month, on every > > important system. There you are. I said it again. > Whoa ! What are you running, windows perhaps ?!? ;-) No. Ordinary hardware. > No but seriously, joking aside, you have 12 system failures per year ? At a very minimum. Almost none of them caused by hardware. Hey, I even took down my own home server by accident over new year! Spoiled its 222 day uptime. > I would not be alone in thinking that figure is VERY high. My uptimes It isn't. A random look at servers tells me: bajo up 77+00:23, 1 user, load 0.28, 0.39, 0.48 balafon up 25+08:30, 0 users, load 0.47, 0.14, 0.05 dino up 77+01:15, 0 users, load 0.00, 0.00, 0.00 guitarra up 19+02:15, 0 users, load 0.20, 0.07, 0.04 itserv up 77+11:31, 0 users, load 0.01, 0.02, 0.01 itserv2 up 20+00:40, 1 user, load 0.05, 0.13, 0.16 lmserv up 77+11:32, 0 users, load 0.34, 0.13, 0.08 lmserv2 up 20+00:49, 1 user, load 0.14, 0.20, 0.23 nbd up 24+04:12, 0 users, load 0.08, 0.08, 0.02 oboe up 77+02:39, 3 users, load 0.00, 0.00, 0.00 piano up 77+11:55, 0 users, load 0.00, 0.00, 0.00 trombon up 24+08:14, 2 users, load 0.00, 0.00, 0.00 violin up 77+12:00, 4 users, load 0.00, 0.00, 0.00 xilofon up 73+01:08, 0 users, load 0.00, 0.00, 0.00 xml up 33+02:29, 5 users, load 0.60, 0.64, 0.67 (one net). Looks like a major power outage 77 days ago, and a smaller event 24 and 20 days ago. The event at 20 days ago looks like sysadmins. Both Trombon and Nbd survived it and tey're on separate (different) UPSs. The servers which are up 77 days are on a huge UPS that Lmserv2 and Itserv2 should also be on, as far as I know. So somebody took them off the UPS wihin ten minutes of each other. Looks like maintenance moving racks. OK, not once every month, more like between onece every 20 days and once every 77 days, say once every 45 days. > generally are in the three-digit range, and most *certainly* not in the low > 2-digit range. Well, they have no chance to be here. There are several planned power outs a year for the electrical department to do their silly tricks with. When that happens they take the weekend over it. > > If you think about it that is quite likely, since a system is by > > definition a complicated thing. 
And then it is subject to all kinds of > > horrible outside influences, like people rewiring the server room in > > order to reroute cables under the floor instead of through he ceiling, > > and the maintenance people spraying the building with insecticide, > > everywhere, or just "turning off the electricity in order to test it" > > (that happens about four times a year here - hey, I remember when they > > tested the giant UPS by turning off the electricity! Wrong switch. > > Bummer). > > If you have building maintenance people and other random staff that can access > your server room unattended and unmonitored, you have far worse problems than > making decicions about raid lavels. IMNSHO. Oh, they most certainly can't access the server rooms. The techs would have done that on their own, but they would (obviously) have needed to move the machines for that, and turn them off. Ah . But yes, the guy with the insecticide has the key to everywhere, and is probably a gardener. I've seen him at it. He sprays all the corners of the corridors, along the edge of the wall and floor, then does the same inside the rooms. The point is that most foul-ups are created by the humans, whether technoid or gardenoid, or hole-diggeroid. > By your description you could almost be the guy the joke with the recurring 7 > o'clock system crash is about (where the cleaning lady unplugs the server > every morning in order to plug in her vacuum cleaner) ;-) Oh, the cleaning ladies do their share of damage. They are required BY LAW to clean the keyboards. They do so by picking them up in their left hand at the lower left corner, and rubbing a rag over them. Their left hand is where the ctl and alt keys are. Solution is not to leave keyboard in the room. Use a whaddyamacallit switch and attach one keyboard to that whenever one needs to access anything.. Also use thwapping great power cables one inch thck that they cannot move. And I won't mention the learning episodes with the linux debugger monitor activated by pressing "pause". Once I watched the lady cleaning my office. She SPRAYED the back of the monitor! I YELPED! I tried to explain to her about voltages, and said that she would't clean her tv at home that way - oh yes she did! > > Yes, you can try and keep these systems out of harms way on a > > colocation site, or something, but by then you are at professional > > level paranoia. For "home systems", whole system failures are far more > > common than disk failures. > > Don't agree. You may not agree, but you would be rather wrong in persisting in that idea in face of evidence that you can easily accumulate yourself, like the figures I randomly checked above. > Not only do disk failures occur more often than full system > failures, No they don't - by about 12 to 1. > disk failures are also much more time-consuming to recover from. No they aren't - we just put in another one, and copy the standard image over it (or in the case of a server, copy its twin, but then servers don't blow disks all that often, but when they do they blow ALL of them as well, as whatever blew one will blow the others in due course - likely heat). > Compare changing a system board or PSU with changing a drive and finding, > copying and verifying a backup (if you even have one that's 100% up to date) We have. For one thing we have identical pairs of servers, abslutely equal, md5summed and checked. The idenity-dependent scripts on them check who they are on and do the approprate thing depending on who they find they are on. 
And all the clients are the same, as clients. Checked daily. > > > ** In a computer room with about 20 Unix systems, in 1 year I have seen > > > 10 or so disk failures and no other failures. > > > > Well, let's see. If each system has 2 disks, then that would be 25% per > > disk per year, which I would say indicates low quality IDE disks, but > > is about the level I would agree with as experiential. > > The point here was, disk failures being more common than other failures... But they aren't. If you have only 25% chance of failure per disk per year, then that makes system outages much more likely, since they happen at about one per month (here!). If it isn't faulty scsi cables, it will be overheating cpus. Dust in the very dry air here kills all fan bearings within 6 months to one year. My defense against that is to heavily underclock all machines. > > > No way! I hate tapes. I backup to other disks. > > Then for your sake, I hope they're kept offline, in a safe. No, they're kept online. Why? What would be the point of having them in a safe? Then they'd be unavailable! The general scheme is that sites cross-backup each other. > > > > ** My computer room is for development and testing, no customer access. > > > > Unfortunately, the admins do most of the sabotage. > > Change admins. Can't. They're as good as they get. Hey, *I* even do the sabotage sometimes. I'm probably only abut 99% accurate, and I can certainly write a hundred commands in a day. > I could understand an admin making typing errors and such, but then again that > would not usually lead to a total system failure. Of course it would. You try working remotely to upgrade the sshd and finally killing off the old one, only to discover that you kill the wrong one and lock yourself out, while the deadman script on the server tries fruitlessly to restart a misconfigured server, and then finally decides after an hour to give up and reboot as a last resort, then can't bring itself back up because of something else you did that you were intending to finish but didn't get the opportunity to. > Some daemon not working, > sure. Good admins review or test their changes, And sometimes miss the problem. > for one thing, and in most > cases any such mistake is rectified much simpler and faster than a failed > disk anyway. Except maybe for lilo errors with no boot media available. ;-\ Well, you can go out to the site in the middle of the night to reboot! Changes are made out of working hours so as not to disturb the users. > > Yes you did. You can see from the quoting that you did. > > Or the quoting got messed up. That is known to happen in threads. Shrug. > > > but it may be more current than 1 or > > > more of the other disks. But this would be similar to what would happen > > > to a non-RAID disk (some data not written). > > > > No, it would not be similar. You don't seem to understand the > > mechanism. The mechanism for corruption is that there are two different > > versions of the data available when the system comes back up, and you > > and the raid system don't know which is more correct. Or even what it > > means to be "correct". Maybe the earlier written data is "correct"! > > That is not the whole truth. To be fair, the mechanism works like this: > With raid, you have a 50% chance the wrong, corrupted, data is used. > Without raid, thus only having a single disk, the chance of using the > corrupted data is 100% (obviously, since there is only one source) That is one particular spin on it. 
> Or, much more elaborate: > > Let's assume the chance of a disk corruption occurring is 50%, ie. 0.5 There's no need to. Call it "p". > With raid, you always have a 50% chance of reading faultty data IF one of the > drives holds faulty data. That is, the probability of corruption occuring and it THEN being detected is 0.5 . However, the probabilty that it occurred is 2p, not p, since there are two disks (forget the tiny p^2 possibility). So we have p = probability of corruption occuring AND it being detected. > For the drives itself, the chance of both disks > being wrong is 0.5x0.5=0.25(scenario A). Similarly, 25 % chance both disks > are good (scenario B). The chance of one of the disks being wrong is 50% > (scenarios C & D together). In scenarios A & B the outcome is certain. In > scenarios C & D the chance of the raid choosing the false mirror is 50%. > Accumulating those chances one can say that the chance of reading false data > is: > in scenario A: 100% p^2 > in scenario B: 0% 0 > scenario C: 50% 0.5p > scenario D: 50% 0.5p > Doing the math, the outcome is still (200% divided by four)= 50%. Well, it's p + p^2. But I said to neglect the square term. > Ergo: the same as with a single disk. No change. Except that it is not the case. With a single disk you are CERTAIN to detect the problem (if it is detectable) when you run the fsck at reboot. With a RAID1 mirror you are only 50% likely to detect the detectable problem, because you may choose to read the "wrong" (correct :) disk at the crucial point in the fsck. Then you have to hope that the right disk fails next, when it fails, or else you will be left holding the detectably wrong, unchecked data. So in the scenario of a single detectable corruption: A: probability of a detectable error occuring and NOT being detected on a single disk system is zero B: probability of a detectable error occuring and NOT being detected on a two disk system is p Cute, no? You could have deduced that from your figures too, but you were all fired up about the question of a detectable error occurring AND being detected to think about it occuring AND NOT being detected. Even though that is what interests us! "silent corruption". > > > In contrast, on a single disk they have a 100% chance of detection (if > > > you look!) and a 100% chance of occuring, wrt normal rate. > > > ** Are you talking about the disk drive detecting the error? > > No, you have a zero chance of detection, since there is nothing to compare TO. That is not the case. You have every chance in the world of detecting it - you know what fsck does. If you like we can consider detectable and indetectable errors separtely. > Raid-1 at least gives you a 50/50 chance to choose the right data. With a > single disk, the chance of reusing the corrupted data is 100% and there is no > mechanism to detect the odd 'tumbled bit' at all. False. > > You wouldn't necesarily know which of the two data sources was > > "correct". > > No, but you have a theoretical choice, and a 50% chance of being right. > Not so without raid, where you get no choice, and a 100% chance of getting the > wrong data, in the case of a corruption. Review the calculation. Peter - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
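The disagreement over "1200% per year" in the message above is mostly about failure rates versus failure probabilities. The snippet below is only an illustration under a Poisson/independence assumption that neither correspondent states; the rate figures are the ones quoted in the exchange.

from math import exp

def p_at_least_one(rate_per_year):
    """Probability of at least one event in a year, given an expected
    count (Poisson assumption - an illustration, not the thread's claim)."""
    return 1 - exp(-rate_per_year)

system_rate = 12.0   # "1200% per year" read as an expected count of 12 failures
disk_rate = 0.10     # "10% per year" read as an expected count of 0.1 failures

print(p_at_least_one(system_rate))   # ~0.999994: some system failure is near certain
print(p_at_least_one(disk_rate))     # ~0.095: a given disk probably survives the year

In other words, an expected count can exceed 1 (or "100%") without any single probability doing so.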
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 20:22 ` Peter T. Breuer 2005-01-03 23:05 ` Guy @ 2005-01-04 0:08 ` maarten 1 sibling, 0 replies; 172+ messages in thread From: maarten @ 2005-01-04 0:08 UTC (permalink / raw) To: linux-raid On Monday 03 January 2005 21:22, Peter T. Breuer wrote: > maarten <maarten@ultratux.net> wrote: > > The chance of a PSU blowing up or lightning striking is, reasonably, much > > less than an isolated disk failure. If this simple fact is not true for > > you > > Oh? We have about 20 a year. Maybe three of them are planned. But > those are the worst ones! - the electrical department's method of > "testing" the lines is to switch off the rails then pulse them up and > down. Surge tests or something. When we can we switch everything off > beforehand. But then we also get to deal with the amateur contributions > from the city power people. It goes on and on below, but this your first paragraph is already striking(!) You actually say that the planned outages are worse than the others! OMG. Who taught you how to plan ? Isn't planning the act of anticipating things, and acting accordingly so as to minimize the impact ? So your planning is so bad that the planned maintenance is actually worse than the impromptu outages. I... I am speechless. Really. You take the cake. But from the rest of your post it also seems you define a "total system failure" as something entirely different as the rest of us (presumably). You count either planned or unplanned outages as failures, whereas most of us would call that downtime, not system failure, let alone "total". If you have a problematic UPS system, or mentally challenged UPS engineers, that does not constitute a failure IN YOUR server. Same for a broken network. Total system failures is where the single computer system we're focussing on goes down or is unresponsive. You can't say "your server" is down when all that is happening is someone pulled the UTP from your remote console...! > Yes, my PhD is in electrical engineering. Have I sent them sarcastic > letters explaining how to test lines using a dummy load? Yes. Does the > physics department also want to place them in a vat of slowly reheating > liquid nitrogen? Yes. Does it make any difference? No. I don't know what you're on about, nor do I really care. I repeat: your UPS or powercompany failing does not constitute a _server_ failure. It is downtime. Downtime != system failure, (although the reverse obviously is). We shall forthwith define a system failure as a state where there are _repairs_ neccessary to the server, for it to start working again. Not just the reconnection of mains plugs. Okay with that ? > > I don't understand your math. For one, percentage is measured from 0 to > > 100, > > No, it's measured from 0 to infinity. Occasionally from negative > infinity to positive infinity. Did I mention that I have two degrees in > pure mathematics? We can discuss nonstandard interpretations of Peano's > axioms then. Sigh. Look up what "per cent" means (it's Latin). Also, since you seem to pride yourself on your leet math skills, remember that your professor said that chance can be between 0 (false) and 1 (true). Two or 12 cannot be an outcome of any probability calculation. > > But besides that, I'd wager that from your list number (3) has, by far, > > the smallest chance of occurring. 
> > Except of course, that you would lose, since not only did I SAY that it > had the highest chance, but I gave a numerical estimate for it that is > 120 times as high as that I gave for (1). Then your data center cannot seriously call itself that. Or your staff cannot call themselves capable. Choose whatever suits you. 12 outages a year... Bwaaah. Even a random home windows box has less outages than that(!). > > Choosing between (1) and (2) is more difficult, > > Well, I said it doesn't matter, because everything is swamped by (3). Which I disagreed with. I stated (3) is normally the _least_ likely. > > my experiences with IDE disks are definitely that it will take the system > > down, but that is very biased since I always used non-mirrored swap. > > It's the same principle. There exists a common mode for failure. > Bayesian calculations then tell you that there is a strong liklihood of > the whole system coming down in conjunction with the disk coming down. Nope, there isn't. Bayesian or not, hotswap drives on hardware raid cards prove you wrong, day in day out. So either you're talking about linux with md specifically, or you should wake up and smell the coffee. > > > Not in my experience. See above. I'd say each disk has about a 10% > > > failure expectation per year. Whereas I can guarantee that an > > > unexpected system failure will occur about once a month, on every > > > important system. > > There you are. I said it again. You quote yourself and you agree with that. Now why doesn't that surprise me ? > Hey, I even took down my own home server by accident over new year! > Spoiled its 222 day uptime. Your user error hardly counts as total system failure, don't you think ? > > I would not be alone in thinking that figure is VERY high. My uptimes > > It isn't. A random look at servers tells me: > > bajo up 77+00:23, 1 user, load 0.28, 0.39, 0.48 > balafon up 25+08:30, 0 users, load 0.47, 0.14, 0.05 > dino up 77+01:15, 0 users, load 0.00, 0.00, 0.00 > guitarra up 19+02:15, 0 users, load 0.20, 0.07, 0.04 > itserv up 77+11:31, 0 users, load 0.01, 0.02, 0.01 > itserv2 up 20+00:40, 1 user, load 0.05, 0.13, 0.16 > lmserv up 77+11:32, 0 users, load 0.34, 0.13, 0.08 > lmserv2 up 20+00:49, 1 user, load 0.14, 0.20, 0.23 > nbd up 24+04:12, 0 users, load 0.08, 0.08, 0.02 > oboe up 77+02:39, 3 users, load 0.00, 0.00, 0.00 > piano up 77+11:55, 0 users, load 0.00, 0.00, 0.00 > trombon up 24+08:14, 2 users, load 0.00, 0.00, 0.00 > violin up 77+12:00, 4 users, load 0.00, 0.00, 0.00 > xilofon up 73+01:08, 0 users, load 0.00, 0.00, 0.00 > xml up 33+02:29, 5 users, load 0.60, 0.64, 0.67 > > (one net). Looks like a major power outage 77 days ago, and a smaller > event 24 and 20 days ago. The event at 20 days ago looks like > sysadmins. Both Trombon and Nbd survived it and tey're on separate > (different) UPSs. The servers which are up 77 days are on a huge UPS > that Lmserv2 and Itserv2 should also be on, as far as I know. So > somebody took them off the UPS wihin ten minutes of each other. Looks > like maintenance moving racks. Okay, once again: your loss of power has nothing to do with a server failure. You can't say that your engine died and needs repair just because you forgot to fill the gas tank. You just add gas and away you go. No repair. No damage. Just downtime. Inconvenient as it may be, but that is not relevant. > Well, they have no chance to be here. There are several planned power > outs a year for the electrical department to do their silly tricks > with. 
When that happens they take the weekend over it. First off, since that is planned, it is _your_ job to be there beforehand and properly shutdown all those systems proir to losing the power. Secondly, reevaluate your UPS setup...!!! How is it even possible we're discussing such obvious measures. UPS'es are there for a reason. If your upstream UPS systems are unreliable, then add your own UPSes, one per server if need be. It really isn't rocket science... > > If you have building maintenance people and other random staff that can > > access your server room unattended and unmonitored, you have far worse > > problems than making decicions about raid lavels. IMNSHO. > > Oh, they most certainly can't access the server rooms. The techs would > have done that on their own, but they would (obviously) have needed to > move the machines for that, and turn them off. Ah . But yes, the guy > with the insecticide has the key to everywhere, and is probably a > gardener. I've seen him at it. He sprays all the corners of the > corridors, along the edge of the wall and floor, then does the same > inside the rooms. Oh really. Nice. Do you even realize that since your gardener or whatever can access everything, and will spray stuff around indiscriminately, he could very well incinerate your server room (or the whole building for that matter) It's really very simple. You tell him that he has two options: A) He agrees to only enter the server rooms in case of immediate emergency and will refrain from entering the room without your supervision in all other cases. You let him sign a paper stating as much. or B) You will change the lock on the server room thus disallowing all access. You agree you will personally carry out all 'maintenance' in that room. > The point is that most foul-ups are created by the humans, whether > technoid or gardenoid, or hole-diggeroid. And that is exactly why you should make sure their access is limited ! > > By your description you could almost be the guy the joke with the > > recurring 7 o'clock system crash is about (where the cleaning lady > > unplugs the server every morning in order to plug in her vacuum cleaner) > > ;-) > > Oh, the cleaning ladies do their share of damage. They are required BY > LAW to clean the keyboards. They do so by picking them up in their left > hand at the lower left corner, and rubbing a rag over them. Whoa, what special country are you at ? In my neck of the woods, I can disallow any and all cleaning if I deem it is hazardous to the cleaner and / or the equipment. Next, you'll start telling me that they clean your backup tapes and/or enclosures with a rag and soap and that you are required by law to grant them that right...? Do you think they have cleaning crews in nuclear facilities ? If so, do you think they are allowed (by law, no less) to go even near the control panels that regulate the reactor process ? (nope, I didn't think you did) > Their left hand is where the ctl and alt keys are. > > Solution is not to leave keyboard in the room. Use a whaddyamacallit > switch and attach one keyboard to that whenever one needs to access > anything.. Also use thwapping great power cables one inch thck that > they cannot move. Oh my. Oh my. Oh my. I cannot believe you. Have you ever heard of locking the console, perhaps ?!? You know, the state where nothing else than typing your password will do anything ? You can do that _most_certainly_ with KVM switches, in case your OS is too stubborn to disregard the various three finger combinations we all know. 
> And I won't mention the learning episodes with the linux debugger monitor > activated by pressing "pause". man xlock. man vlock. djeez... is this newbie time now ? > Once I watched the lady cleaning my office. She SPRAYED the back of the > monitor! I YELPED! I tried to explain to her about voltages, and said > that she would't clean her tv at home that way - oh yes she did! Exactly my point. My suggestion to you (if simply explaining doesn't help): Call the cleaner over to an old unused 14" CRT. Spray a lot of water-based, or better, flammable stuff into and onto the back of it. Wait for the smoke or the sparks to come flying...! stand back and enjoy. ;-) > You may not agree, but you would be rather wrong in persisting in that > idea in face of evidence that you can easily accumulate yourself, like > the figures I randomly checked above. Nope. However, I will admit that -in light of everything you said- your environment is very unsafe, very unreliable and frankly just unfit to house a data center worth its name. I'm sure others will agree with me. You can't just go around saying that 12 power outages per year are _normal_ and expected. You can't pretend something very very wrong is going on at your site. I've experienced 1 (count 'em: one) power outage in our last colo in over four years, and boy did my management give them (the colo facility) hell over it ! > > Not only do disk failures occur more often than full system > > failures, > > No they don't - by about 12 to 1. Only in your world, yes. > > disk failures are also much more time-consuming to recover from. > > No they aren't - we just put in another one, and copy the standard > image over it (or in the case of a server, copy its twin, but then > servers don't blow disks all that often, but when they do they blow > ALL of them as well, as whatever blew one will blow the others in due > course - likely heat). If you had used a colo, you wouldn't have dust lead to a premature fan failure (in my experience). There is no smoking in colo facilities expressly for that reason (and the fire hazard, obviously). But even then, you could remotely monitor the fan health, and /or the temperature. I still stand by my statement: disks are more time consuming than other failures to repair. Motherboards don't need data being restored to them. Much less finding out how complete the data backup was, and verifying that all works again as expected. > > Compare changing a system board or PSU with changing a drive and finding, > > copying and verifying a backup (if you even have one that's 100% up to > > date) > > We have. For one thing we have identical pairs of servers, abslutely > equal, md5summed and checked. The idenity-dependent scripts on them > check who they are on and do the approprate thing depending on who they > find they are on. Good for you. Well planned. It just amazes me now more than ever that the rest of the setup seems so broken / unstable. On the other hand, with 12 power outages yearly, you most definitely need two redundant servers. > > The point here was, disk failures being more common than other > > failures... > > But they aren't. If you have only 25% chance of failure per disk per > year, then that makes system outages much more likely, since they > happen at about one per month (here!). With the emphasis on your word "(here!)", yes. > If it isn't faulty scsi cables, it will be overheating cpus. Dust in > the very dry air here kills all fan bearings within 6 months to one > year. 
Colo facilities have a strict no smoking rule, and air filters to clean what enters. I can guarantee you that a good fan in a good colo will live 4++ years. Excuse me but dry air, my ***. Clean air is not dependent on dryness. It is dependent on cleanness. > My defense against that is to heavily underclock all machines. Um, yeah. Great thinking. Do you underclock the PSU also, and the disks ? Maybe you could run a scsi 15000 rpm drive at 10000, see what that gives ? Sorry for getting overly sarcastic here, but there really is no end to the stupidities, is there ? > > > No way! I hate tapes. I backup to other disks. > > > > Then for your sake, I hope they're kept offline, in a safe. > > No, they're kept online. Why? What would be the point of having them in > a safe? Then they'd be unavailable! I'll give you a few pointers then: If your disks are online instead of in a safe, they are vulnerable to: * Intrusions / viruses * User / admin error (you yourself stated how often this happens!) * Fire * Lightning strike * Theft > > Change admins. > > Can't. They're as good as they get. Hey, *I* even do the sabotage > sometimes. I'm probably only abut 99% accurate, and I can certainly > write a hundred commands in a day. Every admin makes mistakes. But most see it before it has dire consequences. > > I could understand an admin making typing errors and such, but then again > > that would not usually lead to a total system failure. > > Of course it would. You try working remotely to upgrade the sshd and > finally killing off the old one, only to discover that you kill the > wrong one and lock yourself out, while the deadman script on the server Yes, been there done that... > tries fruitlessly to restart a misconfigured server, and then finally > decides after an hour to give up and reboot as a last resort, then > can't bring itself back up because of something else you did that you > were intending to finish but didn't get the opportunity to. This will happen only once (if you're good), maybe twice (if you're adequate) but if it happens to you three times or more, then you need to find a different line of work, or start drinking less and paying more attention at your work. I'm not kidding. The good admin is not he who never makes mistakes, but he who (quickly) learns from it. > > Some daemon not working, > > sure. Good admins review or test their changes, > > And sometimes miss the problem. Yes, but apache not restarting due to a typo hardly constitutes a system failure. Come on now! > > for one thing, and in most > > cases any such mistake is rectified much simpler and faster than a failed > > disk anyway. Except maybe for lilo errors with no boot media available. > > ;-\ > > Well, you can go out to the site in the middle of the night to reboot! > Changes are made out of working hours so as not to disturb the users. Sometimes, depending on the SLA the client has. In any case, I do tend to schedule complex, error-prone work for when I am at the console. Look, any way you want to turn it, messing with reconfiguring bootmanagers when not at the console is asking for trouble. If you have no other recourse, test it first with a local machine with the exact same setup. For instance, I learned from my sshd error to always start a second sshd on port 2222 prior to killing off the main one. You could also have a 'screen' session running with a sleep 600 followed by some rescue command. Be creative. Be cautious (or paranoid). Learn. > > That is not the whole truth. 
> > To be fair, the mechanism works like this:
> > With raid, you have a 50% chance the wrong, corrupted, data is used.
> > Without raid, thus only having a single disk, the chance of using the
> > corrupted data is 100% (obviously, since there is only one source)
>
> That is one particular spin on it.

It is _the_ spin on it.

> > Ergo: the same as with a single disk. No change.
>
> Except that it is not the case. With a single disk you are CERTAIN to
> detect the problem (if it is detectable) when you run the fsck at
> reboot. With a RAID1 mirror you are only 50% likely to detect the
> detectable problem, because you may choose to read the "wrong" (correct
> :) disk at the crucial point in the fsck. Then you have to hope that
> the right disk fails next, when it fails, or else you will be left holding
> the detectably wrong, unchecked data.

First off, fsck doesn't usually run at reboot. Just the journal is
replayed. Only when severe errors are found will there be a forced
fsck. You're not telling me that you fsck your 600 gigabyte arrays upon
each reboot, yes? It will give you multiple hours of added downtime if
you do.

Secondly, if you _are_ so paranoid about it that you indeed do an fsck,
what is keeping you from breaking the mirror, fscking the underlying
physical devices and reassembling if all is okay? Added benefit: if all
is not well, you get to choose which half of the mirror you decide to
keep. Problem solved.

And third, I am not too convinced the error detection is able to detect
all errors. For starters, if a crash occurred while disk one was
completely written but disk two had not yet begun, both checksums would
be correct, so no fsck would notice. Secondly, I doubt that the checksum
mechanism is that good. It's just a trivial checksum, it's bound to
overlook some errors.

And finally: if you would indeed end up with the "detectably wrong,
unchecked data", you can still run an fsck on it, just as with the
single disk. The fsck will repair it (or not), just as with the single
disk you would've had. In any case, seeing as you do 12 reboots a year
:-P the chances are very, very slim that you hit the wrong ("right")
half of the disk at all those 12 times, so you'll surely notice the
corruption at some point.

Note that despite all this I am all for an enhancement to mdadm
providing a method to check the parity for correctness. But this is
beside the point.

> > No, you have a zero chance of detection, since there is nothing to
> > compare TO.
>
> That is not the case. You have every chance in the world of detecting
> it - you know what fsck does.

Well, when have you last fsck'ed a terabyte-size array without an
immediate need for it? I know I haven't -> my time is too valuable to
wait half a day, or more, for that fsck to finish.

Maarten

^ permalink raw reply	[flat|nested] 172+ messages in thread
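The mirror-consistency check wished for above can be approximated offline without any mdadm support. The sketch below is only a rough illustration: it assumes the RAID1 array is stopped, the two member device paths passed on the command line are hypothetical, and with the 0.90-format metadata of this era the md superblock near the end of each partition will legitimately differ (as may never-written areas), so a mismatch is a prompt for investigation rather than proof of corruption.

#!/usr/bin/env python
# Rough offline comparison of two RAID1 members, chunk by chunk.
# Illustration only: assumes the array is stopped and that the device
# paths given on the command line (e.g. /dev/hda1 /dev/hdc1, hypothetical)
# are its two members.  Needs read access to the raw devices (usually root).
import sys

CHUNK = 1 << 20  # compare 1 MiB at a time

def compare(dev_a, dev_b):
    mismatches = 0
    offset = 0
    with open(dev_a, "rb") as a, open(dev_b, "rb") as b:
        while True:
            block_a = a.read(CHUNK)
            block_b = b.read(CHUNK)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                mismatches += 1
                print("chunk at byte offset %d differs" % offset)
            offset += CHUNK
    return mismatches

if __name__ == "__main__":
    diffs = compare(sys.argv[1], sys.argv[2])
    print("%d differing chunk(s)" % diffs)

A byte-for-byte comparison is deliberately crude; it trades speed and precision for needing nothing beyond read access to the raw devices.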
* RE: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) 2005-01-03 17:46 ` maarten 2005-01-03 19:52 ` maarten 2005-01-03 20:22 ` Peter T. Breuer @ 2005-01-03 21:36 ` Guy 2005-01-04 0:15 ` maarten 2 siblings, 1 reply; 172+ messages in thread From: Guy @ 2005-01-03 21:36 UTC (permalink / raw) To: 'maarten', linux-raid Maarten said: "Doing the math, the outcome is still (200% divided by four)= 50%. Ergo: the same as with a single disk. No change." Guy said: "I bet a non-mirror disk has similar risk as a RAID1." Guy and Maarten agree, but Maarten does a better job of explaining it! :) I also agree with most of what Maarten said below, but not mirroring swap??? Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of maarten Sent: Monday, January 03, 2005 12:47 PM To: linux-raid@vger.kernel.org Subject: Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) On Monday 03 January 2005 12:31, Peter T. Breuer wrote: > Guy <bugzilla@watkins-home.com> wrote: > > "Also sprach Guy:" > 1) lightning strikes rails, or a/c goes out and room full of servers > overheats. All lights go off. > > 2) when sysadmin arrives to sort out the smoking wrecks, he finds > that 1 in 3 random disks are fried - they're simply the points > of failure that died first, and they took down the hardware with > them. > > 3) sysadmin buys or jury-rigs enough pieces of nonsmoking hardware > to piece together the raid arrays from the surviving disks, and > hastily does a copy to somewhere very safe and distant, while > an assistant holds off howling hordes outside the door with > a shutgun. > > In this scenario, a disk simply acts as the weakest link in a fuse > chain, and the whole chain goes down. But despite my dramatisation it > is likely that a hardware failure will take out or damage your hardware! > Ide disks live on an electric bus conected to other hardware. Try a > shortcircuit and see what happens. You can't even yank them out while > the bus is operating if you want to keep your insurance policy. The chance of a PSU blowing up or lightning striking is, reasonably, much less than an isolated disk failure. If this simple fact is not true for you personally, you really ought to reevaluate the quality of your PSU (et al) and / or the buildings' defenses against a lightning strike... > However, I don't see how you can expect to replace a failed disk > without taking down the system. For that reason you are expected to be > running "spare disks" that you can virtually insert hot into the array > (caveat, it is possible with scsi, but you will need to rescan the bus, > which will take it out of commission for some seconds, which may > require you to take the bus offline first, and it MAY be possible with > recent IDE buses that purport to support hotswap - I don't know). I think the point is not what actions one has to take at time T+1 to replace the disk, but rather whether at time T, when the failure first occurs, the system survives the failure or not. > (1) how likely is it that a disk will fail without taking down the system > (2) how likely is it that a disk will fail > (3) how likely is it that a whole system will fail > > I would say that (2) is about 10% per year. I would say that (3) is > about 1200% per year. It is therefore difficult to calculate (1), which > is your protection scenario, since it doesn't show up very often in the > stats! I don't understand your math. 
For one, percentage is measured from 0 to 100, not from 0 to 1200. What is that, 12 twelve times 'absolute certainty' that something will occur ? But besides that, I'd wager that from your list number (3) has, by far, the smallest chance of occurring. Choosing between (1) and (2) is more difficult, my experiences with IDE disks are definitely that it will take the system down, but that is very biased since I always used non-mirrored swap. I sure can understand a system dying if it loses part of its memory... > > ** A disk failing is the most common failure a system can have (IMO). I fully agree. > Not in my experience. See above. I'd say each disk has about a 10% > failure expectation per year. Whereas I can guarantee that an > unexpected system failure will occur about once a month, on every > important system. Whoa ! What are you running, windows perhaps ?!? ;-) No but seriously, joking aside, you have 12 system failures per year ? I would not be alone in thinking that figure is VERY high. My uptimes generally are in the three-digit range, and most *certainly* not in the low 2-digit range. > If you think about it that is quite likely, since a system is by > definition a complicated thing. And then it is subject to all kinds of > horrible outside influences, like people rewiring the server room in > order to reroute cables under the floor instead of through he ceiling, > and the maintenance people spraying the building with insecticide, > everywhere, or just "turning off the electricity in order to test it" > (that happens about four times a year here - hey, I remember when they > tested the giant UPS by turning off the electricity! Wrong switch. > Bummer). If you have building maintenance people and other random staff that can access your server room unattended and unmonitored, you have far worse problems than making decicions about raid lavels. IMNSHO. By your description you could almost be the guy the joke with the recurring 7 o'clock system crash is about (where the cleaning lady unplugs the server every morning in order to plug in her vacuum cleaner) ;-) > Yes, you can try and keep these systems out of harms way on a > colocation site, or something, but by then you are at professional > level paranoia. For "home systems", whole system failures are far more > common than disk failures. Don't agree. Not only do disk failures occur more often than full system failures, disk failures are also much more time-consuming to recover from. Compare changing a system board or PSU with changing a drive and finding, copying and verifying a backup (if you even have one that's 100% up to date) > > ** In a computer room with about 20 Unix systems, in 1 year I have seen > > 10 or so disk failures and no other failures. > > Well, let's see. If each system has 2 disks, then that would be 25% per > disk per year, which I would say indicates low quality IDE disks, but > is about the level I would agree with as experiential. The point here was, disk failures being more common than other failures... > No way! I hate tapes. I backup to other disks. Then for your sake, I hope they're kept offline, in a safe. > > ** My computer room is for development and testing, no customer access. > > Unfortunately, the admins do most of the sabotage. Change admins. I could understand an admin making typing errors and such, but then again that would not usually lead to a total system failure. Some daemon not working, sure. 
Good admins review or test their changes, for one thing, and in most cases any such mistake is rectified much simpler and faster than a failed disk anyway. Except maybe for lilo errors with no boot media available. ;-\ > Yes you did. You can see from the quoting that you did. Or the quoting got messed up. That is known to happen in threads. > > but it may be more current than 1 or > > more of the other disks. But this would be similar to what would happen > > to a non-RAID disk (some data not written). > > No, it would not be similar. You don't seem to understand the > mechanism. The mechanism for corruption is that there are two different > versions of the data available when the system comes back up, and you > and the raid system don't know which is more correct. Or even what it > means to be "correct". Maybe the earlier written data is "correct"! That is not the whole truth. To be fair, the mechanism works like this: With raid, you have a 50% chance the wrong, corrupted, data is used. Without raid, thus only having a single disk, the chance of using the corrupted data is 100% (obviously, since there is only one source) Or, much more elaborate: Let's assume the chance of a disk corruption occurring is 50%, ie. 0.5 With raid, you always have a 50% chance of reading faultty data IF one of the drives holds faulty data. For the drives itself, the chance of both disks being wrong is 0.5x0.5=0.25(scenario A). Similarly, 25 % chance both disks are good (scenario B). The chance of one of the disks being wrong is 50% (scenarios C & D together). In scenarios A & B the outcome is certain. In scenarios C & D the chance of the raid choosing the false mirror is 50%. Accumulating those chances one can say that the chance of reading false data is: in scenario A: 100% in scenario B: 0% scenario C: 50% scenario D: 50% Doing the math, the outcome is still (200% divided by four)= 50%. Ergo: the same as with a single disk. No change. > > In contrast, on a single disk they have a 100% chance of detection (if > > you look!) and a 100% chance of occuring, wrt normal rate. > > ** Are you talking about the disk drive detecting the error? No, you have a zero chance of detection, since there is nothing to compare TO. Raid-1 at least gives you a 50/50 chance to choose the right data. With a single disk, the chance of reusing the corrupted data is 100% and there is no mechanism to detect the odd 'tumbled bit' at all. > > How? > > ** Compare the 2 halves or the RAID1, or check the parity of RAID5. > > You wouldn't necesarily know which of the two data sources was > "correct". No, but you have a theoretical choice, and a 50% chance of being right. Not so without raid, where you get no choice, and a 100% chance of getting the wrong data, in the case of a corruption. Maarten -- - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 172+ messages in thread
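Maarten's four scenarios at the end of the quoted message can be worked exactly for an arbitrary per-disk corruption probability p instead of 0.5. The few lines below just tabulate them under the same independence assumption the text makes; nothing here is specific to md.

def chance_of_reading_false_data(p):
    """Maarten's scenarios A-D with a general per-disk corruption probability p."""
    a = p * p              # A: both copies bad  -> the read is bad for certain
    b = (1 - p) ** 2       # B: both copies good -> the read is good for certain
    c_d = 2 * p * (1 - p)  # C+D: exactly one bad -> 50% chance the bad copy is read
    return a * 1.0 + b * 0.0 + c_d * 0.5

for p in (0.5, 0.1, 0.01):
    print(p, chance_of_reading_false_data(p))

# p = 0.5 reproduces the 50% figure in the text; in general the terms sum to
# exactly p (p^2 + p(1-p) = p), i.e. the same as reading a single disk, and
# the "p + p^2" quoted in the reply is the same thing with the (1-p) factors dropped.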
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
2005-01-03 21:36 ` Guy
@ 2005-01-04 0:15 ` maarten
2005-01-04 11:21 ` Michael Tokarev
0 siblings, 1 reply; 172+ messages in thread
From: maarten @ 2005-01-04 0:15 UTC (permalink / raw)
To: linux-raid

On Monday 03 January 2005 22:36, Guy wrote:
> Maarten said:
> "Doing the math, the outcome is still (200% divided by four) = 50%.
> Ergo: the same as with a single disk. No change."
>
> Guy said:
> "I bet a non-mirror disk has similar risk as a RAID1."
>
> Guy and Maarten agree, but Maarten does a better job of explaining it! :)
>
> I also agree with most of what Maarten said below, but not mirroring
> swap???

Yeah... bad choice in hindsight.
But there once was a time, a long long time ago, when the software-raid
howto explicitly stated that running swap on raid was a bad idea, and
that by telling the kernel all swap partitions had the same priority,
the kernel itself would already 'raid' the swap, i.e. divide equally
between the swap spaces. I'm sure you can read it back somewhere.
Now we know better, and we realize that that will indeed load-balance
between the various swap partitions, but it will not provide redundancy
at all. Oh well, new insights huh? ;-)

Maarten

^ permalink raw reply	[flat|nested] 172+ messages in thread
* Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard)
2005-01-04 0:15 ` maarten
@ 2005-01-04 11:21 ` Michael Tokarev
0 siblings, 0 replies; 172+ messages in thread
From: Michael Tokarev @ 2005-01-04 11:21 UTC (permalink / raw)
To: linux-raid

maarten wrote:
> On Monday 03 January 2005 22:36, Guy wrote:
>
>>Maarten said:
>>"Doing the math, the outcome is still (200% divided by four) = 50%.
>>Ergo: the same as with a single disk. No change."
>>
>>Guy said:
>>"I bet a non-mirror disk has similar risk as a RAID1."
>>
>>Guy and Maarten agree, but Maarten does a better job of explaining it! :)
>>
>>I also agree with most of what Maarten said below, but not mirroring
>>swap???
>
>
> Yeah... bad choice in hindsight.
> But there once was a time, a long long time ago, when the software-raid
> howto explicitly stated that running swap on raid was a bad idea, and
> that by

In 2.2, and probably in early 2.4, there indeed was a problem with
having swap on a raid (md) array: "random" system lockups, especially
during array recovery. Those problems were fixed long ago. But I think
the howto in question is talking about something different...

> telling the kernel all swap partitions had the same priority, the
> kernel itself would already 'raid' the swap, i.e. divide equally
> between the swap spaces. I'm sure you can read it back somewhere.
> Now we know better, and we realize that that will indeed load-balance
> between the various swap partitions, but it will not provide redundancy
> at all. Oh well, new insights huh? ;-)

...that is, the howto is talking about a raid0 setup (striping), and
yes, there's no "r" in "raid0" really (but there IS an "anti-r", as a
raid0 array is LESS reliable than a single drive). That is to say:
instead of placing swap on a raid0 array, let the swap code itself
perform the striping - the swap code "knows better" about its needs.
This still applies to recent kernels.

But here, we aren't talking about *reliable* swap, we're talking about
*fast* swap (raid1 aka reliable vs raid0 aka fast). There's no code in
the swap subsystem to mirror swap space, but there IS such code in md.
Hence, if you want reliability, use raid1 arrays for swap space. At the
same time, if you want speed *too*, use multiple raid1 arrays with equal
priority as swap areas (dunno how current raid10 code compares to "swap
striping" on top of raid1 arrays, but that probably makes a very small
difference).

I.e., there is nothing wrong with the howto, which is talking about fast
swap (sure, it'd be good to mention reliability too), and nothing wrong
with having raid arrays as swap (especially now that the abovementioned
bugs have been fixed). Or, "nothing new"... ;)

I learned to place swap on raid1 arrays instead of striping it (as
suggested by the howto) the hard way, going through the full cycle of
recovering damaged data because the system got foobared after one
component (stripe) of the swap space was lost, and I don't want to
repeat that recovery again. ;)

> Maarten

/mjt

^ permalink raw reply	[flat|nested] 172+ messages in thread
Thread overview: 172+ messages
2004-12-30 0:31 PROBLEM: Kernel 2.6.10 crashing repeatedly and hard Georg C. F. Greve
2004-12-30 16:23 ` Georg C. F. Greve
2004-12-30 17:39 ` Peter T. Breuer
2004-12-30 17:53 ` Sandro Dentella
2004-12-30 18:31 ` Peter T. Breuer
2004-12-30 19:50 ` Michael Tokarev
[not found] ` <41D45C1F.5030307-XAri/EZa3C4vJsYlp49lxw@public.gmane.org>
2004-12-30 20:54 ` berk walker
2005-01-01 13:39 ` Helge Hafting
2004-12-30 21:39 ` Peter T. Breuer
2005-01-02 19:42 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Andy Smith
2005-01-02 20:18 ` Peter T. Breuer
2005-01-03 0:30 ` Andy Smith
2005-01-03 6:41 ` Neil Brown
2005-01-03 8:37 ` Peter T. Breuer
2005-01-03 8:03 ` Peter T. Breuer
2005-01-03 8:58 ` Guy
2005-01-03 10:18 ` Partiy error detection - was " Brad Campbell
2005-01-03 12:11 ` Michael Tokarev
2005-01-03 14:23 ` Peter T. Breuer
2005-01-03 18:30 ` maarten
2005-01-03 21:36 ` Michael Tokarev
2005-01-05 5:50 ` Debian Sarge mdadm raid 10 assembling at boot problem Roger Ellison
2005-01-05 13:41 ` Michael Tokarev
2005-01-05 13:57 ` [help] [I2O] Adaptec 2400A on FC3 Angelo Piraino
2005-01-05 19:15 ` Debian Sarge mdadm raid 10 assembling at boot problem Roger Ellison
2005-01-05 9:56 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Andy Smith
2005-01-05 10:44 ` Alvin Oga
2005-01-05 10:56 ` Brad Campbell
2005-01-05 11:39 ` Alvin Oga
2005-01-05 12:02 ` Brad Campbell
2005-01-05 13:23 ` Alvin Oga
2005-01-05 13:33 ` Brad Campbell
2005-01-05 14:44 ` parts -- " Alvin Oga
2005-01-19 4:46 ` Clemens Schwaighofer
2005-01-19 5:05 ` Alvin Oga
2005-01-19 5:49 ` Clemens Schwaighofer
2005-01-19 7:08 ` Alvin Oga
2005-01-05 13:36 ` Swap should be mirrored or not? (was Re: ext3 journal on software raid) Andy Smith
2005-01-05 14:12 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Erik Mouw
2005-01-05 14:37 ` Michael Tokarev
2005-01-05 14:55 ` errors " Alvin Oga
2005-01-05 17:11 ` Erik Mouw
2005-01-06 5:41 ` Brad Campbell
2005-01-05 15:17 ` Guy
2005-01-05 15:33 ` Alvin Oga
2005-01-05 16:22 ` Michael Tokarev
2005-01-05 17:23 ` Peter T. Breuer
2005-01-05 16:23 ` Andy Smith
2005-01-05 16:30 ` Andy Smith
2005-01-05 17:04 ` swp - " Alvin Oga
2005-01-05 17:26 ` Andy Smith
2005-01-05 18:32 ` Alvin Oga
2005-01-05 22:35 ` Andy Smith
2005-01-06 0:57 ` Guy
2005-01-06 1:28 ` Mike Hardy
2005-01-06 3:32 ` Guy
2005-01-06 4:49 ` Mike Hardy
2005-01-09 21:07 ` Mark Hahn
2005-01-06 5:04 ` Alvin Oga
2005-01-06 6:18 ` Guy
2005-01-06 6:31 ` Alvin Oga
2005-01-06 9:38 ` swap on RAID (was Re: swp - Re: ext3 journal on software raid) Andy Smith
2005-01-06 17:46 ` Mike Hardy
2005-01-06 22:08 ` No swap can be dangerous (was Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid)) Andrew Walrond
2005-01-06 22:34 ` Jesper Juhl
2005-01-06 22:57 ` Mike Hardy
2005-01-06 23:15 ` Guy
2005-01-07 9:28 ` Andrew Walrond
2005-02-28 20:07 ` Guy
2005-01-07 1:31 ` confused Re: swap on RAID (was Re: swp - Re: ext3 journal on software raid) Alvin Oga
2005-01-07 2:28 ` Andy Smith
2005-01-07 13:04 ` Alvin Oga
2005-01-09 21:21 ` swp - Re: ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Mark Hahn
2005-01-09 22:20 ` Alvin Oga
2005-01-06 5:01 ` Alvin Oga
2005-01-05 17:07 ` Guy
2005-01-05 17:21 ` Alvin Oga
2005-01-05 17:32 ` Guy
2005-01-05 18:37 ` Alvin Oga
2005-01-05 17:34 ` ECC: RE: ext3 blah blah blah Gordon Henderson
2005-01-05 18:33 ` Alvin Oga
2005-01-05 17:26 ` ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) David Greaves
2005-01-05 18:16 ` Peter T. Breuer
2005-01-05 18:28 ` Guy
2005-01-05 18:26 ` Guy
2005-01-05 15:48 ` Peter T. Breuer
2005-01-07 6:21 ` PROBLEM: Kernel 2.6.10 crashing repeatedly and hard Clemens Schwaighofer
2005-01-07 9:39 ` Andy Smith
-- strict thread matches above, loose matches on Subject: below --
2005-01-03 9:30 ext3 journal on software raid (was Re: PROBLEM: Kernel 2.6.10 crashing repeatedly and hard) Peter T. Breuer
[not found] <200501030916.j039Gqe23568@inv.it.uc3m.es>
2005-01-03 10:17 ` Guy
2005-01-03 11:31 ` Peter T. Breuer
2005-01-03 17:34 ` Guy
2005-01-03 17:46 ` maarten
2005-01-03 19:52 ` maarten
2005-01-03 20:41 ` Peter T. Breuer
2005-01-03 23:19 ` Peter T. Breuer
2005-01-03 23:46 ` Neil Brown
2005-01-04 0:28 ` Peter T. Breuer
2005-01-04 1:18 ` Alvin Oga
2005-01-04 4:29 ` Neil Brown
2005-01-04 8:43 ` Peter T. Breuer
2005-01-04 2:07 ` Neil Brown
2005-01-04 2:16 ` Ewan Grantham
2005-01-04 2:22 ` Neil Brown
2005-01-04 2:41 ` Andy Smith
2005-01-04 3:42 ` Neil Brown
2005-01-04 9:50 ` Peter T. Breuer
2005-01-04 14:15 ` David Greaves
2005-01-04 15:20 ` Peter T. Breuer
2005-01-04 16:42 ` Guy
2005-01-04 17:46 ` Peter T. Breuer
2005-01-04 9:30 ` Maarten
2005-01-04 10:18 ` Peter T. Breuer
2005-01-04 13:36 ` Maarten
2005-01-04 14:13 ` Peter T. Breuer
2005-01-04 19:22 ` maarten
2005-01-04 20:05 ` Peter T. Breuer
2005-01-04 21:38 ` Guy
2005-01-04 23:53 ` Peter T. Breuer
2005-01-05 0:58 ` Mikael Abrahamsson
2005-01-04 21:48 ` maarten
2005-01-04 23:14 ` Peter T. Breuer
2005-01-05 1:53 ` maarten
2005-01-04 9:46 ` Peter T. Breuer
2005-01-04 19:02 ` maarten
2005-01-04 19:12 ` David Greaves
2005-01-04 21:08 ` Peter T. Breuer
2005-01-04 22:02 ` Brad Campbell
2005-01-04 23:20 ` Peter T. Breuer
2005-01-05 5:44 ` Brad Campbell
2005-01-05 9:00 ` Peter T. Breuer
2005-01-05 9:14 ` Brad Campbell
2005-01-05 9:28 ` Peter T. Breuer
2005-01-05 9:43 ` Brad Campbell
2005-01-05 15:09 ` Guy
2005-01-05 15:52 ` maarten
2005-01-05 10:04 ` Andy Smith
2005-01-04 22:21 ` Neil Brown
2005-01-05 0:08 ` Peter T. Breuer
2005-01-04 22:29 ` Neil Brown
2005-01-05 0:19 ` Peter T. Breuer
2005-01-05 1:19 ` Jure Pe_ar
2005-01-05 2:29 ` Peter T. Breuer
2005-01-05 0:38 ` maarten
2005-01-04 9:40 ` Peter T. Breuer
2005-01-04 14:03 ` David Greaves
2005-01-04 14:07 ` Peter T. Breuer
2005-01-04 14:43 ` David Greaves
2005-01-04 15:12 ` Peter T. Breuer
2005-01-04 16:54 ` David Greaves
2005-01-04 17:42 ` Peter T. Breuer
2005-01-04 19:12 ` David Greaves
2005-01-04 0:45 ` maarten
2005-01-04 10:14 ` Peter T. Breuer
2005-01-04 13:24 ` Maarten
2005-01-04 14:05 ` Peter T. Breuer
2005-01-04 15:31 ` Maarten
2005-01-04 16:21 ` Peter T. Breuer
2005-01-04 20:55 ` maarten
2005-01-04 21:11 ` Peter T. Breuer
2005-01-04 21:38 ` Peter T. Breuer
2005-01-04 23:29 ` Guy
2005-01-04 19:57 ` Mikael Abrahamsson
2005-01-04 21:05 ` maarten
2005-01-04 21:26 ` Alvin Oga
2005-01-04 21:46 ` Guy
2005-01-03 20:22 ` Peter T. Breuer
2005-01-03 23:05 ` Guy
2005-01-04 0:08 ` maarten
2005-01-03 21:36 ` Guy
2005-01-04 0:15 ` maarten
2005-01-04 11:21 ` Michael Tokarev