* 1352 NUL bytes at the end of a page?
@ 2004-05-13 19:08 Andy Isaacson
2004-05-14 2:22 ` Andrew Morton
0 siblings, 1 reply; 27+ messages in thread
From: Andy Isaacson @ 2004-05-13 19:08 UTC (permalink / raw)
To: linux-kernel
We've got a user who's reporting BK problems which we've traced down to
the fact that his s.ChangeSet file has a hole, filled with '\0' bytes,
that's so far always 1352 bytes long, and the end is page-aligned. (In
fact, the two cases we've seen so far have been 8k-aligned.) The
correct file data picks up again after the hole.
bk is writing the data using stdio in a child process (fork, exec,
wait), then mmaping the result. The corruption is persistent; he sent
us the s.ChangeSet file and there it was (not a cache or buffering
problem, therefore).
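The write-then-map pattern described above can be sketched as follows. This is a minimal illustration and not BK's code: a Python subprocess stands in for the forked child, and the temporary file and payload are made up for the example. The parent waits for the child to exit before mapping, so it should see the complete file.

```python
import mmap
import os
import subprocess
import sys
import tempfile

def write_in_child_then_map(payload: bytes) -> bytes:
    """Have a child process write `payload` with buffered I/O, then mmap it."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Child: buffered write (the stdio analogue); closing the file flushes it.
        child_code = (
            "import sys\n"
            "with open(sys.argv[1], 'wb') as f:\n"
            "    f.write(sys.stdin.buffer.read())\n"
        )
        # run() returns only after the child exits; that wait is the only
        # synchronization between writer and reader in this sketch.
        subprocess.run([sys.executable, "-c", child_code, path],
                       input=payload, check=True)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return m[:]  # copy the mapped bytes out before unmapping
    finally:
        os.unlink(path)
```

If the mapped bytes ever differed from the payload after the child had exited, the difference would have to come from below the application, which is the situation reported here.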
2.6.6-bk (current head of tree from whenever he built), UP PIII, symptom
observed on both ext3 and reiserfs. (However, we've explicitly verified
the hole only on reiser.)
The problem is intermittent, having happened "several" times over the
last few months, and doesn't appear to be tied to any particular kernel
version.
To me, this looks awfully close to an Ethernet frame size... but that's
just a wild guess. And I don't think he's running Ethernet (still
waiting for dmesg and .config).
I've asked for more info, memtest86, and will attempt to reproduce it on
another box.
Does anyone have insight into this peculiar problem?
-andy
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-13 19:08 1352 NUL bytes at the end of a page? Andy Isaacson
@ 2004-05-14 2:22 ` Andrew Morton
0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2004-05-14 2:22 UTC (permalink / raw)
To: Andy Isaacson; +Cc: linux-kernel
Andy Isaacson <adi@bitmover.com> wrote:
>
> We've got a user who's reporting BK problems which we've traced down to
> the fact that his s.ChangeSet file has a hole, filled with '\0' bytes,
> that's so far always 1352 bytes long, and the end is page-aligned. (In
> fact, the two cases we've seen so far have been 8k-aligned.) The
> correct file data picks up again after the hole.
When the reporter has a PIII machine it's often useful to find out the clock
frequency - the lower it is, the older it is and the more likely it is that
some component has rotted.
If this one cannot be reproduced on any other machine I'd say it's a
hardware failure.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
@ 2004-05-14 4:32 Steven Cole
0 siblings, 0 replies; 27+ messages in thread
From: Steven Cole @ 2004-05-14 4:32 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andy Isaacson, Linux Kernel
Andrew Morton wrote:
>Andy Isaacson <adi@bitmover.com> wrote:
>>
>> We've got a user who's reporting BK problems which we've traced down to
>> the fact that his s.ChangeSet file has a hole, filled with '\0' bytes,
>> that's so far always 1352 bytes long, and the end is page-aligned. (In
>> fact, the two cases we've seen so far have been 8k-aligned.) The
>> correct file data picks up again after the hole.
>
>When the reporter has a PIII machine it's often useful to find out the clock
>frequency - the lower it is, the older it is and the more likely it is that
>some component has rotted.
>
>If this one cannot be reproduced on any other machine I'd say it's a
>hardware failure.
Hi Andrew,
The user is me. The machine is a 450 MHz P-III, about five years old now.
Andy mentioned ethernet, but I don't have that here, just 56k dialup. The
extra information he requested was sent a couple of hours ago, and in the
meantime I ran two full passes of memtest86 3.1 with zero errors.
<slight detour>
I do occasionally have problems with pppd, and the following message always
appears in /var/log/messages:
May 13 18:09:30 spc kernel: serial8250: too much work for irq10
May 13 18:09:30 spc kernel: serial8250: too much work for irq10
The message is always doubled as above. This has never yet occurred
at the same time as the bk failure, so the two seem unrelated. I have
to kill -9 the pppd process and reconnect when the above happens.
This problem never happened with a 2.4.x kernel, and was first detected
during the middle of 2.5.x development.
</slight detour>
The only reason the above might be relevant to the bk failures is that
I've only noticed the failures when pulling over the net via ppp.
I've never gotten the failure when pulling from another repository
on the same disk (I've only got one).
If you have any ideas about narrowing down the potentially rotted
component, please let me know.
I cut and pasted the above from a lkml archive, so sorry if this
messes up your mail thread. I'm not on lkml here at home, so
please cc me on any replies.
Steven
^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <200405131723.15752.elenstev@mesatop.com>]
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
[not found] <200405131723.15752.elenstev@mesatop.com>
@ 2004-05-14 16:53 ` Andy Isaacson
2004-05-15 0:54 ` Steven Cole
0 siblings, 1 reply; 27+ messages in thread
From: Andy Isaacson @ 2004-05-14 16:53 UTC (permalink / raw)
To: Steven Cole; +Cc: Steven Cole, support, Andrew Morton, torvalds, Linux Kernel
Apologies for the enormous quote, but I wanted to get the lspci and
dmesg in here, in case someone else has some insight.
On Thu, May 13, 2004 at 05:23:15PM -0600, Steven Cole wrote:
> [steven@spc steven]$ lspcidrake
> intel-agp : Intel Corporation|440BX/ZX - 82443BX/ZX Host bridge [BRIDGE_HOST]
> unknown : Intel Corporation|440BX/ZX - 82443BX/ZX AGP bridge [BRIDGE_PCI]
> unknown : Intel Corporation|82371AB PIIX4 ISA [BRIDGE_ISA]
> unknown : Intel Corporation|82371AB PIIX4 IDE [STORAGE_IDE]
> usb-uhci : Intel Corporation|82371AB PIIX4 USB [SERIAL_USB]
> sonypi : Intel Corporation|82371AB PIIX4 ACPI - Bus Master IDE Controller [BRIDGE_OTHER]
> es1371 : Creative Labs|Sound Blaster AudioPCI64V/AudioPCI128 [MULTIMEDIA_AUDIO]
> 3c59x : 3Com Corporation|3c905B 100BaseTX [Cyclone] [NETWORK_ETHERNET]
> unknown : Promise Technology, Inc.|20262 (Ultra66) [STORAGE_OTHER]
> Card:RIVA TNT : nVidia Corporation|Riva TNT 128 [DISPLAY_VGA]
[snip]
> Linux version 2.6.6 (steven@spc.mesatop.com) (gcc version 3.3.2 (Mandrake Linux 10.0 3.3.2-6mdk)) #105 Sun May 9 22:00:07 MDT 2004
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
> BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e7000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000040fd800 (usable)
> BIOS-e820: 00000000040fd800 - 00000000040ff800 (ACPI data)
> BIOS-e820: 00000000040ff800 - 00000000040ffc00 (ACPI NVS)
> BIOS-e820: 00000000040ffc00 - 0000000018000000 (usable)
> BIOS-e820: 00000000fffe7000 - 0000000100000000 (reserved)
> 384MB LOWMEM available.
> On node 0 totalpages: 98304
> DMA zone: 4096 pages, LIFO batch:1
> Normal zone: 94208 pages, LIFO batch:16
> HighMem zone: 0 pages, LIFO batch:1
> DMI 2.1 present.
> ACPI disabled because your bios is from 1999 and too old
> You can enable it with acpi=force
> Built 1 zonelists
> Kernel command line: auto BOOT_IMAGE=2.6-bk ro root=306 devfs=nomount acpi=ht resume=/dev/hda10 splash=silent
> Initializing CPU#0
> PID hash table entries: 2048 (order 11: 16384 bytes)
> Detected 448.795 MHz processor.
> Using tsc for high-res timesource
> Console: colour VGA+ 80x25
> Memory: 386260k/393216k available (1999k kernel code, 6184k reserved, 548k data, 316k init, 0k highmem)
> Checking if this processor honours the WP bit even in supervisor mode... Ok.
> Calibrating delay loop... 886.78 BogoMIPS
> Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
> Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
> Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> CPU: After generic identify, caps: 0383f9ff 00000000 00000000 00000000
> CPU: After vendor identify, caps: 0383f9ff 00000000 00000000 00000000
> CPU: L1 I cache: 16K, L1 D cache: 16K
> CPU: L2 cache: 512K
> CPU: After all inits, caps: 0383f9ff 00000000 00000000 00000040
> CPU: Intel Pentium III (Katmai) stepping 02
> Enabling fast FPU save and restore... done.
> Enabling unmasked SIMD FPU exception support... done.
> Checking 'hlt' instruction... OK.
> POSIX conformance testing by UNIFIX
> NET: Registered protocol family 16
> PCI: PCI BIOS revision 2.10 entry at 0xfd983, last bus=1
> PCI: Using configuration type 1
> Linux Plug and Play Support v0.97 (c) Adam Belay
> usbcore: registered new driver usbfs
> usbcore: registered new driver hub
> PCI: Probing PCI hardware
> PCI: Probing PCI hardware (bus 00)
> PCI: Using IRQ router PIIX/ICH [8086/7110] at 0000:00:07.0
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
> devfs: 2004-01-31 Richard Gooch (rgooch@atnf.csiro.au)
> devfs: boot_options: 0x0
> NTFS driver 2.1.8 [Flags: R/O].
> Limiting direct PCI/PCI transfers.
> isapnp: Scanning for PnP cards...
> isapnp: Card 'U.S. Robotics 56K FAX INT'
> isapnp: 1 Plug & Play card detected total
> lp: driver loaded but no devices found
> Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled
> ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> pnp: Device 00:01.00 activated.
> ttyS1 at I/O 0x2f8 (irq = 10) is a 16550A
> parport0: PC-style at 0x378 (0x778) [PCSPP(,...)]
> parport0: irq 7 detected
> lp0: using parport0 (polling).
> Using anticipatory io scheduler
> Floppy drive(s): fd0 is 1.44M
> FDC 0 is a post-1991 82077
> loop: loaded (max 8 devices)
> PPP generic driver version 2.4.2
> PPP Deflate Compression module registered
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> PDC20262: IDE controller at PCI slot 0000:00:0f.0
> PCI: Found IRQ 5 for device 0000:00:0f.0
> PDC20262: chipset revision 1
> PDC20262: 100% native mode on irq 5
> PDC20262: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
> ide0: BM-DMA at 0x10c0-0x10c7, BIOS settings: hda:DMA, hdb:DMA
> ide1: BM-DMA at 0x10c8-0x10cf, BIOS settings: hdc:DMA, hdd:pio
> hda: Maxtor 5T040H4, ATA DISK drive
> hdb: ST317221A, ATA DISK drive
> ide0 at 0x1440-0x1447,0x1436 on irq 5
> hda: max request size: 128KiB
> hda: Host Protected Area detected.
> current capacity is 78125000 sectors (40000 MB)
> native capacity is 80043264 sectors (40982 MB)
> hda: 78125000 sectors (40000 MB) w/2048KiB Cache, CHS=65535/16/63
> /dev/ide/host0/bus0/target0/lun0: p1 p2 < p5 p6 p7 p8 p9 p10 >
> hdb: max request size: 128KiB
> hdb: 33683328 sectors (17245 MB) w/512KiB Cache, CHS=33416/16/63
> /dev/ide/host0/bus0/target1/lun0: p1 p2 < p5 p6 p7 p8 p9 >
> ide-floppy driver 0.99.newide
> mice: PS/2 mouse device common for all mice
> input: PC Speaker
> serio: i8042 AUX port at 0x60,0x64 irq 12
> input: ImPS/2 Generic Wheel Mouse on isa0060/serio1
> serio: i8042 KBD port at 0x60,0x64 irq 1
> input: AT Translated Set 2 keyboard on isa0060/serio0
> Advanced Linux Sound Architecture Driver Version 1.0.4rc2 (Tue Mar 30 08:19:30 2004 UTC).
> PCI: Found IRQ 11 for device 0000:00:0c.0
> PCI: Sharing IRQ 11 with 0000:00:0e.0
> PCI: Sharing IRQ 11 with 0000:01:00.0
> ALSA device list:
> #0: Ensoniq AudioPCI ENS1371 at 0x1080, irq 11
> NET: Registered protocol family 2
> IP: routing cache hash table of 4096 buckets, 32Kbytes
> TCP: Hash tables configured (established 32768 bind 65536)
> NET: Registered protocol family 1
> NET: Registered protocol family 17
> kjournald starting. Commit interval 5 seconds
> EXT3-fs: mounted filesystem with ordered data mode.
> VFS: Mounted root (ext3 filesystem) readonly.
> Freeing unused kernel memory: 316k freed
> EXT3 FS on hda6, internal journal
> Adding 818960k swap on /dev/hda10. Priority:-1 extents:1
> Adding 248968k swap on /dev/hdb5. Priority:-2 extents:1
> found reiserfs format "3.6" with standard journal
> reiserfs: using ordered data mode
> Reiserfs journal params: device hda9, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
> reiserfs: checking transaction log (hda9) for (hda9)
> Using r5 hash to sort names
> NTFS volume version 3.1.
> kjournald starting. Commit interval 5 seconds
> EXT3 FS on hda7, internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> kjournald starting. Commit interval 5 seconds
> EXT3 FS on hda8, internal journal
[snip]
> [steven@spc testing-2.6]$ bk changes -r+ -nd:KEY:
> geert@linux-m68k.org[torvalds]|ChangeSet|20040511145430|25087
Here's "grep = .config":
> CONFIG_X86=y
> CONFIG_MMU=y
> CONFIG_UID16=y
> CONFIG_GENERIC_ISA_DMA=y
> CONFIG_EXPERIMENTAL=y
> CONFIG_CLEAN_COMPILE=y
> CONFIG_BROKEN_ON_SMP=y
> CONFIG_SWAP=y
> CONFIG_SYSVIPC=y
> CONFIG_SYSCTL=y
> CONFIG_LOG_BUF_SHIFT=14
> CONFIG_KALLSYMS=y
> CONFIG_FUTEX=y
> CONFIG_EPOLL=y
> CONFIG_IOSCHED_NOOP=y
> CONFIG_IOSCHED_AS=y
> CONFIG_IOSCHED_DEADLINE=y
> CONFIG_IOSCHED_CFQ=y
> CONFIG_X86_PC=y
> CONFIG_MPENTIUMIII=y
> CONFIG_X86_CMPXCHG=y
> CONFIG_X86_XADD=y
> CONFIG_X86_L1_CACHE_SHIFT=5
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y
> CONFIG_X86_WP_WORKS_OK=y
> CONFIG_X86_INVLPG=y
> CONFIG_X86_BSWAP=y
> CONFIG_X86_POPAD_OK=y
> CONFIG_X86_GOOD_APIC=y
> CONFIG_X86_INTEL_USERCOPY=y
> CONFIG_X86_USE_PPRO_CHECKSUM=y
> CONFIG_PREEMPT=y
> CONFIG_X86_TSC=y
> CONFIG_NOHIGHMEM=y
> CONFIG_HAVE_DEC_LOCK=y
> CONFIG_REGPARM=y
> CONFIG_ACPI_BOOT=y
> CONFIG_PCI=y
> CONFIG_PCI_GOANY=y
> CONFIG_PCI_BIOS=y
> CONFIG_PCI_DIRECT=y
> CONFIG_PCI_MMCONFIG=y
> CONFIG_PCI_NAMES=y
> CONFIG_ISA=y
> CONFIG_BINFMT_ELF=y
> CONFIG_BINFMT_AOUT=y
> CONFIG_BINFMT_MISC=y
> CONFIG_PARPORT=y
> CONFIG_PARPORT_PC=y
> CONFIG_PARPORT_PC_CML1=y
> CONFIG_PARPORT_PC_FIFO=y
> CONFIG_PNP=y
> CONFIG_ISAPNP=y
> CONFIG_BLK_DEV_FD=y
> CONFIG_BLK_DEV_LOOP=y
> CONFIG_LBD=y
> CONFIG_IDE=y
> CONFIG_BLK_DEV_IDE=y
> CONFIG_BLK_DEV_IDEDISK=y
> CONFIG_BLK_DEV_IDECD=y
> CONFIG_BLK_DEV_IDEFLOPPY=y
> CONFIG_IDE_GENERIC=y
> CONFIG_BLK_DEV_IDEPCI=y
> CONFIG_IDEPCI_SHARE_IRQ=y
> CONFIG_BLK_DEV_OFFBOARD=y
> CONFIG_BLK_DEV_IDEDMA_PCI=y
> CONFIG_BLK_DEV_ADMA=y
> CONFIG_BLK_DEV_PDC202XX_OLD=y
> CONFIG_BLK_DEV_IDEDMA=y
> CONFIG_NET=y
> CONFIG_PACKET=y
> CONFIG_UNIX=y
> CONFIG_INET=y
> CONFIG_SYN_COOKIES=y
> CONFIG_NETDEVICES=y
> CONFIG_PPP=y
> CONFIG_PPP_ASYNC=y
> CONFIG_PPP_DEFLATE=y
> CONFIG_INPUT=y
> CONFIG_INPUT_MOUSEDEV=y
> CONFIG_INPUT_MOUSEDEV_PSAUX=y
> CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
> CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
> CONFIG_SOUND_GAMEPORT=y
> CONFIG_SERIO=y
> CONFIG_SERIO_I8042=y
> CONFIG_INPUT_KEYBOARD=y
> CONFIG_KEYBOARD_ATKBD=y
> CONFIG_INPUT_MOUSE=y
> CONFIG_MOUSE_PS2=y
> CONFIG_INPUT_MISC=y
> CONFIG_INPUT_PCSPKR=y
> CONFIG_VT=y
> CONFIG_VT_CONSOLE=y
> CONFIG_HW_CONSOLE=y
> CONFIG_SERIAL_8250=y
> CONFIG_SERIAL_8250_NR_UARTS=4
> CONFIG_SERIAL_CORE=y
> CONFIG_UNIX98_PTYS=y
> CONFIG_PRINTER=y
> CONFIG_DRM=y
> CONFIG_RAW_DRIVER=y
> CONFIG_MAX_RAW_DEVS=256
> CONFIG_FB=y
> CONFIG_VIDEO_SELECT=y
> CONFIG_VGA_CONSOLE=y
> CONFIG_DUMMY_CONSOLE=y
> CONFIG_SOUND=y
> CONFIG_SND=y
> CONFIG_SND_TIMER=y
> CONFIG_SND_PCM=y
> CONFIG_SND_RAWMIDI=y
> CONFIG_SND_SEQUENCER=y
> CONFIG_SND_OSSEMUL=y
> CONFIG_SND_MIXER_OSS=y
> CONFIG_SND_PCM_OSS=y
> CONFIG_SND_SEQUENCER_OSS=y
> CONFIG_SND_AC97_CODEC=y
> CONFIG_SND_ENS1371=y
> CONFIG_USB=y
> CONFIG_USB_DEVICEFS=y
> CONFIG_EXT2_FS=y
> CONFIG_EXT3_FS=y
> CONFIG_JBD=y
> CONFIG_REISERFS_FS=y
> CONFIG_QUOTA=y
> CONFIG_QUOTACTL=y
> CONFIG_AUTOFS_FS=y
> CONFIG_AUTOFS4_FS=y
> CONFIG_ISO9660_FS=y
> CONFIG_FAT_FS=y
> CONFIG_MSDOS_FS=y
> CONFIG_VFAT_FS=y
> CONFIG_NTFS_FS=y
> CONFIG_PROC_FS=y
> CONFIG_PROC_KCORE=y
> CONFIG_SYSFS=y
> CONFIG_DEVFS_FS=y
> CONFIG_RAMFS=y
> CONFIG_MSDOS_PARTITION=y
> CONFIG_NLS=y
> CONFIG_NLS_DEFAULT="iso8859-1"
> CONFIG_NLS_CODEPAGE_850=y
> CONFIG_NLS_ISO8859_1=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_EARLY_PRINTK=y
> CONFIG_MAGIC_SYSRQ=y
> CONFIG_ZLIB_INFLATE=y
> CONFIG_ZLIB_DEFLATE=y
> CONFIG_X86_BIOS_REBOOT=y
> CONFIG_X86_STD_RESOURCES=y
> CONFIG_PC=y
So, in the oddball config department, you've got an ISAPnP modem over
which you're running PPP; CONFIG_PREEMPT is on.
That corruption size really does make me think of network packets, so
I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
link? "ifconfig ppp0" or something like that.
Can you try doing something like
    #!/bin/sh
    x=0
    while true; do
        bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
        (cd foo; bk pull -q)
        rm -rf foo
        x=`expr $x + 1`
        echo -n "$x "
    done
(I just pulled that key at random out of the kernel repository; there's
nothing special about it except that it's far enough back for the revert
and pull to be very involved operations.)
That ought to do a nice test of the CPU, memory, disk, and kernel sans
PPP. If that loop runs for, say, 10 iterations without errors, keep it
running and try doing some non-BK network IO for a half hour (or two
iterations of the clone/pull loop, whichever is longer) and see if it
fails. You might want to increase the runtimes, say, overnight and two
hours of network activity, if you don't see any failures.
This test is designed to check the theory that in your config, PPP
somehow corrupts random buffer cache pages.
On Fri, May 14, 2004 at 07:46:17AM -0700, Larry McVoy wrote:
> My instinct is that this is a file system or VM problem. Here's why: BK
> wraps its data in multiple integrity checks. When you are doing a pull,
> the data that is sent across the wire is wrapped both at the individual
> diff level (each delta has a check) as well as a CRC around the whole
> set of diffs and metadata sent. Since Steven is pulling (I believe,
> please confirm) from bkbits.net, we know that the data being generated
> is fine - if it wasn't the whole world would be on our case.
>
> On the receiving side BK splats the entire set of diffs and metadata
> down on disk, checking the CRC, and it doesn't proceed to the apply patch
> part until the entire thing is on disk and checked. Then when the patches
> are applied, each per patch checksum is verified (except, as we recently
> found out, in the case of the changeset file, we added some fast path
> optimization for that and dropped the check on the floor. Oops).
>
> I don't think pppd can be part of the problem because of the way BK is
> designed - you shouldn't have gotten to the place you did if the data
> was corrupted in transit.
I agree, I don't see how it could be an in-flight corruption.
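The verify-before-apply flow Larry describes could be sketched like this. The framing (a 4-byte CRC-32 trailer) and the function names are assumptions for illustration only; BK's actual patch format and checksum algorithm are not described in this thread.

```python
import struct
import zlib

def wrap(payload: bytes) -> bytes:
    """Append a big-endian CRC-32 of the payload (hypothetical framing)."""
    return payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)

def unwrap(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC does not match."""
    if len(blob) < 4:
        raise ValueError("blob too short to hold a CRC")
    payload, (expected,) = blob[:-4], struct.unpack(">I", blob[-4:])
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("CRC mismatch: refusing to apply patch")
    return payload
```

With a check of this kind on the receiving side, data damaged on the wire would be rejected before anything is applied, which is why corruption that only shows up after the check points at something later in the path.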
> If any of pppd/kernel stuff is corrupting in memory pages, that's a
> different matter entirely, that could cause these problems.
This is my current top suspect. Well, that, or rotted hardware with the
most bizarre symptoms I've ever seen. A 16550, or ISA DMA controller,
that just happens to stomp on buffer cache pages?
> The fact that Steven is the only guy seeing this really makes me
> suspicious that it is something with his kernel. I don't think this
> is a memory error problem, those never look like this, they look like
> a few bits being flipped. Blocks of nulls are always file/vm system.
Yeah, I'm sure it's some function of his hardware and config. But
really, how many people do you suppose are running 2.6 with PPP and
PREEMPT? And how many of them would notice a few pages per day of
partial buffer cache trashing? I had a machine with one byte of memory
that gave you "x" 50% of the time, and "x & 0xdf" the other 50% of the
time; it took several months and a fairly serious filesystem blowup
before I noticed enough problems to go run memtest86.
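For reference, the mask in that anecdote clears a single bit: 0xdf is 0xff with bit 5 (0x20) removed. The short sketch below applies the mask to every byte of a made-up sample (the real fault affected only one byte of memory) to show why such an error is easy to overlook in text: on ASCII letters it only changes case.

```python
MASK = 0xDF  # 0b11011111: every bit kept except bit 5 (0x20)

def stuck_bit(data: bytes) -> bytes:
    """Apply the faulty cell's behaviour (x & 0xdf) to every byte of `data`."""
    return bytes(b & MASK for b in data)

def differs(data: bytes) -> int:
    """Count how many bytes of `data` the mask would actually change."""
    return sum(1 for b in data if b & MASK != b)
```

Bytes that already have bit 5 clear, such as uppercase ASCII letters, pass through unchanged, so roughly half of all reads would look correct even without the 50% intermittency.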
-andy
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-14 16:53 ` 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Andy Isaacson
@ 2004-05-15 0:54 ` Steven Cole
2004-05-15 1:55 ` 1352 NUL bytes at the end of a page? Wayne Scott
0 siblings, 1 reply; 27+ messages in thread
From: Steven Cole @ 2004-05-15 0:54 UTC (permalink / raw)
To: Andy Isaacson; +Cc: Steven Cole, support, Andrew Morton, torvalds, Linux Kernel
On Friday 14 May 2004 10:53 am, Andy Isaacson wrote:
> So, in the oddball config department, you've got an ISAPnP modem over
> which you're running PPP; CONFIG_PREEMPT is on.
>
> That corruption size really does make me think of network packets, so
> I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
> link? "ifconfig ppp0" or something like that.
ppp0 Link encap:Point-to-Point Protocol
inet addr:216.31.65.245 P-t-P:216.31.65.1 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:123 errors:0 dropped:0 overruns:0 frame:0
TX packets:152 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:3
RX bytes:77312 (75.5 Kb) TX bytes:8212 (8.0 Kb)
>
> Can you try doing something like
>
> #!/bin/sh
> x=0
> while true; do
> bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
> (cd foo; bk pull -q)
> rm -rf foo
> x=`expr $x + 1`
> echo -n "$x "
> done
>
> (I just pulled that key at random out of the kernel repository; there's
> nothing special about it except that it's far enough back for the revert
> and pull to be very involved operations.)
>
> That ought to do a nice test of the CPU, memory, disk, and kernel sans
> PPP. If that loop runs for, say, 10 iterations without errors, keep it
> running and try doing some non-BK network IO for a half hour (or two
> iterations of the clone/pull loop, whichever is longer) and see if it
> fails. You might want to increase the runtimes, say, overnight and two
> hours of network activity, if you don't see any failures.
>
> This test is designed to check the theory that in your config, PPP
> somehow corrupts random buffer cache pages.
It didn't need PPP to fail.
It looks like it failed on the 7th iteration of the script supplied by Andy.
[snipped list of files]
sound/core/SCCS/s.Kconfig
Your repository should be back to where it was before undo started
We are running a consistency check to verify this.
check passed
Undo failed, repository left locked.
WARNING: deleting orphan file /home/steven/tmp/bk_clone2_mVmrsk
Entire repository is locked by:
RESYNC directory.
ERROR-Unable to lock repository for update.
6
[steven@spc BK]$ ls -ls foo/RESYNC/SCCS/*
40048 -r--r--r-- 1 steven steven 41007273 May 14 18:08 foo/RESYNC/SCCS/s.ChangeSet
68 -r--r--r-- 1 steven steven 67791 May 14 18:11 foo/RESYNC/SCCS/s.CREDITS
76 -r--r--r-- 1 steven steven 75264 May 14 18:11 foo/RESYNC/SCCS/s.MAINTAINERS
124 -r--r--r-- 1 steven steven 124747 May 14 18:11 foo/RESYNC/SCCS/s.Makefile
Let me know if you want any of these files. I can compress them and send them
the usual way.
The kernel was 2.6.6 plus whatever is in Linus' tree, and bk was 3.0.4.
Steven
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-15 0:54 ` Steven Cole
@ 2004-05-15 1:55 ` Wayne Scott
0 siblings, 0 replies; 27+ messages in thread
From: Wayne Scott @ 2004-05-15 1:55 UTC (permalink / raw)
To: elenstev; +Cc: adi, scole, support, akpm, torvalds, linux-kernel
From: Steven Cole <elenstev@mesatop.com>
> [steven@spc BK]$ ls -ls foo/RESYNC/SCCS/*
> 40048 -r--r--r-- 1 steven steven 41007273 May 14 18:08 foo/RESYNC/SCCS/s.ChangeSet
> 68 -r--r--r-- 1 steven steven 67791 May 14 18:11 foo/RESYNC/SCCS/s.CREDITS
> 76 -r--r--r-- 1 steven steven 75264 May 14 18:11 foo/RESYNC/SCCS/s.MAINTAINERS
> 124 -r--r--r-- 1 steven steven 124747 May 14 18:11 foo/RESYNC/SCCS/s.Makefile
>
> Let me know if you want any of these files. I can compress them and send them
> the usual way.
scan the ChangeSet file for nulls.
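One way to do such a scan is sketched below. The function names, the minimum run length, and the 4096-byte page size are assumptions chosen for illustration; this is not a BK tool.

```python
import re
from typing import Iterator, List, Tuple

PAGE_SIZE = 4096  # assumption: i386 page size

def find_nul_runs(data: bytes, min_len: int = 16) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) offsets of NUL runs of at least min_len bytes.

    `end` is exclusive, so end - start is the run length.
    """
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.end()

def report(path: str) -> List[str]:
    """Read a file and describe each NUL run, noting whether it ends on a page boundary."""
    with open(path, "rb") as f:
        data = f.read()
    lines = []
    for start, end in find_nul_runs(data):
        tail = "page-aligned" if end % PAGE_SIZE == 0 else "not page-aligned"
        lines.append(
            f"NUL run start {start:#x} end {end:#x} len {end - start:#x}, end {tail}"
        )
    return lines
```

Calling `report()` on the suspect s.ChangeSet would list any holes; the minimum length keeps isolated NUL bytes from being reported as holes.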
-Wayne
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
@ 2004-05-17 3:36 Steven Cole
2004-05-17 5:17 ` Linus Torvalds
0 siblings, 1 reply; 27+ messages in thread
From: Steven Cole @ 2004-05-17 3:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry McVoy, Andrew Morton, adi, scole, support, linux-kernel
On Sunday 16 May 2004 08:42 pm, Linus Torvalds wrote:
>
> On Sun, 16 May 2004, Larry McVoy wrote:
> >
> > Be aware that how BK does I/O is with write() on the way out but with
> > mmap on the way in. The process which forked renumber has just written
> > the file and the renumber process is reading it with mmap.
> >
> > If there are still any problems with mixing read/write and mmap then that
> > may be a prolem but I would have expected to see things start going
> > wrong on a page boundary and the one core dump I saw was page aligned
> > at the tail but not at the head, it started in the middle of the page.
>
> The kernel should have no problems with mixed read/write and mmap usage,
> although user space obviously needs to synchronize the accesses on its own
> some way. There is no implicit synchronization otherwise, and the mmap
> user can see a partial write at any stage of the write.
>
> Some architectures may have cache coherency issues that makes this
> "interesting", but that's not the case on x86 (or indeed anything else
> remotely sane - virtual caches are just stupid in this day and age).
>
> > I've told my team to drop this unless someone can show that it happens
> > on other kernels, this smells like a kernel bug to me, if it were a BK
> > bug we should have been getting hundreds of complaints by now. We can
> > jump back on it if need be, let us know if you think it is a BK problem
> > after all.
>
> Yeah, I agree. The only other possibility I see is that BK just doesn't
> synchronize, and expects writes to be atomically visible to other
> processes. They aren't. Preemption might just make this a whole lot more
> visible, but on the other hand, so should SMP, so this sounds unlikely.
>
> Linus
>
>
Larry, Linus,
I beat on this for the last day without PREEMPT and no failures at all.
Several kernels, rock solid all.
Rebooted with a current (as of a couple hours ago) kernel and PREEMPT=y,
and after about the third pull into a repository (I have several), splaaat!
Here are the symptoms. Same message from bk as usual:
---------------------------------------------------------------------------
takepatch: saved entire patch in PENDING/2004-05-16.01
---------------------------------------------------------------------------
Applying 15 revisions to ChangeSet renumber: can't read SCCS info in "RESYNC/SCCS/s.ChangeSet".
bk: takepatch.c:1343: applyCsetPatch: Assertion `s && s->tree' failed.
10586 bytes uncompressed to 52074, 4.92X expansion
One of Larry's guys, Rick Smith, sent me a little program (the source
is earlier in this thread) to check for null. I called its executable
saga (see subject line).
[steven@spc SCCS]$ saga <s.ChangeSet
Found null start 0x1550b01 end 0x1551000 len 0x4ff line 535587
Found null start 0x2030b01 end 0x2031000 len 0x4ff line 639039
Found null start 0x2330b01 end 0x2331000 len 0x4ff line 663611
That was in the testing-2.6/RESYNC/SCCS directory of course.
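The three reports can be checked with a few lines of arithmetic. The sketch below assumes the tool's `end` is exclusive (which is what makes `end - start` equal the printed `len`) and a 4096-byte page; the helper name is made up.

```python
PAGE_SIZE = 4096  # assumption: i386 page size

# (start, end) pairs copied from the saga output above
RUNS = [(0x1550B01, 0x1551000), (0x2030B01, 0x2031000), (0x2330B01, 0x2331000)]

def describe(start: int, end: int) -> dict:
    """Summarise one NUL run: its length and where it sits within its page."""
    return {
        "length": end - start,
        "start_in_page": start % PAGE_SIZE,
        "end_page_aligned": end % PAGE_SIZE == 0,
    }

if __name__ == "__main__":
    for s, e in RUNS:
        print(hex(s), describe(s, e))
```

Each run is 0x4ff (1279) bytes long, starts at offset 0xb01 within its page, and ends exactly on a page boundary, so all three holes cover the same tail of a page.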
OK, no more CONFIG_PREEMPT for me. And, ppp failed earlier with:
serial8250: too much work for irq10. That did not happen without
CONFIG_PREEMPT.
I reconnected to my ISP, bk pulled into my main testing repository,
and that's when I got the splaaat.
Steven
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 3:36 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Steven Cole
@ 2004-05-17 5:17 ` Linus Torvalds
2004-05-17 6:11 ` Andrew Morton
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2004-05-17 5:17 UTC (permalink / raw)
To: Steven Cole
Cc: Larry McVoy, Andrew Morton, William Lee Irwin III, hugh, adi,
scole, support, Kernel Mailing List
On Sun, 16 May 2004, Steven Cole wrote:
>
> I beat on this for the last day without PREEMPT and no failures at all.
> Several kernels, rock solid all.
> Rebooted with a current (as of a couple hours ago) kernel and PREEMPT=y,
> and after about the third pull into a repository (I have several), splaaat!
Ok. Good. It's PREEMPT that triggers it. However, I doubt it is
necessarily a preempt bug, I suspect that preempt just opens a window that
is really small even on SMP, and makes it much wider. Wide enough to be
seen.
> One of Larry's guys, Rick Smith, sent me a little program (the source
> is earlier in this thread) to check for null. I called its executable
> saga (see subject line).
>
> [steven@spc SCCS]$ saga <s.ChangeSet
> Found null start 0x1550b01 end 0x1551000 len 0x4ff line 535587
> Found null start 0x2030b01 end 0x2031000 len 0x4ff line 639039
> Found null start 0x2330b01 end 0x2331000 len 0x4ff line 663611
The fact that it's always zeroes, and it's a strange number but it always
ends up being page-aligned at the _end_ makes me strongly suspect that we
have one of the "don't write back data past i_size" things wrong.
There are several cases where we zero the "end of page" before writing
things back - basically everything past the size of the inode is supposed
to be written back as zero. But your symptoms sure sound like we might be
reading the size of the inode without holding the proper locks.
Under normal UP, no races exist, and even on SMP the window is likely
tiny: another CPU has to be _just_ updating something in between the read
of i_size and the clearing just a few instructions later. What preempt
likely does is that getting an interrupt at the right time inside that
window makes the window _huge_.
Andrew, the obvious culprit would be the memset() in fs/buffer.c
(block_write_full_page() to be precise):
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
imagine that the "write()" function updates i_size late - after having
written out the new contents to the page, and _after_ having unlocked the
page, and now we get a writeback at the wrong time, and we decide to clear
out the end of the page because we think it's past i_size.
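The interleaving described above can be modelled in a few lines. This is a toy, single-threaded simulation and not kernel code: the `Inode` class, the helper names, and the single-page file are invented, and the steps are simply run by hand in the suspected racy order.

```python
PAGE_SIZE = 4096

class Inode:
    """Toy model: one page of page cache plus a file size."""
    def __init__(self, i_size: int):
        self.i_size = i_size
        self.page = bytearray(PAGE_SIZE)

def copy_into_page(inode: Inode, pos: int, data: bytes) -> None:
    """write() step 1: copy user data into the page (i_size not yet updated)."""
    inode.page[pos:pos + len(data)] = data

def update_i_size(inode: Inode, new_size: int) -> None:
    """write() step 2: publish the new file size."""
    inode.i_size = max(inode.i_size, new_size)

def writeback(inode: Inode) -> bytes:
    """Zero everything past i_size in the page, then 'write' the page out."""
    offset = inode.i_size % PAGE_SIZE
    if offset:
        inode.page[offset:] = bytes(PAGE_SIZE - offset)
    return bytes(inode.page)

def racy_append(old_size: int, data: bytes) -> bytes:
    """Run the steps in the suspected bad order and return what reaches disk."""
    inode = Inode(old_size)
    inode.page[:old_size] = b"A" * old_size
    copy_into_page(inode, old_size, data)        # new data is in the page
    on_disk = writeback(inode)                   # writeback still sees old i_size
    update_i_size(inode, old_size + len(data))   # size published too late
    return on_disk
```

In this model, appending 0x4ff bytes to a page that held 0xb01 bytes leaves a run of NULs from 0xb01 to the end of the page, the same shape as the holes in the report.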
Andrew, what do you think?
I think this race does exist, since generic_file_aio_write_nolock()
literally _does_ update i_size only after it has written all the pages, so
I don't see why a "block_write_full_page()" couldn't come in there between
and zero them out again at the _old_ i_size boundary.
Do you see anything wrong in my analysis? I think the fix would be to make
sure to update i_size as we go around writing each page, before we unlock
the page.
The DIRECT_IO path does this completely wrong and needs to be taught to do
it page-for-page or something, while the generic_commit_write() path looks
like it should be fairly trivially fixable by just moving the
i_size_write() to _before_ the __block_commit_write() call (which will
unlock the page).
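The effect of the proposed reordering can be illustrated with a small step-by-step model. It is a sketch under stated assumptions, not the kernel's logic: the function and its `size_first` flag are hypothetical, the file is a single page, and writeback is assumed to run right after the page is unlocked.

```python
PAGE_SIZE = 4096

def append_then_writeback(old_size: int, data: bytes, size_first: bool) -> bytes:
    """Return the page as writeback would store it, for either ordering.

    size_first=True models updating i_size before the page is unlocked;
    size_first=False models publishing the size only after the unlock.
    """
    page = bytearray(PAGE_SIZE)
    page[:old_size] = b"A" * old_size
    i_size = old_size

    # write(): copy the new data while holding the page lock
    page[old_size:old_size + len(data)] = data
    if size_first:
        i_size = old_size + len(data)   # proposed order: size first, then unlock

    # The page is unlocked here and writeback runs: zero the tail past i_size.
    offset = i_size % PAGE_SIZE
    if offset:
        page[offset:] = bytes(PAGE_SIZE - offset)
    stored = bytes(page)

    if not size_first:
        i_size = old_size + len(data)   # late update: writeback already ran
    return stored
```

With `size_first=True` the appended bytes survive writeback in this model, while with `size_first=False` they are replaced by zeroes up to the end of the page.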
Who else knows this code? Maybe I'm missing something. wli? Hugh?
Alternatively, maybe we could remove the "memset()" entirely from the
block_write_full_page() thing (which is asynchronous and not such a good
place to do this), and instead move the whole thing to some nice
synchronous place where we can make sure that we hold the inode semaphore
or something. Like the last close of a shared-writable mmap - since the
only way we can get non-zero after i_size is with a writable mmap.
And no, I can't guarantee this is the bug, but it does seem a bit
suspicious.
Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 5:17 ` Linus Torvalds
@ 2004-05-17 6:11 ` Andrew Morton
2004-05-17 13:56 ` 1352 NUL bytes at the end of a page? Wayne Scott
0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2004-05-17 6:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: elenstev, lm, wli, hugh, adi, scole, support, linux-kernel
Linus Torvalds <torvalds@osdl.org> wrote:
>
> Andrew, the obvious culprit would be the memset() in fs/buffer.c
> (block_write_full_page() to be precise):
>
> memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
>
> imagine that the "write()" function updates i_size late - after having
> written out the new contents to the page, and _after_ having unlocked the
> page, and now we get a writeback at the wrong time, and we decide to clear
> out the end of the page because we think it's past i_size.
>
> Andrew, what do you think?
Interesting. Playing with i_size like that in writepage() _is_ scary. My
immediate reaction is that if this race was real, it's so gross that we
would have spotted it before now in either 2.4 or 2.5->2.6.
Easy test: Steve, could you remove that memset from
block_write_full_page(), see if it changes anything?
It's not very important - it's there because if an application
(incorrectly) writes to mapped data outside EOF we're supposed to drop
their data and write zeroes instead.
> I think this race does exist, since generic_file_aio_write_nolock()
> literally _does_ update i_size only after it has written all the pages, so
> I don't see why a "block_write_full_page()" couldn't come in there between
> and zero them out again at the _old_ i_size boundary.
i_size is updated in generic_commit_write(), on a per-page basis, or I'm
missing something? I sure hope so.
Let's go through the scenarios.
On entry to block_write_full_page(), i_size is in the middle of this page
somewhere. We're worried that i_size can change, and that this will cause
block_write_full_page() to incorrectly zero out the tail of the page.
Well we can stop right there, because the only way someone can get some
more non-zero user data into this page before we memset and write it is by
locking the page beforehand, and block_write_full_page() has the page lock.
(Or they can write stuff into it via mmap, but writing to the page outside
i_size is an application bug).
Other ways in which i_size can change under block_write_full_page()'s feet
are:
- Someone did a truncate.
No problem - the page is about to be invalidated and chopped off the
file anyway.
- Someone did an extending truncate into another page.
OK, i_size will increase but we're still supposed to write zeroes into
the rest of this page outside the previous i_size.
- Someone extended the file into another page with lseek+write or pwrite.
Same argumentation as with extending truncate.
- Someone did an extending truncate to another i_size which lands
*within* this page.
Writing zeroes is still OK: nobody can get into this page to write new
user data anyway - it's locked.
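The extending-truncate cases above rest on the rule that bytes between the old and the new end of file must read back as zeroes. A minimal userspace check of that rule, assuming a writable /tmp; the function name and the scratch file name are hypothetical:

```c
#define _DEFAULT_SOURCE 1   /* expose mkstemp(), pread(), ftruncate() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Write data_len bytes of 'x' to a scratch file, grow the file to new_size
 * with ftruncate(), and return how many of the bytes between data_len and
 * new_size read back as NUL. Returns -1 on any setup error.
 */
long nuls_after_extending_truncate(size_t data_len, size_t new_size)
{
    char path[] = "/tmp/extend-sketch-XXXXXX";   /* hypothetical scratch name */
    long nuls = 0;
    size_t i;
    int fd;

    assert(data_len <= new_size);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                      /* file goes away once fd is closed */

    for (i = 0; i < data_len; i++) {
        if (write(fd, "x", 1) != 1) {
            close(fd);
            return -1;
        }
    }
    if (ftruncate(fd, (off_t)new_size) != 0) {
        close(fd);
        return -1;
    }

    /* Everything past the old end of file should now be NUL. */
    for (i = data_len; i < new_size; i++) {
        unsigned char c;

        if (pread(fd, &c, 1, (off_t)i) != 1) {
            close(fd);
            return -1;
        }
        if (c == 0)
            nuls++;
    }
    close(fd);
    return nuls;
}
```

With data_len inside the first page and new_size in a later page, the count should equal new_size - data_len.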
Either all that, or I missed something ;) If Steve can try that test it
would be interesting. Even if removing the memset does make the corruption
go away, this might not be a kernel bug - it could be that the application
is incorrectly relying on mmapped writes outside i_size making it to disk.
As for O_DIRECT: I need to think about that a bit more. We hold i_sem and
have done an fdatasync prior to entering generic_file_aio_write_nolock() so
there should be no dirty pagecache at this stage anyway. The VM may decide
to dirty some pagecache and then block_write_full_page() could come in and
look at i_size and race against generic_file_aio_write_nolock()'s O_DIRECT
i_size_write(). But I doubt if bk is using direct-IO in combination with
MAP_SHARED...
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 6:11 ` Andrew Morton
@ 2004-05-17 13:56 ` Wayne Scott
2004-05-17 15:17 ` Theodore Ts'o
0 siblings, 1 reply; 27+ messages in thread
From: Wayne Scott @ 2004-05-17 13:56 UTC (permalink / raw)
To: akpm; +Cc: torvalds, elenstev, lm, wli, hugh, adi, scole, support,
linux-kernel
From: Andrew Morton <akpm@osdl.org>
> Well we can stop right there, because the only way someone can get some
> more non-zero user data into this page before we memset and write it is by
> locking the page beforehand, and block_write_full_page() has the page lock.
> (Or they can write stuff into it via mmap, but writing to the page outside
> i_size is an application bug).
BTW: BitKeeper never opens a writable mmap to a file. The files are
read with mmap() and written by fwriting to a tmp file and then
renaming over the target. And since we run on Windows, no process has
the file open when we are updating it.
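For readers unfamiliar with the pattern, writing to a tmp file with stdio and renaming it over the target looks roughly like this. It is a sketch only, not BitKeeper's code; replace_via_tmp() and file_equals() are hypothetical helpers, and the caller picks the tmp name.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write len bytes from data to tmp with stdio, then rename tmp over
 * target. Returns 0 on success, -1 on failure (tmp is removed on failure).
 */
int replace_via_tmp(const char *target, const char *tmp,
                    const void *data, size_t len)
{
    FILE *fp;

    assert(target && tmp && data);
    fp = fopen(tmp, "wb");
    if (!fp)
        return -1;
    if (fwrite(data, 1, len, fp) != len) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {             /* fclose() flushes the stdio buffer */
        remove(tmp);
        return -1;
    }
    if (rename(tmp, target) != 0) {    /* POSIX: replaces an existing target */
        remove(tmp);
        return -1;
    }
    return 0;
}

/* Return 1 if the file at path holds exactly the len bytes in data. */
int file_equals(const char *path, const void *data, size_t len)
{
    const unsigned char *p = data;
    FILE *fp = fopen(path, "rb");
    int same = 1;
    size_t i;

    if (!fp)
        return 0;
    for (i = 0; i < len; i++) {
        int c = fgetc(fp);

        if (c == EOF || (unsigned char)c != p[i]) {
            same = 0;
            break;
        }
    }
    if (same && fgetc(fp) != EOF)      /* file is longer than expected */
        same = 0;
    fclose(fp);
    return same;
}
```

Note that nothing in this path maps the file writable; all data reaches the kernel through write() calls issued by stdio.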
Still catching up on this thread.
-Wayne
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 13:56 ` 1352 NUL bytes at the end of a page? Wayne Scott
@ 2004-05-17 15:17 ` Theodore Ts'o
2004-05-17 15:20 ` Larry McVoy
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Theodore Ts'o @ 2004-05-17 15:17 UTC (permalink / raw)
To: Wayne Scott
Cc: akpm, torvalds, elenstev, lm, wli, hugh, adi, scole, support,
linux-kernel
On Mon, May 17, 2004 at 08:56:40AM -0500, Wayne Scott wrote:
> From: Andrew Morton <akpm@osdl.org>
> > Well we can stop right there, because the only way someone can get some
> > more non-zero user data into this page before we memset and write it is by
> > locking the page beforehand, and block_write_full_page() has the page lock.
> > (Or they can write stuff into it via mmap, but writing to the page outside
> > i_size is an application bug).
>
> BTW: BitKeeper never opens a writable mmap to a file. The files are
> read with mmap() and written by fwriting to a tmp file and then
> renaming over the target. And since we run on Windows, no process has
> the file open when we are updating it.
Note though that the stdio library uses a writeable mmap to implement
fwrite.
- Ted
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 15:17 ` Theodore Ts'o
@ 2004-05-17 15:20 ` Larry McVoy
2004-05-17 15:22 ` Linus Torvalds
2004-05-17 16:23 ` Davide Libenzi
2 siblings, 0 replies; 27+ messages in thread
From: Larry McVoy @ 2004-05-17 15:20 UTC (permalink / raw)
To: Theodore Ts'o, Wayne Scott, akpm, torvalds, elenstev, lm, wli,
hugh, adi, scole, support, linux-kernel
On Mon, May 17, 2004 at 11:17:38AM -0400, Theodore Ts'o wrote:
> On Mon, May 17, 2004 at 08:56:40AM -0500, Wayne Scott wrote:
> > From: Andrew Morton <akpm@osdl.org>
> > > Well we can stop right there, because the only way someone can get some
> > > more non-zero user data into this page before we memset and write it is by
> > > locking the page beforehand, and block_write_full_page() has the page lock.
> > > (Or they can write stuff into it via mmap, but writing to the page outside
> > > i_size is an application bug).
> >
> > BTW: BitKeeper never opens a writable mmap to a file. The files are
> > read with mmap() and written by fwriting to a tmp file and then
> > renaming over the target. And since we run on Windows, no process has
> > the file open when we are updating it.
>
> Note though that the stdio library uses a writeable mmap to implement
> fwrite.
That's news to me. And we use fwrite.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 15:17 ` Theodore Ts'o
2004-05-17 15:20 ` Larry McVoy
@ 2004-05-17 15:22 ` Linus Torvalds
2004-05-17 15:25 ` Larry McVoy
` (2 more replies)
2004-05-17 16:23 ` Davide Libenzi
2 siblings, 3 replies; 27+ messages in thread
From: Linus Torvalds @ 2004-05-17 15:22 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Wayne Scott, akpm, elenstev, lm, wli, hugh, adi, scole, support,
linux-kernel
On Mon, 17 May 2004, Theodore Ts'o wrote:
>
> Note though that the stdio library uses a writeable mmap to implement
> fwrite.
It does? Whee. Then I'll have to agree with Andrew - if there is a path
that is more likely to have bugs, it's trying to do writes with mmap and
ftruncate.
Who came up with that braindead idea? Is it some crazed Mach developer
that infiltrated the glibc development group?
Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 15:22 ` Linus Torvalds
@ 2004-05-17 15:25 ` Larry McVoy
2004-05-17 15:37 ` viro
2004-05-17 15:40 ` Arjan van de Ven
2 siblings, 0 replies; 27+ messages in thread
From: Larry McVoy @ 2004-05-17 15:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Theodore Ts'o, Wayne Scott, akpm, elenstev, lm, wli, hugh,
adi, scole, support, linux-kernel
On Mon, May 17, 2004 at 08:22:10AM -0700, Linus Torvalds wrote:
> On Mon, 17 May 2004, Theodore Ts'o wrote:
> > Note though that the stdio library uses a writeable mmap to implement
> > fwrite.
>
> It does? Whee. Then I'll have to agree with Andrew - if there is a path
> that is more likely to have bugs, it's trying to do writes with mmap and
> ftruncate.
>
> Who came up with that braindead idea? Is it some crazed Mach developer
> that infiltrated the glibc development group?
You have my agreement on the craziness of this idea. It's a lot easier for
the kernel to do smart things with write behind with write() rather than
mmap()-ed pages being touched. SunOS is the only system I know which does
both "correctly" (correctly meaning the same way whether it is mmap or
write).
It's also a lose to do mmap() instead of read/write for small files. Linux
is light weight enough that the cross over point is pretty small, probably
under 8K and certainly under 16K, but still.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 15:22 ` Linus Torvalds
2004-05-17 15:25 ` Larry McVoy
@ 2004-05-17 15:37 ` viro
2004-05-17 17:30 ` Steven Cole
2004-05-17 15:40 ` Arjan van de Ven
2 siblings, 1 reply; 27+ messages in thread
From: viro @ 2004-05-17 15:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Theodore Ts'o, Wayne Scott, akpm, elenstev, lm, wli, hugh,
adi, scole, support, linux-kernel
On Mon, May 17, 2004 at 08:22:10AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 17 May 2004, Theodore Ts'o wrote:
> >
> > Note though that the stdio library uses a writeable mmap to implement
> > fwrite.
>
> It does? Whee. Then I'll have to agree with Andrew - if there is a path
> that is more likely to have bugs, it's trying to do writes with mmap and
> ftruncate.
>
> Who came up with that braindead idea? Is it some crazed Mach developer
> that infiltrated the glibc development group?
IIRC, that idiocy had been disabled by default (note that it's inherently
broken, since truncate() between your mmap() and memcpy() will lead to
a coredump, which is not something fwrite() is allowed to do in such
situation).
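The truncate hazard can be reproduced in isolation. The sketch below is an illustration on Linux, not glibc's stdio code: it maps one page of a scratch file, truncates the file to zero, then touches the mapping and reports whether SIGBUS arrived. The function name is hypothetical, a writable /tmp is assumed, and a store through a writable shared mapping would fault the same way as the load used here.

```c
#define _DEFAULT_SOURCE 1   /* expose mkstemp(), sigaction(), sigsetjmp() under strict -std= modes */
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_env;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(bus_env, 1);            /* unwind out of the faulting load */
}

/*
 * Returns 1 if touching the mapping after truncate() raised SIGBUS,
 * 0 if the load went through, -1 on setup error.
 */
int load_after_truncate_raises_sigbus(void)
{
    char path[] = "/tmp/sigbus-sketch-XXXXXX";   /* hypothetical scratch name */
    long page = sysconf(_SC_PAGESIZE);
    struct sigaction sa, old;
    volatile int result = -1;
    volatile unsigned char sink = 0;
    unsigned char *map;
    int fd;

    if (page <= 0)
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)page) != 0) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, (size_t)page, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0) {
        munmap(map, (size_t)page);
        close(fd);
        return -1;
    }

    if (sigsetjmp(bus_env, 1) == 0) {
        if (ftruncate(fd, 0) == 0) {   /* file shrinks under the mapping */
            sink = map[0];             /* page is now wholly past EOF */
            result = 0;
        }
    } else {
        result = 1;                    /* arrived here via the handler */
    }

    sigaction(SIGBUS, &old, NULL);
    munmap(map, (size_t)page);
    close(fd);
    (void)sink;
    return result;
}
```

A library routine has no business installing a SIGBUS handler like this behind the application's back, which is why an mmap-based stdio cannot safely be the default.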
strace should show if such mmap calls are made, anyway. Did they
show up in the traces?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 15:37 ` viro
@ 2004-05-17 17:30 ` Steven Cole
2004-05-17 17:40 ` viro
0 siblings, 1 reply; 27+ messages in thread
From: Steven Cole @ 2004-05-17 17:30 UTC (permalink / raw)
To: viro
Cc: hugh, elenstev, linux-kernel, support, Linus Torvalds,
Wayne Scott, adi, akpm, wli, lm, Theodore Ts'o
On May 17, 2004, at 9:37 AM, viro@parcelfarce.linux.theplanet.co.uk
wrote:
> On Mon, May 17, 2004 at 08:22:10AM -0700, Linus Torvalds wrote:
>>
>>
>> On Mon, 17 May 2004, Theodore Ts'o wrote:
>>>
>>> Note though that the stdio library uses a writeable mmap to implement
>>> fwrite.
>>
>> It does? Whee. Then I'll have to agree with Andrew - if there is a
>> path
>> that is more likely to have bugs, it's trying to do writes with mmap
>> and
>> ftruncate.
>>
>> Who came up with that braindead idea? Is it some crazed Mach developer
>> that infiltrated the glibc development group?
>
> IIRC, that idiocy had been disabled by default (note that it's
> inherently
> broken, since truncate() between your mmap() and memcpy() will lead to
> a coredump, which is not something fwrite() is allowed to do in such
> situation).
>
> strace should show if such mmap calls are made, anyway. Did they
> show up in the traces?
>
>
These calls show up in an strace which I did from a non-failing system,
but which has the same glibc as the failing system:
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
old_mmap(NULL, 19184, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40018000
The command was the following, with the result "Nothing to pull".
strace bk pull bk://linux.bkbits.net/linux-2.5
There were 52 instances of mmap2 or old_mmap in the saved script log.
Steven
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 17:30 ` Steven Cole
@ 2004-05-17 17:40 ` viro
2004-05-17 17:39 ` Steven Cole
0 siblings, 1 reply; 27+ messages in thread
From: viro @ 2004-05-17 17:40 UTC (permalink / raw)
To: Steven Cole
Cc: hugh, elenstev, linux-kernel, support, Linus Torvalds,
Wayne Scott, adi, akpm, wli, lm, Theodore Ts'o
On Mon, May 17, 2004 at 11:30:36AM -0600, Steven Cole wrote:
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
rw anonymous - that has nothing to do with any IO.
> old_mmap(NULL, 19184, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40018000
read-only, whatever file that was.
Was there anything with PROT_WRITE and without MAP_ANONYMOUS?
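The filter being applied to the strace output can be written down as a small predicate. This is a sketch with a hypothetical function name; it looks only at the prot and flags arguments passed to mmap() and ignores any later mprotect(). It also adds one condition beyond the two asked about above: the mapping must be MAP_SHARED, since stores to a MAP_PRIVATE mapping are never written back to the file.

```c
#define _DEFAULT_SOURCE 1   /* glibc exposes MAP_ANONYMOUS only with this under strict -std= modes */
#include <sys/mman.h>

/*
 * Return 1 if stores through a mapping created with these mmap()
 * arguments can end up in the underlying file, 0 otherwise.
 */
int mapping_can_dirty_file(int prot, int flags)
{
    if (!(prot & PROT_WRITE))
        return 0;                      /* read-only: nothing to write back */
    if (flags & MAP_ANONYMOUS)
        return 0;                      /* no file behind the mapping */
    return (flags & MAP_SHARED) != 0;  /* private stores stay in the process */
}
```

Both calls quoted above come out as 0: the first is anonymous and the second lacks PROT_WRITE.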
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 17:40 ` viro
@ 2004-05-17 17:39 ` Steven Cole
2004-05-17 19:06 ` viro
0 siblings, 1 reply; 27+ messages in thread
From: Steven Cole @ 2004-05-17 17:39 UTC (permalink / raw)
To: viro
Cc: hugh, linux-kernel, support, Linus Torvalds, Wayne Scott, adi,
Andrew Morton, wli, lm, Theodore Ts'o
On Mon, 2004-05-17 at 11:40, viro@parcelfarce.linux.theplanet.co.uk
wrote:
> On Mon, May 17, 2004 at 11:30:36AM -0600, Steven Cole wrote:
>
> > mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
>
> rw anonymous - that has nothing to do with any IO.
>
> > old_mmap(NULL, 19184, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40018000
>
> read-only, whatever file that was.
>
> Was there anything with PROT_WRITE and without MAP_ANONYMOUS?
Yes, seven of the 52 references to mmap in the strace output met
the above criteria:
old_mmap(0x4015f000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x142000) = 0x4015f000
old_mmap(0x40170000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x9000) = 0x40170000
old_mmap(0x40180000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x9000) = 0x40180000
old_mmap(0x40191000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x10000) = 0x40191000
old_mmap(0x4019c000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x7000) = 0x4019c000
old_mmap(0x401a0000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x3000) = 0x401a0000
old_mmap(0x401af000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xe000) = 0x401af000
Steven
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 17:39 ` Steven Cole
@ 2004-05-17 19:06 ` viro
0 siblings, 0 replies; 27+ messages in thread
From: viro @ 2004-05-17 19:06 UTC (permalink / raw)
To: Steven Cole
Cc: hugh, linux-kernel, support, Linus Torvalds, Wayne Scott, adi,
Andrew Morton, wli, lm, Theodore Ts'o
On Mon, May 17, 2004 at 11:39:58AM -0600, Steven Cole wrote:
> Yes, seven of the 52 references to mmap in the strace output met
> the above criteria:
>
> old_mmap(0x4015f000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x142000) = 0x4015f000
> old_mmap(0x40170000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x9000) = 0x40170000
> old_mmap(0x40180000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x9000) = 0x40180000
> old_mmap(0x40191000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x10000) = 0x40191000
> old_mmap(0x4019c000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x7000) = 0x4019c000
> old_mmap(0x401a0000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0x3000) = 0x401a0000
> old_mmap(0x401af000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 4, 0xe000) = 0x401af000
Shared libraries. And no, those will never lead to any writes, no matter how
you modify them (MAP_PRIVATE). Which is the point, since that's where we
are doing relocations and we definitely do not want that to hit the disk ;-)
IOW, we can remove writes of dirtied mmap'ed pages from consideration.
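A quick way to see the MAP_PRIVATE point for yourself, as a sketch assuming Linux and a writable /tmp (the function and scratch file names are hypothetical): store through a writable mapping, unmap it, then read the file back with pread().

```c
#define _DEFAULT_SOURCE 1   /* expose mkstemp(), pread() under strict -std= modes */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a scratch file holding "aaaa", map it read-write with map_flags
 * (MAP_PRIVATE or MAP_SHARED), store 'b' into the first byte through the
 * mapping, unmap, and return the first byte of the file as seen by
 * pread(). Returns -1 on error.
 */
int first_byte_after_mapped_store(int map_flags)
{
    char path[] = "/tmp/mapflags-sketch-XXXXXX";  /* hypothetical scratch name */
    const size_t len = 4;
    int result = -1;
    char *map;
    char c;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "aaaa", len) != (ssize_t)len)
        goto out;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flags, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    map[0] = 'b';                       /* dirty the mapped page */
    if ((map_flags & MAP_SHARED) && msync(map, len, MS_SYNC) != 0) {
        munmap(map, len);
        goto out;
    }
    if (munmap(map, len) != 0)
        goto out;

    if (pread(fd, &c, 1, 0) == 1)
        result = (unsigned char)c;
out:
    close(fd);
    return result;
}
```

With MAP_PRIVATE the file still starts with 'a'; with MAP_SHARED it starts with 'b'. The seven mappings listed above are all of the first kind.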
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 15:22 ` Linus Torvalds
2004-05-17 15:25 ` Larry McVoy
2004-05-17 15:37 ` viro
@ 2004-05-17 15:40 ` Arjan van de Ven
2004-05-17 15:53 ` Steven Cole
2 siblings, 1 reply; 27+ messages in thread
From: Arjan van de Ven @ 2004-05-17 15:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Theodore Ts'o, Wayne Scott, akpm, elenstev, lm, wli, hugh,
adi, scole, support, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 330 bytes --]
>
> Who came up with that braindead idea? Is it some crazed Mach developer
> that infiltrated the glibc development
afaik it's optional and off by default, for reads it sort of kinda makes
sense but it can't be on by default otherwise a truncate would cause
fscanf() to throw a sigbus, that's not legal posix wise.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 15:40 ` Arjan van de Ven
@ 2004-05-17 15:53 ` Steven Cole
0 siblings, 0 replies; 27+ messages in thread
From: Steven Cole @ 2004-05-17 15:53 UTC (permalink / raw)
To: arjanv
Cc: hugh, elenstev, linux-kernel, support, Linus Torvalds,
Wayne Scott, adi, akpm, wli, lm, Theodore Ts'o
On May 17, 2004, at 9:40 AM, Arjan van de Ven wrote:
>
>>
>> Who came up with that braindead idea? Is it some crazed Mach developer
>> that infiltrated the glibc development
>
> afaik it's optional and off by default, for reads it sort of kinda
> makes
> sense but it can't be on by default otherwise a truncate would cause
> fscanf() to throw a sigbus, that's not legal posix wise.
>
>
For what it's worth, here is the glibc information on a system which
has the same distribution as the system at home which hits this bug:
[steven@spc2 testing-2.6]$ /lib/libc.so.6
GNU C Library stable release version 2.3.3, by Roland McGrath et al.
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.3.2 (Mandrake Linux 10.0 3.3.2-4mdk).
Compiled on a Linux 2.6.0 system on 2004-02-16.
Available extensions:
GNU libio by Per Bothner
crypt add-on version 2.1 by Michael Glad and others
linuxthreads-0.10 by Xavier Leroy
BIND-8.2.3-T5B
libthread_db work sponsored by Alpha Processor Inc
NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
Report bugs using the `glibcbug' script to <bugs@gnu.org>.
Steven
------------------------------------------------------------------------
Steven Cole <scole@lanl.gov>
MacOS X 10.3.3 Panther
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 15:17 ` Theodore Ts'o
2004-05-17 15:20 ` Larry McVoy
2004-05-17 15:22 ` Linus Torvalds
@ 2004-05-17 16:23 ` Davide Libenzi
2004-05-17 16:28 ` Davide Libenzi
2 siblings, 1 reply; 27+ messages in thread
From: Davide Libenzi @ 2004-05-17 16:23 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Wayne Scott, Andrew Morton, Linus Torvalds, elenstev, lm,
William Lee Irwin III, hugh, adi, scole, support,
Linux Kernel Mailing List
On Mon, 17 May 2004, Theodore Ts'o wrote:
> On Mon, May 17, 2004 at 08:56:40AM -0500, Wayne Scott wrote:
> > From: Andrew Morton <akpm@osdl.org>
> > > Well we can stop right there, because the only way someone can get some
> > > more non-zero user data into this page before we memset and write it is by
> > > locking the page beforehand, and block_write_full_page() has the page lock.
> > > (Or they can write stuff into it via mmap, but writing to the page outside
> > > i_size is an application bug).
> >
> > BTW: BitKeeper never opens a writable mmap to a file. The files are
> > read with mmap() and written by fwriting to a tmp file and then
> > renaming over the target. And since we run on Windows, no process has
> > the file open when we are updating it.
>
> Note though that the stdio library uses a writeable mmap to implement
> fwrite.
Strange, it uses read/write but it also opens an mmap(private and
anonymous):
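#include <stdio.h>

/* Read up to 64 KiB from the file named in av[1], then rewrite it in place. */
int main(int ac, char **av) {
    size_t rd;
    FILE *fp;
    static char buf[1024 * 64];

    if (ac < 2) {
        fprintf(stderr, "usage: %s file\n", av[0]);
        return 1;
    }
    fp = fopen(av[1], "r+");
    if (!fp) {
        perror(av[1]);
        return 1;
    }
    rd = fread(buf, 1, sizeof(buf), fp);
    fseek(fp, 0, SEEK_SET);    /* reposition before switching to writing */
    fwrite(buf, 1, rd, fp);
    fflush(fp);
    fclose(fp);
    return 0;
}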
[davide@bigblue davide]$ strace ./foo zzzzzzzzzzzzzzzzzzz
execve("./foo", ["./foo", "zzzzzzzzzzzzzzzzzzz"], [/* 30 vars */]) = 0
uname({sys="Linux", node="bigblue.dev.mdolabs.com", ...}) = 0
brk(0) = 0x9cc8000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=84014, ...}) = 0
old_mmap(NULL, 84014, PROT_READ, MAP_PRIVATE, 3, 0) = 0xbf50b000
close(3) = 0
open("/lib/tls/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\350\270"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1578228, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xbf50a000
old_mmap(0xb79000, 1281996, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xb79000
old_mmap(0xcac000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x132000) = 0xcac000
old_mmap(0xcb0000, 8140, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xcb0000
close(3) = 0
set_thread_area({entry_number:-1 -> 6, base_addr:0xbf50a740, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
munmap(0xbf50b000, 84014) = 0
brk(0) = 0x9cc8000
brk(0x9ce9000) = 0x9ce9000
brk(0) = 0x9ce9000
open("zzzzzzzzzzzzzzzzzzz", O_RDWR) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=188307, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xbf51f000
read(3, "<!DOCTYPE HTML PUBLIC \"-//IETF//"..., 65536) = 65536
_llseek(3, 0, [0], SEEK_SET) = 0
write(3, "<!DOCTYPE HTML PUBLIC \"-//IETF//"..., 65536) = 65536
close(3) = 0
munmap(0xbf51f000, 4096) = 0
exit_group(0)
- Davide
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-17 16:23 ` Davide Libenzi
@ 2004-05-17 16:28 ` Davide Libenzi
0 siblings, 0 replies; 27+ messages in thread
From: Davide Libenzi @ 2004-05-17 16:28 UTC (permalink / raw)
To: Davide Libenzi
Cc: Theodore Ts'o, Wayne Scott, Andrew Morton, Linus Torvalds,
elenstev, lm, William Lee Irwin III, hugh, adi, scole, support,
Linux Kernel Mailing List
On Mon, 17 May 2004, Davide Libenzi wrote:
> Strange, it uses read/write but it also opens an mmap(private and
> anonymous):
That is not file related (I should have had breakfast before posting :)
- Davide
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
@ 2004-05-18 14:38 Linus Torvalds
2004-05-19 10:53 ` Steven Cole
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2004-05-18 14:38 UTC (permalink / raw)
To: Steven Cole
Cc: Andrew Morton, Larry McVoy, mason, wli, hugh, adi, support,
linux-kernel
On Mon, 17 May 2004, Steven Cole wrote:
>
> No problems, and with PREEMPT of course.
Ok. Good. It's a small data-set, but the bug made sense, and so did the
fix.
> > If you see a failure on ext3, please try to analyze the corruption pattern
> > again. It might be something different.
>
> So, I take it that I should revert that one-liner if I want to get any failure data?
> With it, ext3 was pretty solid for this testing.
Yes. That one-liner is bogus. It was a good way to test a hypothesis for
the common case of a filesystem that uses the block_write_full_page thing
(and reiser is one of the few that doesn't), but it wasn't the real fix.
The reiser patch was the real fix for the problem on reiser, but ext3
should have been ok already. It uses (through a lot of other functions)
generic_file_aio_write_nolock() as the real write engine, and that one
calls "commit_write()" with the page lock held.
Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-18 14:38 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Linus Torvalds
@ 2004-05-19 10:53 ` Steven Cole
2004-05-19 12:10 ` Chris Mason
0 siblings, 1 reply; 27+ messages in thread
From: Steven Cole @ 2004-05-19 10:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Larry McVoy, mason, wli, hugh, adi, support,
linux-kernel
On Tuesday 18 May 2004 08:38 am, Linus Torvalds wrote:
>
> On Mon, 17 May 2004, Steven Cole wrote:
> >
> > No problems, and with PREEMPT of course.
>
> Ok. Good. It's a small data-set, but the bug made sense, and so did the
> fix.
Perhaps a final note on this: I did more testing on reiserfs overnight with
Chris' patch, and it survived eleven pulls and unpulls with no failures.
>
> > > If you see a failure on ext3, please try to analyze the corruption pattern
> > > again. It might be something different.
> >
> > So, I take it that I should revert that one-liner if I want to get any failure data?
> > With it, ext3 was pretty solid for this testing.
>
> Yes. That one-liner is bogus. It was a good way to test a hypothesis for
> the common case of a filesystem that uses the block_write_full_page thing
> (and reiser is one of the few that doesn't), but it wasn't the real fix.
> The reiser patch was the real fix for the problem on reiser, but ext3
> should have been ok already. It uses (through a lot of other functions)
> generic_file_aio_write_nolock() as the real write engine, and that one
> calls "commit_write()" with the page lock held.
>
> Linus
I also tested ext3 more extensively (10 pulls/unpulls) and could not repeat
the alleged failure on ext3. That was with akpm's one-liner backed out.
Steven
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-19 10:53 ` Steven Cole
@ 2004-05-19 12:10 ` Chris Mason
2004-05-19 12:20 ` 1352 NUL bytes at the end of a page? Wayne Scott
0 siblings, 1 reply; 27+ messages in thread
From: Chris Mason @ 2004-05-19 12:10 UTC (permalink / raw)
To: Steven Cole
Cc: Linus Torvalds, Andrew Morton, Larry McVoy, wli, hugh, adi,
support, linux-kernel
On Wed, 2004-05-19 at 06:53, Steven Cole wrote:
> On Tuesday 18 May 2004 08:38 am, Linus Torvalds wrote:
> >
> > On Mon, 17 May 2004, Steven Cole wrote:
> > >
> > > No problems, and with PREEMPT of course.
> >
> > Ok. Good. It's a small data-set, but the bug made sense, and so did the
> > fix.
>
> Perhaps a final note on this: I did more testing on reiserfs overnight with
> Chris' patch, and it survived eleven pulls and unpulls with no failures.
Good to hear. We probably still need Andrew's truncate fix, this just
isn't the right workload to show it. Andrew, that reiserfs fix survived
testing here, could you please include it?
-chris
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 12:10 ` Chris Mason
@ 2004-05-19 12:20 ` Wayne Scott
2004-05-19 12:42 ` Nick Piggin
0 siblings, 1 reply; 27+ messages in thread
From: Wayne Scott @ 2004-05-19 12:20 UTC (permalink / raw)
To: mason; +Cc: elenstev, torvalds, akpm, lm, wli, hugh, adi, support,
linux-kernel
From: Chris Mason <mason@suse.com>
> Good to hear. We probably still need Andrew's truncate fix, this just
> isn't the right workload to show it. Andrew, that reiserfs fix survived
> testing here, could you please include it?
>
> -chris
BTW. We have had one other person report a similar failure.
http://db.bitkeeper.com/cgi-bin/bugdb.cgi?.page=view&id=2004-05-19-001
But it sounds like this problem is now understood. It was a pleasure
to watch you guys, and someone should buy Steven a beer. Or perhaps
order a pizza for his family because I suspect this took some of their
time.
-Wayne
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 12:20 ` 1352 NUL bytes at the end of a page? Wayne Scott
@ 2004-05-19 12:42 ` Nick Piggin
2004-05-19 13:28 ` Steven Cole
0 siblings, 1 reply; 27+ messages in thread
From: Nick Piggin @ 2004-05-19 12:42 UTC (permalink / raw)
To: elenstev
Cc: Wayne Scott, mason, torvalds, akpm, lm, wli, hugh, adi, support,
linux-kernel, Andrea Arcangeli
Wayne Scott wrote:
> From: Chris Mason <mason@suse.com>
>
>>Good to hear. We probably still need Andrew's truncate fix, this just
>>isn't the right workload to show it. Andrew, that reiserfs fix survived
>>testing here, could you please include it?
>>
>>-chris
>
>
> BTW. We have had one other person report a similar failure.
>
> http://db.bitkeeper.com/cgi-bin/bugdb.cgi?.page=view&id=2004-05-19-001
>
> But it sounds like this problem is now understood. It was a pleasure
> to watch you guys, and someone should buy Steven a beer. Or perhaps
> order a pizza for his family because I suspect this took some of their
> time.
>
Yep. Thanks for your help Steven.
I don't think anyone has cleared up the performance regression
problem yet though, so I'll have to bug you a bit more.
Steven, with all else being equal, you said you found a 2.6.3 SuSE
kernel to significantly outperform 2.6.6, is that right? If so can
you try the same test with plain 2.6.3 please? We'll go from there.
This one isn't urgent, because I suspect it could be something
specific to the SuSE kernel rather than a regression in Linus' tree
- we've heard no other complaints... so just whenever you get the
chance.
Nick
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 12:42 ` Nick Piggin
@ 2004-05-19 13:28 ` Steven Cole
2004-05-19 13:36 ` Chris Mason
0 siblings, 1 reply; 27+ messages in thread
From: Steven Cole @ 2004-05-19 13:28 UTC (permalink / raw)
To: Nick Piggin
Cc: mason, hugh, elenstev, linux-kernel, torvalds, support,
Wayne Scott, adi, akpm, wli, Andrea Arcangeli, lm
On May 19, 2004, at 6:42 AM, Nick Piggin wrote:
> Wayne Scott wrote:
>> From: Chris Mason <mason@suse.com>
>>> Good to hear. We probably still need Andrew's truncate fix, this
>>> just
>>> isn't the right workload to show it. Andrew, that reiserfs fix
>>> survived
>>> testing here, could you please include it?
>>>
>>> -chris
>> BTW. We have had one other person report a similar failure.
>> http://db.bitkeeper.com/cgi-bin/bugdb.cgi?.page=view&id=2004-05-19-001
I received a report from James H. Cloos Jr. (cc'ed to the rest of you),
but apparently, that report never made it to linux-kernel (I haven't
seen it in the archives or in my lkml file). His report was regarding
similar file corruption on xfs.
<OT for this thread>
I also made a report yesterday regarding a compile problem with
lib/kobject.c and no CONFIG_SYSFS. That post also never made it
to the list. If that happens again, I'll report it to the right folks.
</OT>
>> But it sounds like this problem is now understood. It was a pleasure
>> to watch you guys, and someone should buy Steven a beer. Or perhaps
>> order a pizza for his family because I suspect this took some of their
>> time.
>
> Yep. Thanks for your help Steven.
>
> I don't think anyone has cleared up the performance regression
> problem yet though, so I'll have to bug you a bit more.
>
> Steven, with all else being equal, you said you found a 2.6.3 SuSE
> kernel to significantly outperform 2.6.6, is that right? If so can
> you try the same test with plain 2.6.3 please? We'll go from there.
Actually, it was a Mandrake kernel, 2.6.3-4mdk IIRC. Whatever is
the default with MDK 10. One salient difference with the vendor
kernel is that everything which can be a module is, and I wasn't
using any modules with my kernels. BTW, I was careful to have the
same hdparm settings during the performance testing.
The performance difference was very repeatable. Using the script
provided by Andy Isaacson, the 2.6.3-4mdk did the clone in about
11 minutes total, while the various current kernels took about
15 minutes total. The user times were the same, and the difference
was in system time. Those numbers are from memory, the actual
results should be in the archive.
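For anyone wanting to reproduce the user-versus-system split described above, here is a minimal sketch in Python. It assumes a Unix system; `time_command` is a hypothetical helper, and the trivial command below is only a stand-in for the clone script from Andy Isaacson, which is not reproduced here.

```python
import resource
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return (wall, user, system) seconds.

    User and system times come from the rusage counters for waited-for
    children, so they cover the command and any descendants it waited on.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(argv, check=True)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return wall, user, system


if __name__ == "__main__":
    # Stand-in workload (assumption): replace with the real clone command.
    wall, user, system = time_command([sys.executable, "-c", "pass"])
    print(f"wall={wall:.2f}s user={user:.2f}s sys={system:.2f}s")
```

Running the same command under each kernel and comparing the `sys` figures would show whether the extra time is spent in the kernel, as reported, rather than in user space.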
>
> This one isn't urgent, because I suspect it could be something
> specific to the SuSE kernel rather than a regression in Linus' tree
> - we've heard no other complaints... so just whenever you get the
> chance.
>
I may be able to do some performance testing here at work,
where I have a greater variety of (and much faster) machines to use.
Steven
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 13:28 ` Steven Cole
@ 2004-05-19 13:36 ` Chris Mason
2004-05-19 13:59 ` Steven Cole
0 siblings, 1 reply; 27+ messages in thread
From: Chris Mason @ 2004-05-19 13:36 UTC (permalink / raw)
To: Steven Cole
Cc: Nick Piggin, hugh, elenstev, linux-kernel, torvalds, support,
Wayne Scott, adi, akpm, wli, Andrea Arcangeli, lm
On Wed, 2004-05-19 at 09:28, Steven Cole wrote:
> > Steven, with all else being equal, you said you found a 2.6.3 SuSE
> > kernel to significantly outperform 2.6.6, is that right? If so can
> > you try the same test with plain 2.6.3 please? We'll go from there.
>
> Actually, it was a Mandrake kernel, 2.6.3-4mdk IIRC. Whatever is
> the default with MDK 10. One salient difference with the vendor
> kernel is that everything which can be a module is, and I wasn't
> using any modules with my kernels. BTW, I was careful to have the
> same hdparm settings during the performance testing.
>
> The performance difference was very repeatable. Using the script
> provided by Andy Isaacson, the 2.6.3-4mdk did the clone in about
> 11 minutes total, while the various current kernels took about
> 15 minutes total. The user times were the same, and the difference
> was in system time. Those numbers are from memory, the actual
> results should be in the archive.
Was this regression only reiserv3 or both v3 and ext3?
-chris
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 13:36 ` Chris Mason
@ 2004-05-19 13:59 ` Steven Cole
2004-05-19 14:03 ` Wayne Scott
2004-05-19 14:08 ` Chris Mason
0 siblings, 2 replies; 27+ messages in thread
From: Steven Cole @ 2004-05-19 13:59 UTC (permalink / raw)
To: Chris Mason
Cc: hugh, Nick Piggin, elenstev, linux-kernel, torvalds, support,
Wayne Scott, adi, akpm, wli, Andrea Arcangeli, lm
On May 19, 2004, at 7:36 AM, Chris Mason wrote:
> On Wed, 2004-05-19 at 09:28, Steven Cole wrote:
>
>>> Steven, with all else being equal, you said you found a 2.6.3 SuSE
>>> kernel to significantly outperform 2.6.6, is that right? If so can
>>> you try the same test with plain 2.6.3 please? We'll go from there.
>>
>> Actually, it was a Mandrake kernel, 2.6.3-4mdk IIRC. Whatever is
>> the default with MDK 10. One salient difference with the vendor
>> kernel is that everything which can be a module is, and I wasn't
>> using any modules with my kernels. BTW, I was careful to have the
>> same hdparm settings during the performance testing.
>>
>> The performance difference was very repeatable. Using the script
>> provided by Andy Isaacson, the 2.6.3-4mdk did the clone in about
>> 11 minutes total, while the various current kernels took about
>> 15 minutes total. The user times were the same, and the difference
>> was in system time. Those numbers are from memory, the actual
>> results should be in the archive.
>
> Was this regression only reiserv3 or both v3 and ext3?
>
> -chris
>
I went back through the archive to make sure, and since I didn't
specify where I did the timed tests, those timing tests would have
been done on my /home partition, which is reiserfs v3.
Since I was using different partitions for ext3 and reiserfs on
/dev/hda, a direct comparison between ext3 and reiserfs wouldn't
be completely fair, but a "watching the paint dry" observation
seemed to indicate that reiserfs was significantly faster for this
load. I did press my backup disk into service for this testing,
to eliminate the possibility that this was due to a finicky disk,
and I have a 3.9 G partition which I've formatted first reiserfs,
then ext3, so I could do some fair tests between reiserfs and
ext3 on that disk. But I think the results are already known;
reiserfs opens a can of whoopass for this kind of load.
Steven
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 13:59 ` Steven Cole
@ 2004-05-19 14:03 ` Wayne Scott
2004-05-19 14:08 ` Chris Mason
1 sibling, 0 replies; 27+ messages in thread
From: Wayne Scott @ 2004-05-19 14:03 UTC (permalink / raw)
To: scole
Cc: mason, hugh, nickpiggin, elenstev, linux-kernel, torvalds,
support, adi, akpm, wli, andrea, lm
From: Steven Cole <scole@lanl.gov>
> But I think the results are already known;
> reiserfs opens a can of whoopass for this kind of load.
I can confirm that bk really likes reiserfs.
-Wayne
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 13:59 ` Steven Cole
2004-05-19 14:03 ` Wayne Scott
@ 2004-05-19 14:08 ` Chris Mason
2004-05-19 14:20 ` Steven Cole
2004-05-19 14:45 ` Steven Cole
1 sibling, 2 replies; 27+ messages in thread
From: Chris Mason @ 2004-05-19 14:08 UTC (permalink / raw)
To: Steven Cole
Cc: hugh, Nick Piggin, elenstev, linux-kernel, torvalds, support,
Wayne Scott, adi, akpm, wli, Andrea Arcangeli, lm
On Wed, 2004-05-19 at 09:59, Steven Cole wrote:
> I went back through the archive to make sure, and since I didn't
> specify where I did the timed tests, those timing tests would have
> been done on my /home partition, which is reiserfs v3.
>
> Since I was using different partitions for ext3 and reiserfs on
> /dev/hda, a direct comparison between ext3 and reiserfs wouldn't
> be completely fair, but a "watching the paint dry" observation
> seemed to indicate that reiserfs was significantly faster for this
> load. I did press my backup disk into service for this testing,
> to eliminate the possibility that this was due to a finicky disk,
> and I have a 3.9 G partition which I've formatted first reiserfs,
> then ext3, so I could do some fair tests between reiserfs and
> ext3 on that disk. But I think the results are already known;
> reiserfs opens a can of whoopass for this kind of load.
While this is the kind of thing I like to hear, it wasn't really what I
was asking ;-)
There was a regression between a 2.6.3 mandrake kernel and 2.6.6, was
this regression just for reiserfs or was it for all filesystems?
If just reiserfs, it might be from the data=ordered and logging changes
that went into 2.6.6, so I'm quite interested in figuring things out.
-chris
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 14:08 ` Chris Mason
@ 2004-05-19 14:20 ` Steven Cole
2004-05-19 14:45 ` Steven Cole
1 sibling, 0 replies; 27+ messages in thread
From: Steven Cole @ 2004-05-19 14:20 UTC (permalink / raw)
To: Chris Mason
Cc: hugh, Nick Piggin, elenstev, linux-kernel, torvalds, support,
Wayne Scott, adi, akpm, wli, Andrea Arcangeli, lm
On May 19, 2004, at 8:08 AM, Chris Mason wrote:
> On Wed, 2004-05-19 at 09:59, Steven Cole wrote:
>
>> I went back through the archive to make sure, and since I didn't
>> specify where I did the timed tests, those timing tests would have
>> been done on my /home partition, which is reiserfs v3.
>>
>> Since I was using different partitions for ext3 and reiserfs on
>> /dev/hda, a direct comparison between ext3 and reiserfs wouldn't
>> be completely fair, but a "watching the paint dry" observation
>> seemed to indicate that reiserfs was significantly faster for this
>> load. I did press my backup disk into service for this testing,
>> to eliminate the possibility that this was due to a finicky disk,
>> and I have a 3.9 G partition which I've formatted first reiserfs,
>> then ext3, so I could do some fair tests between reiserfs and
>> ext3 on that disk. But I think the results are already known;
>> reiserfs opens a can of whoopass for this kind of load.
>
> While this is the kind of thing I like to hear, it wasn't really what I
> was asking ;-)
>
> There was a regression between a 2.6.3 mandrake kernel and 2.6.6, was
> this regression just for reiserfs or was it for all filesystems?
>
> If just reiserfs, it might be from the data=ordered and logging changes
> that went into 2.6.6, so I'm quite interested in figuring things out.
>
> -chris
Sorry for distracting you with the second paragraph, but to reemphasize
the first paragraph, I did the timing comparisons between the
various kernel versions on reiserfs v3 _only_, and the /home reiserfs
wasn't mounted with anything special. I don't have the /etc/fstab
right here in front of me, but I can get that later if needed. The
.config file was posted in the very first post of this thread, and
the only deviations from that were to add some DEBUG config options
suggested by Andrew or Linus and I dropped PREEMPT (before your patch)
to keep the thing from going splaat.
The kernel versions I tested were 2.6.6-current a few days ago,
2.6.3-4mdk, and 2.6.5-aa5.
Steven
* Re: 1352 NUL bytes at the end of a page?
2004-05-19 14:08 ` Chris Mason
2004-05-19 14:20 ` Steven Cole
@ 2004-05-19 14:45 ` Steven Cole
1 sibling, 0 replies; 27+ messages in thread
From: Steven Cole @ 2004-05-19 14:45 UTC (permalink / raw)
To: Chris Mason
Cc: hugh, Nick Piggin, elenstev, linux-kernel, torvalds, support,
Wayne Scott, adi, akpm, wli, Andrea Arcangeli, lm
On May 19, 2004, at 8:08 AM, Chris Mason wrote:
> On Wed, 2004-05-19 at 09:59, Steven Cole wrote:
>
>> I went back through the archive to make sure, and since I didn't
>> specify where I did the timed tests, those timing tests would have
>> been done on my /home partition, which is reiserfs v3.
>>
>> Since I was using different partitions for ext3 and reiserfs on
>> /dev/hda, a direct comparison between ext3 and reiserfs wouldn't
>> be completely fair, but a "watching the paint dry" observation
>> seemed to indicate that reiserfs was significantly faster for this
>> load. I did press my backup disk into service for this testing,
>> to eliminate the possibility that this was due to a finicky disk,
>> and I have a 3.9 G partition which I've formatted first reiserfs,
>> then ext3, so I could do some fair tests between reiserfs and
>> ext3 on that disk. But I think the results are already known;
>> reiserfs opens a can of whoopass for this kind of load.
>
> While this is the kind of thing I like to hear, it wasn't really what I
> was asking ;-)
>
> There was a regression between a 2.6.3 mandrake kernel and 2.6.6, was
> this regression just for reiserfs or was it for all filesystems?
>
> If just reiserfs, it might be from the data=ordered and logging changes
> that went into 2.6.6, so I'm quite interested in figuring things out.
>
> -chris
2nd reply:
It was just reiserfs, and I'm preparing to repeat some of those timing
tests on one of my test boxes at work, a dual P-III with IDE in this
case. I can test both reiserfs v3 and ext3 and 2.6.3-4mdk (smp version)
versus other kernel versions and whatever reiserfs mount options you
want to try. I have SCSI boxes too if needed.
Steven
end of thread, other threads:[~2004-05-19 14:46 UTC | newest]
Thread overview: 27+ messages
2004-05-13 19:08 1352 NUL bytes at the end of a page? Andy Isaacson
2004-05-14 2:22 ` Andrew Morton
-- strict thread matches above, loose matches on Subject: below --
2004-05-14 4:32 Steven Cole
[not found] <200405131723.15752.elenstev@mesatop.com>
2004-05-14 16:53 ` 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Andy Isaacson
2004-05-15 0:54 ` Steven Cole
2004-05-15 1:55 ` 1352 NUL bytes at the end of a page? Wayne Scott
2004-05-17 3:36 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Steven Cole
2004-05-17 5:17 ` Linus Torvalds
2004-05-17 6:11 ` Andrew Morton
2004-05-17 13:56 ` 1352 NUL bytes at the end of a page? Wayne Scott
2004-05-17 15:17 ` Theodore Ts'o
2004-05-17 15:20 ` Larry McVoy
2004-05-17 15:22 ` Linus Torvalds
2004-05-17 15:25 ` Larry McVoy
2004-05-17 15:37 ` viro
2004-05-17 17:30 ` Steven Cole
2004-05-17 17:40 ` viro
2004-05-17 17:39 ` Steven Cole
2004-05-17 19:06 ` viro
2004-05-17 15:40 ` Arjan van de Ven
2004-05-17 15:53 ` Steven Cole
2004-05-17 16:23 ` Davide Libenzi
2004-05-17 16:28 ` Davide Libenzi
2004-05-18 14:38 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Linus Torvalds
2004-05-19 10:53 ` Steven Cole
2004-05-19 12:10 ` Chris Mason
2004-05-19 12:20 ` 1352 NUL bytes at the end of a page? Wayne Scott
2004-05-19 12:42 ` Nick Piggin
2004-05-19 13:28 ` Steven Cole
2004-05-19 13:36 ` Chris Mason
2004-05-19 13:59 ` Steven Cole
2004-05-19 14:03 ` Wayne Scott
2004-05-19 14:08 ` Chris Mason
2004-05-19 14:20 ` Steven Cole
2004-05-19 14:45 ` Steven Cole