* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
[not found] ` <200405131723.15752.elenstev@mesatop.com>
@ 2004-05-14 16:53 ` Andy Isaacson
2004-05-14 17:23 ` Steven Cole
` (2 more replies)
0 siblings, 3 replies; 68+ messages in thread
From: Andy Isaacson @ 2004-05-14 16:53 UTC (permalink / raw)
To: Steven Cole; +Cc: Steven Cole, support, Andrew Morton, torvalds, Linux Kernel
Apologies for the enormous quote, but I wanted to get the lspci and
dmesg in here, in case someone else has some insight.
On Thu, May 13, 2004 at 05:23:15PM -0600, Steven Cole wrote:
> [steven@spc steven]$ lspcidrake
> intel-agp : Intel Corporation|440BX/ZX - 82443BX/ZX Host bridge [BRIDGE_HOST]
> unknown : Intel Corporation|440BX/ZX - 82443BX/ZX AGP bridge [BRIDGE_PCI]
> unknown : Intel Corporation|82371AB PIIX4 ISA [BRIDGE_ISA]
> unknown : Intel Corporation|82371AB PIIX4 IDE [STORAGE_IDE]
> usb-uhci : Intel Corporation|82371AB PIIX4 USB [SERIAL_USB]
> sonypi : Intel Corporation|82371AB PIIX4 ACPI - Bus Master IDE Controller [BRIDGE_OTHER]
> es1371 : Creative Labs|Sound Blaster AudioPCI64V/AudioPCI128 [MULTIMEDIA_AUDIO]
> 3c59x : 3Com Corporation|3c905B 100BaseTX [Cyclone] [NETWORK_ETHERNET]
> unknown : Promise Technology, Inc.|20262 (Ultra66) [STORAGE_OTHER]
> Card:RIVA TNT : nVidia Corporation|Riva TNT 128 [DISPLAY_VGA]
[snip]
> Linux version 2.6.6 (steven@spc.mesatop.com) (gcc version 3.3.2 (Mandrake Linux 10.0 3.3.2-6mdk)) #105 Sun May 9 22:00:07 MDT 2004
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
> BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e7000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000040fd800 (usable)
> BIOS-e820: 00000000040fd800 - 00000000040ff800 (ACPI data)
> BIOS-e820: 00000000040ff800 - 00000000040ffc00 (ACPI NVS)
> BIOS-e820: 00000000040ffc00 - 0000000018000000 (usable)
> BIOS-e820: 00000000fffe7000 - 0000000100000000 (reserved)
> 384MB LOWMEM available.
> On node 0 totalpages: 98304
> DMA zone: 4096 pages, LIFO batch:1
> Normal zone: 94208 pages, LIFO batch:16
> HighMem zone: 0 pages, LIFO batch:1
> DMI 2.1 present.
> ACPI disabled because your bios is from 1999 and too old
> You can enable it with acpi=force
> Built 1 zonelists
> Kernel command line: auto BOOT_IMAGE=2.6-bk ro root=306 devfs=nomount acpi=ht resume=/dev/hda10 splash=silent
> Initializing CPU#0
> PID hash table entries: 2048 (order 11: 16384 bytes)
> Detected 448.795 MHz processor.
> Using tsc for high-res timesource
> Console: colour VGA+ 80x25
> Memory: 386260k/393216k available (1999k kernel code, 6184k reserved, 548k data, 316k init, 0k highmem)
> Checking if this processor honours the WP bit even in supervisor mode... Ok.
> Calibrating delay loop... 886.78 BogoMIPS
> Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
> Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
> Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> CPU: After generic identify, caps: 0383f9ff 00000000 00000000 00000000
> CPU: After vendor identify, caps: 0383f9ff 00000000 00000000 00000000
> CPU: L1 I cache: 16K, L1 D cache: 16K
> CPU: L2 cache: 512K
> CPU: After all inits, caps: 0383f9ff 00000000 00000000 00000040
> CPU: Intel Pentium III (Katmai) stepping 02
> Enabling fast FPU save and restore... done.
> Enabling unmasked SIMD FPU exception support... done.
> Checking 'hlt' instruction... OK.
> POSIX conformance testing by UNIFIX
> NET: Registered protocol family 16
> PCI: PCI BIOS revision 2.10 entry at 0xfd983, last bus=1
> PCI: Using configuration type 1
> Linux Plug and Play Support v0.97 (c) Adam Belay
> usbcore: registered new driver usbfs
> usbcore: registered new driver hub
> PCI: Probing PCI hardware
> PCI: Probing PCI hardware (bus 00)
> PCI: Using IRQ router PIIX/ICH [8086/7110] at 0000:00:07.0
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
> devfs: 2004-01-31 Richard Gooch (rgooch@atnf.csiro.au)
> devfs: boot_options: 0x0
> NTFS driver 2.1.8 [Flags: R/O].
> Limiting direct PCI/PCI transfers.
> isapnp: Scanning for PnP cards...
> isapnp: Card 'U.S. Robotics 56K FAX INT'
> isapnp: 1 Plug & Play card detected total
> lp: driver loaded but no devices found
> Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled
> ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> pnp: Device 00:01.00 activated.
> ttyS1 at I/O 0x2f8 (irq = 10) is a 16550A
> parport0: PC-style at 0x378 (0x778) [PCSPP(,...)]
> parport0: irq 7 detected
> lp0: using parport0 (polling).
> Using anticipatory io scheduler
> Floppy drive(s): fd0 is 1.44M
> FDC 0 is a post-1991 82077
> loop: loaded (max 8 devices)
> PPP generic driver version 2.4.2
> PPP Deflate Compression module registered
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> PDC20262: IDE controller at PCI slot 0000:00:0f.0
> PCI: Found IRQ 5 for device 0000:00:0f.0
> PDC20262: chipset revision 1
> PDC20262: 100% native mode on irq 5
> PDC20262: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
> ide0: BM-DMA at 0x10c0-0x10c7, BIOS settings: hda:DMA, hdb:DMA
> ide1: BM-DMA at 0x10c8-0x10cf, BIOS settings: hdc:DMA, hdd:pio
> hda: Maxtor 5T040H4, ATA DISK drive
> hdb: ST317221A, ATA DISK drive
> ide0 at 0x1440-0x1447,0x1436 on irq 5
> hda: max request size: 128KiB
> hda: Host Protected Area detected.
> current capacity is 78125000 sectors (40000 MB)
> native capacity is 80043264 sectors (40982 MB)
> hda: 78125000 sectors (40000 MB) w/2048KiB Cache, CHS=65535/16/63
> /dev/ide/host0/bus0/target0/lun0: p1 p2 < p5 p6 p7 p8 p9 p10 >
> hdb: max request size: 128KiB
> hdb: 33683328 sectors (17245 MB) w/512KiB Cache, CHS=33416/16/63
> /dev/ide/host0/bus0/target1/lun0: p1 p2 < p5 p6 p7 p8 p9 >
> ide-floppy driver 0.99.newide
> mice: PS/2 mouse device common for all mice
> input: PC Speaker
> serio: i8042 AUX port at 0x60,0x64 irq 12
> input: ImPS/2 Generic Wheel Mouse on isa0060/serio1
> serio: i8042 KBD port at 0x60,0x64 irq 1
> input: AT Translated Set 2 keyboard on isa0060/serio0
> Advanced Linux Sound Architecture Driver Version 1.0.4rc2 (Tue Mar 30 08:19:30 2004 UTC).
> PCI: Found IRQ 11 for device 0000:00:0c.0
> PCI: Sharing IRQ 11 with 0000:00:0e.0
> PCI: Sharing IRQ 11 with 0000:01:00.0
> ALSA device list:
> #0: Ensoniq AudioPCI ENS1371 at 0x1080, irq 11
> NET: Registered protocol family 2
> IP: routing cache hash table of 4096 buckets, 32Kbytes
> TCP: Hash tables configured (established 32768 bind 65536)
> NET: Registered protocol family 1
> NET: Registered protocol family 17
> kjournald starting. Commit interval 5 seconds
> EXT3-fs: mounted filesystem with ordered data mode.
> VFS: Mounted root (ext3 filesystem) readonly.
> Freeing unused kernel memory: 316k freed
> EXT3 FS on hda6, internal journal
> Adding 818960k swap on /dev/hda10. Priority:-1 extents:1
> Adding 248968k swap on /dev/hdb5. Priority:-2 extents:1
> found reiserfs format "3.6" with standard journal
> reiserfs: using ordered data mode
> Reiserfs journal params: device hda9, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
> reiserfs: checking transaction log (hda9) for (hda9)
> Using r5 hash to sort names
> NTFS volume version 3.1.
> kjournald starting. Commit interval 5 seconds
> EXT3 FS on hda7, internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> kjournald starting. Commit interval 5 seconds
> EXT3 FS on hda8, internal journal
[snip]
> [steven@spc testing-2.6]$ bk changes -r+ -nd:KEY:
> geert@linux-m68k.org[torvalds]|ChangeSet|20040511145430|25087
Here's "grep = .config":
> CONFIG_X86=y
> CONFIG_MMU=y
> CONFIG_UID16=y
> CONFIG_GENERIC_ISA_DMA=y
> CONFIG_EXPERIMENTAL=y
> CONFIG_CLEAN_COMPILE=y
> CONFIG_BROKEN_ON_SMP=y
> CONFIG_SWAP=y
> CONFIG_SYSVIPC=y
> CONFIG_SYSCTL=y
> CONFIG_LOG_BUF_SHIFT=14
> CONFIG_KALLSYMS=y
> CONFIG_FUTEX=y
> CONFIG_EPOLL=y
> CONFIG_IOSCHED_NOOP=y
> CONFIG_IOSCHED_AS=y
> CONFIG_IOSCHED_DEADLINE=y
> CONFIG_IOSCHED_CFQ=y
> CONFIG_X86_PC=y
> CONFIG_MPENTIUMIII=y
> CONFIG_X86_CMPXCHG=y
> CONFIG_X86_XADD=y
> CONFIG_X86_L1_CACHE_SHIFT=5
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y
> CONFIG_X86_WP_WORKS_OK=y
> CONFIG_X86_INVLPG=y
> CONFIG_X86_BSWAP=y
> CONFIG_X86_POPAD_OK=y
> CONFIG_X86_GOOD_APIC=y
> CONFIG_X86_INTEL_USERCOPY=y
> CONFIG_X86_USE_PPRO_CHECKSUM=y
> CONFIG_PREEMPT=y
> CONFIG_X86_TSC=y
> CONFIG_NOHIGHMEM=y
> CONFIG_HAVE_DEC_LOCK=y
> CONFIG_REGPARM=y
> CONFIG_ACPI_BOOT=y
> CONFIG_PCI=y
> CONFIG_PCI_GOANY=y
> CONFIG_PCI_BIOS=y
> CONFIG_PCI_DIRECT=y
> CONFIG_PCI_MMCONFIG=y
> CONFIG_PCI_NAMES=y
> CONFIG_ISA=y
> CONFIG_BINFMT_ELF=y
> CONFIG_BINFMT_AOUT=y
> CONFIG_BINFMT_MISC=y
> CONFIG_PARPORT=y
> CONFIG_PARPORT_PC=y
> CONFIG_PARPORT_PC_CML1=y
> CONFIG_PARPORT_PC_FIFO=y
> CONFIG_PNP=y
> CONFIG_ISAPNP=y
> CONFIG_BLK_DEV_FD=y
> CONFIG_BLK_DEV_LOOP=y
> CONFIG_LBD=y
> CONFIG_IDE=y
> CONFIG_BLK_DEV_IDE=y
> CONFIG_BLK_DEV_IDEDISK=y
> CONFIG_BLK_DEV_IDECD=y
> CONFIG_BLK_DEV_IDEFLOPPY=y
> CONFIG_IDE_GENERIC=y
> CONFIG_BLK_DEV_IDEPCI=y
> CONFIG_IDEPCI_SHARE_IRQ=y
> CONFIG_BLK_DEV_OFFBOARD=y
> CONFIG_BLK_DEV_IDEDMA_PCI=y
> CONFIG_BLK_DEV_ADMA=y
> CONFIG_BLK_DEV_PDC202XX_OLD=y
> CONFIG_BLK_DEV_IDEDMA=y
> CONFIG_NET=y
> CONFIG_PACKET=y
> CONFIG_UNIX=y
> CONFIG_INET=y
> CONFIG_SYN_COOKIES=y
> CONFIG_NETDEVICES=y
> CONFIG_PPP=y
> CONFIG_PPP_ASYNC=y
> CONFIG_PPP_DEFLATE=y
> CONFIG_INPUT=y
> CONFIG_INPUT_MOUSEDEV=y
> CONFIG_INPUT_MOUSEDEV_PSAUX=y
> CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
> CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
> CONFIG_SOUND_GAMEPORT=y
> CONFIG_SERIO=y
> CONFIG_SERIO_I8042=y
> CONFIG_INPUT_KEYBOARD=y
> CONFIG_KEYBOARD_ATKBD=y
> CONFIG_INPUT_MOUSE=y
> CONFIG_MOUSE_PS2=y
> CONFIG_INPUT_MISC=y
> CONFIG_INPUT_PCSPKR=y
> CONFIG_VT=y
> CONFIG_VT_CONSOLE=y
> CONFIG_HW_CONSOLE=y
> CONFIG_SERIAL_8250=y
> CONFIG_SERIAL_8250_NR_UARTS=4
> CONFIG_SERIAL_CORE=y
> CONFIG_UNIX98_PTYS=y
> CONFIG_PRINTER=y
> CONFIG_DRM=y
> CONFIG_RAW_DRIVER=y
> CONFIG_MAX_RAW_DEVS=256
> CONFIG_FB=y
> CONFIG_VIDEO_SELECT=y
> CONFIG_VGA_CONSOLE=y
> CONFIG_DUMMY_CONSOLE=y
> CONFIG_SOUND=y
> CONFIG_SND=y
> CONFIG_SND_TIMER=y
> CONFIG_SND_PCM=y
> CONFIG_SND_RAWMIDI=y
> CONFIG_SND_SEQUENCER=y
> CONFIG_SND_OSSEMUL=y
> CONFIG_SND_MIXER_OSS=y
> CONFIG_SND_PCM_OSS=y
> CONFIG_SND_SEQUENCER_OSS=y
> CONFIG_SND_AC97_CODEC=y
> CONFIG_SND_ENS1371=y
> CONFIG_USB=y
> CONFIG_USB_DEVICEFS=y
> CONFIG_EXT2_FS=y
> CONFIG_EXT3_FS=y
> CONFIG_JBD=y
> CONFIG_REISERFS_FS=y
> CONFIG_QUOTA=y
> CONFIG_QUOTACTL=y
> CONFIG_AUTOFS_FS=y
> CONFIG_AUTOFS4_FS=y
> CONFIG_ISO9660_FS=y
> CONFIG_FAT_FS=y
> CONFIG_MSDOS_FS=y
> CONFIG_VFAT_FS=y
> CONFIG_NTFS_FS=y
> CONFIG_PROC_FS=y
> CONFIG_PROC_KCORE=y
> CONFIG_SYSFS=y
> CONFIG_DEVFS_FS=y
> CONFIG_RAMFS=y
> CONFIG_MSDOS_PARTITION=y
> CONFIG_NLS=y
> CONFIG_NLS_DEFAULT="iso8859-1"
> CONFIG_NLS_CODEPAGE_850=y
> CONFIG_NLS_ISO8859_1=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_EARLY_PRINTK=y
> CONFIG_MAGIC_SYSRQ=y
> CONFIG_ZLIB_INFLATE=y
> CONFIG_ZLIB_DEFLATE=y
> CONFIG_X86_BIOS_REBOOT=y
> CONFIG_X86_STD_RESOURCES=y
> CONFIG_PC=y
So, in the oddball config department, you've got an ISAPnP modem over
which you're running PPP; CONFIG_PREEMPT is on.
That corruption size really does make me think of network packets, so
I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
link? "ifconfig ppp0" or something like that.
Can you try doing something like
#!/bin/sh
x=0
while true; do
    bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
    (cd foo; bk pull -q)
    rm -rf foo
    x=`expr $x + 1`
    echo -n "$x "
done
(I just pulled that key at random out of the kernel repository; there's
nothing special about it except that it's far enough back for the revert
and pull to be very involved operations.)
That ought to do a nice test of the CPU, memory, disk, and kernel sans
PPP. If that loop runs for, say, 10 iterations without errors, keep it
running and try doing some non-BK network IO for a half hour (or two
iterations of the clone/pull loop, whichever is longer) and see if it
fails. You might want to increase the runtimes, say, overnight and two
hours of network activity, if you don't see any failures.
This test is designed to check the theory that in your config, PPP
somehow corrupts random buffer cache pages.
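A variant of the loop above that stops at the first failing step, so the
broken clone stays on disk for inspection instead of being removed, might
look like the following. This is only a sketch: the helper names
run_until_failure and bk_steps are made up for illustration, and the bk
arguments are copied verbatim from the shell script.

```python
import subprocess
import sys


def run_until_failure(steps, cleanup, max_iterations):
    """Run every (argv, cwd) step per iteration; return the failing iteration or None."""
    for iteration in range(1, max_iterations + 1):
        for argv, cwd in steps:
            if subprocess.run(argv, cwd=cwd).returncode != 0:
                # Stop here and skip cleanup so the damaged tree can be inspected.
                return iteration
        cleanup()
        sys.stdout.write(f"{iteration} ")
        sys.stdout.flush()
    return None


def bk_steps(key, source="testing-2.6", clone="foo"):
    """Build the clone and pull commands, with arguments as in the shell loop."""
    return [
        (["bk", "clone", "-qlr" + key, source, clone], None),
        (["bk", "pull", "-q"], clone),
    ]
```

It would be called with bk_steps("40514130hBbvgP4CvwEVEu27oxm46w") and a
cleanup callable that removes foo (for example via shutil.rmtree).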
On Fri, May 14, 2004 at 07:46:17AM -0700, Larry McVoy wrote:
> My instinct is that this is a file system or VM problem. Here's why: BK
> wraps its data in multiple integrity checks. When you are doing a pull,
> the data that is sent across the wire is wrapped both at the individual
> diff level (each delta has a check) as well as a CRC around the whole
> set of diffs and metadata sent. Since Steven is pulling (I believe,
> please confirm) from bkbits.net, we know that the data being generated
> is fine - if it wasn't the whole world would be on our case.
>
> On the receiving side BK splats the entire set of diffs and metadata
> down on disk, checking the CRC, and it doesn't proceed to the apply patch
> part until the entire thing is on disk and checked. Then when the patches
> are applied, each per patch checksum is verified (except, as we recently
> found out, in the case of the changeset file, we added some fast path
> optimization for that and dropped the check on the floor. Oops).
>
> I don't think pppd can be part of the problem because of the way BK is
> designed - you shouldn't have gotten to the place you did if the data
> was corrupted in transit.
I agree, I don't see how it could be an in-flight corruption.
> If any of pppd/kernel stuff is corrupting in memory pages, that's a
> different matter entirely, that could cause these problems.
This is my current top suspect. Well, that, or rotted hardware with the
most bizarre symptoms I've ever seen. A 16550, or ISA DMA controller,
that just happens to stomp on buffer cache pages?
> The fact that Steven is the only guy seeing this really makes me
> suspicious that it is something with his kernel. I don't think this
> is a memory error problem, those never look like this, they look like
> a few bits being flipped. Blocks of nulls are always file/vm system.
Yeah, I'm sure it's some function of his hardware and config. But
really, how many people do you suppose are running 2.6 with PPP and
PREEMPT? And how many of them would notice a few pages per day of
partial buffer cache trashing? I had a machine with one byte of memory
that gave you "x" 50% of the time, and "x & 0xdf" the other 50% of the
time; it took several months and a fairly serious filesystem blowup
before I noticed enough problems to go run memtest86.
-andy
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-14 16:53 ` 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Andy Isaacson
@ 2004-05-14 17:23 ` Steven Cole
2004-05-15 0:54 ` Steven Cole
2004-05-15 3:15 ` Lincoln Dale
2 siblings, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-14 17:23 UTC (permalink / raw)
To: Andy Isaacson; +Cc: Linux Kernel, Steven Cole, torvalds, support, Andrew Morton
On May 14, 2004, at 10:53 AM, Andy Isaacson wrote:
> Apologies for the enormous quote, but I wanted to get the lspci and
> dmesg in here, in case someone else has some insight.
>
[enormous quote snipped]
>
> So, in the oddball config department, you've got an ISAPnP modem over
> which you're running PPP; CONFIG_PREEMPT is on.
>
> That corruption size really does make me think of network packets, so
> I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
> link? "ifconfig ppp0" or something like that.
>
> Can you try doing something like
>
> #!/bin/sh
> x=0
> while true; do
>     bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
>     (cd foo; bk pull -q)
>     rm -rf foo
>     x=`expr $x + 1`
>     echo -n "$x "
> done
>
Yes, I'll do those tests tonight.
> On Fri, May 14, 2004 at 07:46:17AM -0700, Larry McVoy wrote:
>> My instinct is that this is a file system or VM problem. Here's why: BK
>> wraps its data in multiple integrity checks. When you are doing a pull,
>> the data that is sent across the wire is wrapped both at the individual
>> diff level (each delta has a check) as well as a CRC around the whole
>> set of diffs and metadata sent. Since Steven is pulling (I believe,
>> please confirm) from bkbits.net, we know that the data being generated
>> is fine - if it wasn't the whole world would be on our case.
>>
>> On the receiving side BK splats the entire set of diffs and metadata
>> down on disk, checking the CRC, and it doesn't proceed to the apply patch
>> part until the entire thing is on disk and checked. Then when the patches
>> are applied, each per patch checksum is verified (except, as we recently
>> found out, in the case of the changeset file, we added some fast path
>> optimization for that and dropped the check on the floor. Oops).
>>
>> I don't think pppd can be part of the problem because of the way BK is
>> designed - you shouldn't have gotten to the place you did if the data
>> was corrupted in transit.
>
> I agree, I don't see how it could be an in-flight corruption.
>
>> If any of pppd/kernel stuff is corrupting in memory pages, that's a
>> different matter entirely, that could cause these problems.
>
> This is my current top suspect. Well, that, or rotted hardware with the
> most bizarre symptoms I've ever seen. A 16550, or ISA DMA controller,
> that just happens to stomp on buffer cache pages?
>
>> The fact that Steven is the only guy seeing this really makes me
>> suspicious that it is something with his kernel. I don't think this
>> is a memory error problem, those never look like this, they look like
>> a few bits being flipped. Blocks of nulls are always file/vm system.
>
> Yeah, I'm sure it's some function of his hardware and config. But
> really, how many people do you suppose are running 2.6 with PPP and
> PREEMPT? And how many of them would notice a few pages per day of
> partial buffer cache trashing? I had a machine with one byte of memory
> that gave you "x" 50% of the time, and "x & 0xdf" the other 50% of the
> time; it took several months and a fairly serious filesystem blowup
> before I noticed enough problems to go run memtest86.
>
> -andy
And in case anyone is wondering, yes I did run memtest86 all night
on that machine with zero errors for all tests.
Question for the list. Have the few folks out there still using dialup
ever seen the following problem with pppd and 2.6.x?
Occasionally, perhaps once every couple of hours of dialup connection time,
the data flow from pppd comes to a halt, and messages reporting
"too much work for irq10" appear in dmesg at that time. Using the kppp 2.4.2
GUI to exit leaves pppd still running. Killing pppd works, and pppd can be
restarted normally. These failures never occurred with 2.4.x.
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-14 16:53 ` 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Andy Isaacson
2004-05-14 17:23 ` Steven Cole
@ 2004-05-15 0:54 ` Steven Cole
2004-05-15 3:15 ` Lincoln Dale
2 siblings, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-15 0:54 UTC (permalink / raw)
To: Andy Isaacson; +Cc: Steven Cole, support, Andrew Morton, torvalds, Linux Kernel
On Friday 14 May 2004 10:53 am, Andy Isaacson wrote:
> So, in the oddball config department, you've got an ISAPnP modem over
> which you're running PPP; CONFIG_PREEMPT is on.
>
> That corruption size really does make me think of network packets, so
> I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
> link? "ifconfig ppp0" or something like that.
ppp0 Link encap:Point-to-Point Protocol
inet addr:216.31.65.245 P-t-P:216.31.65.1 Mask:255.255.255.255
UP POINTOPOINT RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:123 errors:0 dropped:0 overruns:0 frame:0
TX packets:152 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:3
RX bytes:77312 (75.5 Kb) TX bytes:8212 (8.0 Kb)
>
> Can you try doing something like
>
> #!/bin/sh
> x=0
> while true; do
>     bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
>     (cd foo; bk pull -q)
>     rm -rf foo
>     x=`expr $x + 1`
>     echo -n "$x "
> done
>
> (I just pulled that key at random out of the kernel repository; there's
> nothing special about it except that it's far enough back for the revert
> and pull to be very involved operations.)
>
> That ought to do a nice test of the CPU, memory, disk, and kernel sans
> PPP. If that loop runs for, say, 10 iterations without errors, keep it
> running and try doing some non-BK network IO for a half hour (or two
> iterations of the clone/pull loop, whichever is longer) and see if it
> fails. You might want to increase the runtimes, say, overnight and two
> hours of network activity, if you don't see any failures.
>
> This test is designed to check the theory that in your config, PPP
> somehow corrupts random buffer cache pages.
It didn't need PPP to fail.
It looks like it failed on the 7th iteration of the script supplied by Andy.
[snipped list of files]
sound/core/SCCS/s.Kconfig
Your repository should be back to where it was before undo started
We are running a consistency check to verify this.
check passed
Undo failed, repository left locked.
WARNING: deleting orphan file /home/steven/tmp/bk_clone2_mVmrsk
Entire repository is locked by:
RESYNC directory.
ERROR-Unable to lock repository for update.
6
[steven@spc BK]$ ls -ls foo/RESYNC/SCCS/*
40048 -r--r--r-- 1 steven steven 41007273 May 14 18:08 foo/RESYNC/SCCS/s.ChangeSet
68 -r--r--r-- 1 steven steven 67791 May 14 18:11 foo/RESYNC/SCCS/s.CREDITS
76 -r--r--r-- 1 steven steven 75264 May 14 18:11 foo/RESYNC/SCCS/s.MAINTAINERS
124 -r--r--r-- 1 steven steven 124747 May 14 18:11 foo/RESYNC/SCCS/s.Makefile
Let me know if you want any of these files. I can compress them and send them
the usual way.
The kernel was 2.6.6 plus whatever is in Linus' tree, and bk was 3.0.4.
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-14 16:53 ` 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.) Andy Isaacson
2004-05-14 17:23 ` Steven Cole
2004-05-15 0:54 ` Steven Cole
@ 2004-05-15 3:15 ` Lincoln Dale
2004-05-15 3:41 ` Andrew Morton
2 siblings, 1 reply; 68+ messages in thread
From: Lincoln Dale @ 2004-05-15 3:15 UTC (permalink / raw)
To: Andy Isaacson
Cc: Steven Cole, Steven Cole, support, Andrew Morton, torvalds,
Linux Kernel
At 02:53 AM 15/05/2004, Andy Isaacson wrote:
>That corruption size really does make me think of network packets, so
>I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
>link? "ifconfig ppp0" or something like that.
1352 bytes could be remarkably close to the TCP MSS . . .
perhaps there is some interaction with ppp where there is an overrun / lost
packet and the TCP window is mistakenly advanced?
i.e.
- 1500 byte MTU
- less 28 bytes for PPP header (1472 bytes)
- less 20 bytes for IP header (1452 bytes)
- less 20 bytes for TCP header (1432 bytes)
if, however, the MRU is actually negotiated to be 1420 rather than 1500 . . .
when you issue the "bk pull", it may be interesting to see the output from:
tcpdump -i ppp0 -n | grep mss
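As a cross-check on the arithmetic, here is a small sketch of the usual
MTU/MSS relationship. It assumes option-free IPv4 and TCP headers of 20
bytes each and that the interface MTU already excludes PPP framing; both
are assumptions, and the function names are invented for the example.
Under those assumptions an MSS of 1352 would correspond to an MTU of 1392,
while the MTU of 1500 shown by ifconfig would give an MSS of 1460.

```python
IP_HEADER = 20   # IPv4 header without options (assumption)
TCP_HEADER = 20  # TCP header without options (assumption)


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def mtu_for_mss(mss):
    """MTU that would produce the given MSS under the same assumptions."""
    return mss + IP_HEADER + TCP_HEADER


print(mss_for_mtu(1500), mtu_for_mss(1352))
```

TCP options such as timestamps would shrink the usable payload further, so
the MSS values seen in the tcpdump output remain the authoritative figure.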
cheers,
lincoln.
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-15 3:15 ` Lincoln Dale
@ 2004-05-15 3:41 ` Andrew Morton
2004-05-15 5:39 ` Steven Cole
2004-05-16 1:23 ` Steven Cole
0 siblings, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-15 3:41 UTC (permalink / raw)
To: Lincoln Dale; +Cc: adi, elenstev, scole, support, torvalds, linux-kernel
Lincoln Dale <ltd@cisco.com> wrote:
>
> At 02:53 AM 15/05/2004, Andy Isaacson wrote:
> >That corruption size really does make me think of network packets, so
> >I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
> >link? "ifconfig ppp0" or something like that.
>
> 1352 bytes could be remarkably close to the TCP MSS . . .
> perhaps there is some interaction with ppp where there is an overrun / lost
> packet and the TCP window is mistakenly advanced?
Steve, if it's a memory stomp then perhaps CONFIG_DEBUG_PAGEALLOC and
CONFIG_DEBUG_SLAB might pick it up.
It seems awfully deterministic though.
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-15 3:41 ` Andrew Morton
@ 2004-05-15 5:39 ` Steven Cole
2004-05-16 1:23 ` Steven Cole
1 sibling, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-15 5:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: Lincoln Dale, adi, scole, support, torvalds, linux-kernel
On Friday 14 May 2004 09:41 pm, Andrew Morton wrote:
> Lincoln Dale <ltd@cisco.com> wrote:
> >
> > At 02:53 AM 15/05/2004, Andy Isaacson wrote:
> > >That corruption size really does make me think of network packets, so
> > >I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
> > >link? "ifconfig ppp0" or something like that.
> >
> > 1352 bytes could be remarkably close to the TCP MSS . . .
> > perhaps there is some interaction with ppp where there is an overrun / lost
> > packet and the TCP window is mistakenly advanced?
>
> Steve, if it's a memory stomp then perhaps CONFIG_DEBUG_PAGEALLOC and
> CONFIG_DEBUG_SLAB might pick it up.
>
> It seems awfully deterministic though.
>
>
Andy asked me to do the following without ppp, which I did.
The explanation of the odd key is in his previous mail.
#!/bin/sh
x=0
while true; do
    bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
    (cd foo; bk pull -q)
    rm -rf foo
    x=`expr $x + 1`
    echo -n "$x "
done
The above caused a failure after the 7th iteration or so. This time, the
RESYNC/SCCS/s.ChangeSet file didn't have any nulls, but three other
files did, namely s.Makefile, s.CREDITS, s.MAINTAINERS. Not knowing
whether this was normal or not, I've sent those files to bitkeeper for
analysis.
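For anyone wanting to check a file for such runs of nulls, a minimal sketch
follows. It assumes a 4096-byte page (as on i386) and treats 64 consecutive
NUL bytes as the threshold for a suspicious run; both values, and the name
find_nul_runs, are choices made for this example rather than anything bk
itself does.

```python
import re

PAGE_SIZE = 4096  # assumption: i386 page size


def find_nul_runs(data, min_length=64, page_size=PAGE_SIZE):
    """Return (offset, length, ends_on_page_boundary) for each long run of NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_length)
    return [
        (m.start(), m.end() - m.start(), m.end() % page_size == 0)
        for m in pattern.finditer(data)
    ]
```

The data argument would be the raw contents of a suspect file, for example
read with open(path, "rb").read(). Offsets are relative to the start of the
file, so the boundary test is only meaningful if the file's pages start at
offset 0.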
Since I only began seeing these "Assertion `s && s->tree' failed" problems
with bk in the past month or so, and I generally run a current kernel
on this machine, I booted an older kernel, 2.6.3. I'm going to run
Andy's test overnight and see if 2.6.3 acts any differently.
In the meantime, I'll have the machine building a 2.6.6-plus kernel
with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_SLAB if its
use seems indicated in the morning.
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-15 3:41 ` Andrew Morton
2004-05-15 5:39 ` Steven Cole
@ 2004-05-16 1:23 ` Steven Cole
2004-05-16 2:18 ` Linus Torvalds
` (2 more replies)
1 sibling, 3 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-16 1:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: adi, scole, support, torvalds, linux-kernel
On Friday 14 May 2004 09:41 pm, Andrew Morton wrote:
> Lincoln Dale <ltd@cisco.com> wrote:
> >
> > At 02:53 AM 15/05/2004, Andy Isaacson wrote:
> > >That corruption size really does make me think of network packets, so
> > >I'm tempted to blame it on PPP. Can you find out the MTU of your PPP
> > >link? "ifconfig ppp0" or something like that.
> >
> > 1352 bytes could be remarkably close to the TCP MSS . . .
> > perhaps there is some interaction with ppp where there is an overrun / lost
> > packet and the TCP window is mistakenly advanced?
>
> Steve, if it's a memory stomp then perhaps CONFIG_DEBUG_PAGEALLOC and
> CONFIG_DEBUG_SLAB might pick it up.
>
> It seems awfully deterministic though.
>
>
Second reply with some interesting developments.
I ran Andy's bk exerciser script on a vendor supplied kernel (2.6.3-4mdk) for
36 iterations, with no failures at all.
I then stopped that test, updated the 2.6-tree to the state at 15:00 MDT today,
compiled with the above two DEBUG options, and rebooted with that new
kernel.
I ran Andy's script again, and it caused a failure right away. Here is a
snipped version of the log:
renumber: can't read SCCS info in "SCCS/s.ChangeSet".
include/asm-x86_64/SCCS/s.i387.h
[list of files snipped]
net/ipv4/SCCS/s.ip_output.c
Your repository should be back to where it was before undo started
We are running a consistency check to verify this.
check passed
Undo failed, repository left locked.
WARNING: deleting orphan file /home/steven/tmp/bk_clone2_0dH5v6
Entire repository is locked by:
RESYNC directory.
ERROR-Unable to lock repository for update.
1 renumber: can't read SCCS info in "RESYNC/SCCS/s.ChangeSet".
bk: takepatch.c:1343: applyCsetPatch: Assertion `s && s->tree' failed.
2 renumber: can't read SCCS info in "RESYNC/SCCS/s.ChangeSet".
bk: takepatch.c:1343: applyCsetPatch: Assertion `s && s->tree' failed.
3
The digits in column 1 above are an iteration count from the testing script.
I control-c'ed out at that point.
For reference, here is Andy's script again:
#!/bin/sh
x=0
while true; do
    bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
    (cd foo; bk pull -q)
    rm -rf foo
    x=`expr $x + 1`
    echo -n "$x "
done
The RESYNC directory in 'foo' does not contain an SCCS directory.
There were no unusual messages in dmesg.
In the spirit of 'rounding up the usual suspects', I'll unset CONFIG_PREEMPT
and try again.
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 1:23 ` Steven Cole
@ 2004-05-16 2:18 ` Linus Torvalds
2004-05-16 3:44 ` Linus Torvalds
2004-05-18 1:47 ` Benjamin Herrenschmidt
2004-05-16 3:20 ` Andrew Morton
2004-05-17 2:28 ` Larry McVoy
2 siblings, 2 replies; 68+ messages in thread
From: Linus Torvalds @ 2004-05-16 2:18 UTC (permalink / raw)
To: Steven Cole; +Cc: Andrew Morton, adi, scole, support, linux-kernel
On Sat, 15 May 2004, Steven Cole wrote:
>
> In the spirit of 'rounding up the usual suspects', I'll unset CONFIG_PREEMPT
> and try again.
Thanks. If that doesn't do it, can you start binary-searching on kernel
versions? I run with preempt myself (well, not on my current G5 desktop,
but otherwise), so it _should_ be stable, but you may have a driver or
something else that doesn't like preempt.
Or it could be any number of other config options. Do you have anything
else interesting enabled?
Linus
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 1:23 ` Steven Cole
2004-05-16 2:18 ` Linus Torvalds
@ 2004-05-16 3:20 ` Andrew Morton
2004-05-16 3:58 ` Linus Torvalds
2004-05-17 2:28 ` Larry McVoy
2 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2004-05-16 3:20 UTC (permalink / raw)
To: Steven Cole; +Cc: adi, scole, support, torvalds, linux-kernel
Steven Cole <elenstev@mesatop.com> wrote:
>
> For reference, here is Andy's script again:
> #!/bin/sh
> x=0
> while true; do
>     bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
>     (cd foo; bk pull -q)
>     rm -rf foo
>     x=`expr $x + 1`
>     echo -n "$x "
> done
Two hours so far here.
bix:/usr/src> ~/clone.sh
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
That's 2.6.6-mm2+, 2GB 4-way x86.
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 2:18 ` Linus Torvalds
@ 2004-05-16 3:44 ` Linus Torvalds
2004-05-16 4:31 ` Steven Cole
2004-05-18 1:47 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 68+ messages in thread
From: Linus Torvalds @ 2004-05-16 3:44 UTC (permalink / raw)
To: Steven Cole; +Cc: Andrew Morton, adi, scole, support, linux-kernel
On Sat, 15 May 2004, Linus Torvalds wrote:
>
>
> On Sat, 15 May 2004, Steven Cole wrote:
> >
> > In the spirit of 'rounding up the usual suspects', I'll unset CONFIG_PREEMPT
> > and try again.
>
> Or it could be any number of other config options. Do you have anything
> else interesting enabled?
Ahh, looking at an earlier email I see that you have CONFIG_REGPARM=y too.
That could easily be pretty dangerous - there have been both compiler bugs
in this area, and just kernel bugs (missing "asmlinkage" things causing
bad calling conventions and really nasty bugs).
So please try without both PREEMPT and REGPARM. Considering that it's
apparently very repeatable for you, I'd be more inclined to worry about
REGPARM than PREEMPT, but it's best to try with both disabled.
I also worry about that PDC202XX controller, but that 1352 is a strange
number (divisible by 8, but not by a cacheline or 512-byte sector or
something like that), so it doesn't _sound_ like something like DMA
failure or chipset programming, but who the hell knows..
Linus
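As a rough check of the arithmetic above, here is a minimal sketch in shell.
The cacheline sizes (32 and 64 bytes) and the 4096-byte page size are
assumptions about this x86 box, and `remainders` is a hypothetical helper,
not something from the thread:

```sh
#!/bin/sh
# Hypothetical helper: print N modulo a few sizes that a DMA or chipset
# problem would usually line up with.
remainders() {
    n=$1
    for unit in 8 32 64 512; do
        echo "$n % $unit = $((n % unit))"
    done
    # Assumes 4096-byte pages: offset at which a run of N bytes that
    # reaches the end of the page would start.
    echo "offset in page = $((4096 - n))"
}

remainders 1352
```

Only the 8-byte remainder comes out as zero, which matches the point that the
length does not look like a sector or cacheline multiple.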
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 3:20 ` Andrew Morton
@ 2004-05-16 3:58 ` Linus Torvalds
0 siblings, 0 replies; 68+ messages in thread
From: Linus Torvalds @ 2004-05-16 3:58 UTC (permalink / raw)
To: Andrew Morton; +Cc: Steven Cole, adi, scole, support, linux-kernel
On Sat, 15 May 2004, Andrew Morton wrote:
>
> Two hours so far here.
>
> bix:/usr/src> ~/clone.sh
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
>
> That's 2.6.6-mm2+, 2GB 4-way x86.
I think Steven's machine (according to an earlier 'dmesg') has something
like 384MB of RAM and just one PIII-450 CPU.
With that setup, he's likely getting a lot of IO (and possibly even
swapping). The BK disk working set for the kernel archive is something
like half a gig per tree, I think.
In contrast, your nicer machine will do the whole stress-test basically
totally cached (well, BK will force writeback with fsync, but it will all
be pretty synchronous with nothing else going on).
So if it's IO-related or happens when swapping...
But again, neither of those should usually cause that kind of strange
partial-page corruption.
Linus
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 3:44 ` Linus Torvalds
@ 2004-05-16 4:31 ` Steven Cole
2004-05-16 4:52 ` Linus Torvalds
2004-05-16 10:01 ` Andrew Morton
0 siblings, 2 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-16 4:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, adi, scole, support, linux-kernel
On Saturday 15 May 2004 09:44 pm, Linus Torvalds wrote:
>
> On Sat, 15 May 2004, Linus Torvalds wrote:
> >
> >
> > On Sat, 15 May 2004, Steven Cole wrote:
> > >
> > > > In the spirit of 'rounding up the usual suspects', I'll unset CONFIG_PREEMPT
> > > and try again.
> >
> > Or it could be any number of other config options. Do you have anything
> > else interesting enabled?
>
> Ahh, looking at an earlier email I see that you have CONFIG_REGPARM=y too.
>
> That could easily be pretty dangerous - there have been both compiler bugs
> in this area, and just kernel bugs (missing "asmlinkage" things causing
> bad calling conventions and really nasty bugs).
>
> So please try without both PREEMPT and REGPARM. Considering that it's
> apparently very repeatable for you, I'd be more inclined to worry about
> REGPARM than PREEMPT, but it's best to try with both disabled.
>
> I also worry about that PDC202XX controller, but that 1352 is a strange
> number (divisible by 8, but not by a cacheline or 512-byte sector or
> something like that), so it doesn't _sound_ like something like DMA
> failure or chipset programming, but who the hell knows..
>
> Linus
>
>
OK, will do. I ran the bk exerciser script for over an hour with 2.6.6-current
and no CONFIG_PREEMPT and no errors. The script only reported one
iteration finished, while I got it to do 36 iterations over several hours earlier
today (with a 2.6.3-4mdk vendor kernel), so I'm going to add some timing
tests to the script to see if things are really slowing down with current kernels,
or if it's just my worried imaginings.
Going back through my sent mail, it looks like I first reported the originally
noticed failure as a bk bug to LM on 4/15/04, if that helps with kernel versions.
I usually do a bk pull and kernel build every night or so "just for fun".
Steven
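One possible way to add that kind of timing to the loop, as a sketch only:
`timed` is a hypothetical helper, and `date +%s` (seconds since the epoch) is
assumed to be available, as it is with GNU date. The bk commands in the
comments are the ones from Andy's script.

```sh
#!/bin/sh
# Hypothetical helper: run a command, then report its wall-clock time
# in whole seconds together with its exit status.
timed() {
    label=$1
    shift
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "$label: $((end - start))s (exit $status)"
    return "$status"
}

# Possible use inside the existing loop (not executed here):
#   timed clone bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
#   (cd foo && timed pull bk pull -q)
```

Whole-second resolution is coarse, but it should be enough to tell a
15-minute clone from an 11-minute one.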
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 4:31 ` Steven Cole
@ 2004-05-16 4:52 ` Linus Torvalds
2004-05-16 5:22 ` Andrea Arcangeli
2004-05-16 5:54 ` Steven Cole
2004-05-16 10:01 ` Andrew Morton
1 sibling, 2 replies; 68+ messages in thread
From: Linus Torvalds @ 2004-05-16 4:52 UTC (permalink / raw)
To: Steven Cole
Cc: Andrew Morton, adi, scole, support, Kernel Mailing List,
Andrea Arcangeli
On Sat, 15 May 2004, Steven Cole wrote:
>
> OK, will do. I ran the bk exerciser script for over an hour with 2.6.6-current
> and no CONFIG_PREEMPT and no errors. The script only reported one
> iteration finished, while I got it to do 36 iterations over several hours earlier
> today (with a 2.6.3-4mdk vendor kernel)
Hmm.. The current BK tree contains much of the anonvma stuff, so this
might actually be a serious VM performance regression. That could
effectively be hiding whatever problem you saw.
Andrea: have you tested under low memory and high fs load? Steven has 384M
of RAM, which _will_ cause a lot of VM activity when doing a full kernel
BK clone + undo + pull, which is what his test script ends up doing...
It would be good to test going back to the kernel that saw the "immediate
problem", and try that version without CONFIG_PREEMPT.
Linus
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 4:52 ` Linus Torvalds
@ 2004-05-16 5:22 ` Andrea Arcangeli
2004-05-16 15:28 ` Steven Cole
2004-05-16 5:54 ` Steven Cole
1 sibling, 1 reply; 68+ messages in thread
From: Andrea Arcangeli @ 2004-05-16 5:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Steven Cole, Andrew Morton, adi, scole, support,
Kernel Mailing List
On Sat, May 15, 2004 at 09:52:50PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 15 May 2004, Steven Cole wrote:
> >
> > OK, will do. I ran the bk exerciser script for over an hour with 2.6.6-current
> > and no CONFIG_PREEMPT and no errors. The script only reported one
> > iteration finished, while I got it to do 36 iterations over several hours earlier
> > today (with a 2.6.3-4mdk vendor kernel)
>
> Hmm.. The current BK tree contains much of the anonvma stuff, so this
> might actually be a serious VM performance regression. That could
> effectively be hiding whatever problem you saw.
>
> Andrea: have you tested under low memory and high fs load? Steven has 384M
> of RAM, which _will_ cause a lot of VM activity when doing a full kernel
> BK clone + undo + pull, which is what his test script ends up doing...
An easy way to verify for Steven is to give a quick spin to 2.6.5-aa5
and see if it's slow too, that will rule out the anon-vma changes
(for completeness: there's a minor race in 2.6.5-aa5 fixed in my current
internal tree, I posted the fix to l-k separately, but you can ignore
the fix for a simple test, it takes weeks to trigger anyways and you
need threads to trigger it and I've never seen threaded version control
systems so I doubt BK is threaded).
In general a "slowdown" cannot be related to anon-vma (unless it's a
minor merging error), that's a black and white thing, it doesn't touch
the vm heuristics and it will only speed the fast paths up plus it will
save some tons of ram in the big systems. Practically no change should be
measurable on a small system (unless it uses a heavy amount of cows, in
which case it will improve things, it should never hurt). As for being
tested, it is very well tested on the small desktops too. Probably the
only thing to double check is that there was no minor merging error that
could have caused this.
> It would be good to test going back to the kernel that saw the "immediate
> problem", and try that version without CONFIG_PREEMPT.
Agreed.
Thanks.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 4:52 ` Linus Torvalds
2004-05-16 5:22 ` Andrea Arcangeli
@ 2004-05-16 5:54 ` Steven Cole
2004-05-16 6:09 ` Andrew Morton
1 sibling, 1 reply; 68+ messages in thread
From: Steven Cole @ 2004-05-16 5:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, adi, scole, support, Kernel Mailing List,
Andrea Arcangeli
On Saturday 15 May 2004 10:52 pm, Linus Torvalds wrote:
>
> On Sat, 15 May 2004, Steven Cole wrote:
> >
> > OK, will do. I ran the bk exerciser script for over an hour with 2.6.6-current
> > and no CONFIG_PREEMPT and no errors. The script only reported one
> > iteration finished, while I got it to do 36 iterations over several hours earlier
> > today (with a 2.6.3-4mdk vendor kernel)
>
> Hmm.. The current BK tree contains much of the anonvma stuff, so this
> might actually be a serious VM performance regression. That could
> effectively be hiding whatever problem you saw.
[steven@spc steven]$ vmstat -n 1 15
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 2 16644 3056 5288 71752 13 25 652 357 3024 357 36 35 6 22
0 1 16644 2832 5288 72036 0 0 1052 0 3155 300 20 29 0 51
1 0 16644 2708 5284 72072 0 0 788 0 2586 333 19 26 0 55
0 2 16644 3216 5288 71976 0 0 932 0 2850 291 20 26 0 54
1 1 16644 2832 5292 72500 0 0 1036 0 3093 329 20 30 0 50
0 1 16644 3088 5264 72688 0 0 1000 303 3561 449 21 35 0 43
0 2 16644 3216 5276 72384 0 0 720 0 2475 335 19 23 0 58
1 1 16644 2848 5292 72440 60 0 763 0 2544 372 18 25 0 58
0 3 16644 3152 5172 72136 0 0 776 4 2530 392 20 24 0 55
0 1 16644 3216 5200 71848 0 0 945 0 2893 375 20 27 0 53
1 1 16644 2512 5464 71488 0 0 924 260 2899 364 18 26 0 56
1 1 16644 2832 5500 71348 0 0 880 224 3714 320 20 36 0 45
1 1 16644 3208 5380 71316 0 0 932 7 2879 412 20 28 0 52
0 1 16644 3280 5356 71172 0 0 924 0 2828 348 22 27 0 51
1 0 16644 3748 5368 71728 0 0 1056 0 2867 343 17 30 0 53
[steven@spc steven]$ free
total used free shared buffers cached
Mem: 386472 384032 2440 0 3936 113748
-/+ buffers/cache: 266348 120124
Swap: 1067928 16644 1051284
>
> Andrea: have you tested under low memory and high fs load? Steven has 384M
> of RAM, which _will_ cause a lot of VM activity when doing a full kernel
> BK clone + undo + pull, which is what his test script ends up doing...
>
> It would be good to test going back to the kernel that saw the "immediate
> problem", and try that version without CONFIG_PREEMPT.
>
> Linus
>
>
I'll give that a try tomorrow. I'll let this thing (sans PREEMPT and REGPARM) cook
on Andy's script overnight. The kernel is up to date through Changeset 1.1724.
The original problem happened only about 20% of the time during bk pulls.
Prior to 4/15/04 (or perhaps a day or two before at most), it never happened.
The 'it' being the 'Assertion `s && s->tree' failed' during a bk pull.
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 5:54 ` Steven Cole
@ 2004-05-16 6:09 ` Andrew Morton
2004-05-16 6:24 ` Andrew Morton
0 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2004-05-16 6:09 UTC (permalink / raw)
To: Steven Cole; +Cc: torvalds, adi, scole, support, linux-kernel, andrea
Steven Cole <elenstev@mesatop.com> wrote:
>
> > Hmm.. The current BK tree contains much of the anonvma stuff, so this
> > might actually be a serious VM performance regression. That could
> > effectively be hiding whatever problem you saw.
>
> [steven@spc steven]$ vmstat -n 1 15
I have a feeling that the pageout performance got broken again, even more.
It was OK for a while and we need to backtrack and see where it went wrong.
The below might improve things, but I doubt it.
From: Nick Piggin <nickpiggin@yahoo.com.au>
If the zone has a very small number of inactive pages, local variable
`ratio' can be huge and we do way too much scanning. So much so that Ingo
hit an NMI watchdog expiry, although that was because the zone would have
had a single refcount-zero page in it, and that logic recently got fixed up
via get_page_testone().
Nick's patch simply puts a sane-looking upper bound on the number of pages
which we'll scan in this round. It hasn't had a lot of thought or testing
yet.
---
25-akpm/mm/vmscan.c | 24 +++++++++++++++---------
1 files changed, 15 insertions(+), 9 deletions(-)
diff -puN mm/vmscan.c~vm-shrink-zone mm/vmscan.c
--- 25/mm/vmscan.c~vm-shrink-zone 2004-05-15 23:08:36.471571816 -0700
+++ 25-akpm/mm/vmscan.c 2004-05-15 23:08:36.476571056 -0700
@@ -745,23 +745,29 @@ static int
shrink_zone(struct zone *zone, int max_scan, unsigned int gfp_mask,
int *total_scanned, struct page_state *ps, int do_writepage)
{
- unsigned long ratio;
+ unsigned long scan_active;
int count;
/*
* Try to keep the active list 2/3 of the size of the cache. And
* make sure that refill_inactive is given a decent number of pages.
*
- * The "ratio+1" here is important. With pagecache-intensive workloads
- * the inactive list is huge, and `ratio' evaluates to zero all the
- * time. Which pins the active list memory. So we add one to `ratio'
- * just to make sure that the kernel will slowly sift through the
- * active list.
+ * The "scan_active + 1" here is important. With pagecache-intensive
+ * workloads the inactive list is huge, and `ratio' evaluates to zero
+ * all the time. Which pins the active list memory. So we add one to
+ * `scan_active' just to make sure that the kernel will slowly sift
+ * through the active list.
*/
- ratio = (unsigned long)SWAP_CLUSTER_MAX * zone->nr_active /
- ((zone->nr_inactive | 1) * 2);
+ if (zone->nr_active >= 4*(zone->nr_inactive*2 + 1)) {
+ /* Don't scan more than 4 times the inactive list scan size */
+ scan_active = 4*max_scan;
+ } else {
+ /* Cast to long long so the multiply doesn't overflow */
+ scan_active = (unsigned long long)max_scan * zone->nr_active
+ / (zone->nr_inactive*2 + 1);
+ }
- atomic_add(ratio+1, &zone->nr_scan_active);
+ atomic_add(scan_active + 1, &zone->nr_scan_active);
count = atomic_read(&zone->nr_scan_active);
if (count >= SWAP_CLUSTER_MAX) {
atomic_set(&zone->nr_scan_active, 0);
_
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 6:09 ` Andrew Morton
@ 2004-05-16 6:24 ` Andrew Morton
0 siblings, 0 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-16 6:24 UTC (permalink / raw)
To: elenstev, torvalds, adi, scole, support, linux-kernel, andrea
Andrew Morton <akpm@osdl.org> wrote:
>
> The below might improve things, but I doubt it.
>
Might as well send something which compiles, hey?
diff -puN mm/vmscan.c~vm-shrink-zone mm/vmscan.c
--- 25/mm/vmscan.c~vm-shrink-zone 2004-05-15 23:10:08.800535680 -0700
+++ 25-akpm/mm/vmscan.c 2004-05-15 23:19:20.015738232 -0700
@@ -745,23 +745,33 @@ static int
shrink_zone(struct zone *zone, int max_scan, unsigned int gfp_mask,
int *total_scanned, struct page_state *ps, int do_writepage)
{
- unsigned long ratio;
+ unsigned long scan_active;
int count;
/*
* Try to keep the active list 2/3 of the size of the cache. And
* make sure that refill_inactive is given a decent number of pages.
*
- * The "ratio+1" here is important. With pagecache-intensive workloads
- * the inactive list is huge, and `ratio' evaluates to zero all the
- * time. Which pins the active list memory. So we add one to `ratio'
- * just to make sure that the kernel will slowly sift through the
- * active list.
+ * The "scan_active + 1" here is important. With pagecache-intensive
+ * workloads the inactive list is huge, and `ratio' evaluates to zero
+ * all the time. Which pins the active list memory. So we add one to
+ * `scan_active' just to make sure that the kernel will slowly sift
+ * through the active list.
*/
- ratio = (unsigned long)SWAP_CLUSTER_MAX * zone->nr_active /
- ((zone->nr_inactive | 1) * 2);
+ if (zone->nr_active >= 4*(zone->nr_inactive*2 + 1)) {
+ /* Don't scan more than 4 times the inactive list scan size */
+ scan_active = 4*max_scan;
+ } else {
+ unsigned long long tmp;
+
+ /* Cast to long long so the multiply doesn't overflow */
+
+ tmp = (unsigned long long)max_scan * zone->nr_active;
+ do_div(tmp, zone->nr_inactive*2 + 1);
+ scan_active = (unsigned long)tmp;
+ }
- atomic_add(ratio+1, &zone->nr_scan_active);
+ atomic_add(scan_active + 1, &zone->nr_scan_active);
count = atomic_read(&zone->nr_scan_active);
if (count >= SWAP_CLUSTER_MAX) {
atomic_set(&zone->nr_scan_active, 0);
_
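To see what the clamp in the patch above does to the numbers, here is the
same arithmetic restated in shell, purely as an illustration. The kernel code
is C and uses do_div(); the function names, the sample zone sizes, and the
value 32 used for both SWAP_CLUSTER_MAX and max_scan are assumptions made for
this sketch.

```sh
#!/bin/sh
# Old calculation: cluster * nr_active / ((nr_inactive | 1) * 2), unbounded.
old_ratio() {
    cluster=$1
    nr_active=$2
    nr_inactive=$3
    echo $((cluster * nr_active / ((nr_inactive | 1) * 2)))
}

# Patched calculation: same ratio based on max_scan, capped at 4 * max_scan.
scan_active() {
    max_scan=$1
    nr_active=$2
    nr_inactive=$3
    denom=$((nr_inactive * 2 + 1))
    if [ "$nr_active" -ge $((4 * denom)) ]; then
        echo $((4 * max_scan))
    else
        echo $((max_scan * nr_active / denom))
    fi
}

# Nearly empty inactive list: the case the patch is aimed at.
echo "old: $(old_ratio 32 100000 1)  new: $(scan_active 32 100000 1)"
# Large inactive list: both calculations give a small value.
echo "old: $(old_ratio 32 1000 2000)  new: $(scan_active 32 1000 2000)"
```

With one inactive page and 100000 active pages the old expression asks for
1600000 pages of scanning while the capped version asks for 128; with a large
inactive list the two agree.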
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 4:31 ` Steven Cole
2004-05-16 4:52 ` Linus Torvalds
@ 2004-05-16 10:01 ` Andrew Morton
2004-05-16 13:49 ` Steven Cole
1 sibling, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2004-05-16 10:01 UTC (permalink / raw)
To: Steven Cole; +Cc: torvalds, adi, scole, support, linux-kernel
Steven Cole <elenstev@mesatop.com> wrote:
>
> The script only reported one
> iteration finished, while I got it to do 36 iterations over several hours earlier
> today (with a 2.6.3-4mdk vendor kernel), so I'm going to add some timing
> tests to the script to see if things are really slowing down with current kernels,
> or if it's just my worried imaginings.
I did a bit of testing on a 256MB laptop with a fairly slow disk, ext3.
Three iterations of the test took:
2.6.6: 1055.53s user 327.14s system 32% cpu 1:10:06.71 total
2.4.27-pre2: 1042.03s user 307.21s system 32% cpu 1:09:46.00 total
2.6.3: 1053.23s user 326.16s system 27% cpu 1:22:07.24 total
So there's nothing particularly wild there. It's possible I guess that the
2.6 VM is very sucky but something else made up for it - possibly the
anticipatory scheduler but more likely the Orlov allocator.
You're using reiserfs, yes?
Are you sure the IDE disks are in DMA mode?
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 10:01 ` Andrew Morton
@ 2004-05-16 13:49 ` Steven Cole
0 siblings, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-16 13:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: torvalds, adi, scole, support, linux-kernel
On Sunday 16 May 2004 04:01 am, Andrew Morton wrote:
> Steven Cole <elenstev@mesatop.com> wrote:
> >
> > The script only reported one
> > iteration finished, while I got it to do 36 iterations over several hours earlier
> > today (with a 2.6.3-4mdk vendor kernel), so I'm going to add some timing
> > tests to the script to see if things are really slowing down with current kernels,
> > or if it's just my worried imaginings.
>
> I did a bit of testing on a 256MB laptop with a fairly slow disk, ext3.
> Three iterations of the test took:
>
> 2.6.6: 1055.53s user 327.14s system 32% cpu 1:10:06.71 total
>
> 2.4.27-pre2: 1042.03s user 307.21s system 32% cpu 1:09:46.00 total
>
> 2.6.3: 1053.23s user 326.16s system 27% cpu 1:22:07.24 total
>
>
> So there's nothing particularly wild there. It's possible I guess that the
> 2.6 VM is very sucky but something else made up for it - possibly the
> anticipatory scheduler but more likely the Orlov allocator.
>
> You're using reiserfs, yes?
Yes, but I also have a BK repo on an ext3 fs which saw the exact
same initial problem. Well, almost, in that I got the Assertion `s && s->tree' failed
message, but didn't preserve the possibly null-containing s.ChangeSet file.
>
> Are you sure the IDE disks are in DMA mode?
>
>
For the 2.6.3 tests, yes.
For the overnight 2.6.6 tests, apparently not, but it is now.
Without DMA, the clone took about 22 minutes, but even with
DMA back on, 2.6.6-current is much slower than 2.6.3-4mdk.
It's hard to make your wheels fall off when putting around the
race track at 2/3 speed. If I get time, I'll try the patch from Nick
and/or a kernel from around 4/15 when I first saw this problem,
but the weather's nice outside, so it may be later this evening
before I return to testing.
/dev/hda:
multcount = 0 (off)
IO_support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 256 (on)
geometry = 65535/16/63, sectors = 78125000, start = 0
----------------------------
2.6.3-4mdk:
time bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
287.93user 59.33system 11:14.03elapsed 51%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (41264major+523595minor)pagefaults 0swaps
(cd foo; time bk pull -q)
397.05user 184.05system 16:38.35elapsed 58%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (30287major+2115127minor)pagefaults 0swaps
-----------------------------
-----------------------------
2.6.6-current:
time bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
290.48user 96.76system 15:00.85elapsed 42%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (50893major+515118minor)pagefaults 0swaps
(cd foo; time bk pull -q)
402.74user 254.98system 23:25.43elapsed 46%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (60130major+2089908minor)pagefaults 0swaps
------------------------------
Steven
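For putting a number on the slowdown in the elapsed times above, a small
sketch. `to_cs` and `pct_slower` are hypothetical helpers; they assume the
m:ss.cc elapsed format printed by time(1) above (no hours field) and that awk
is installed.

```sh
#!/bin/sh
# Convert an elapsed time of the form m:ss.cc into centiseconds.
to_cs() {
    echo "$1" | awk -F'[:.]' '{ print ($1 * 60 + $2) * 100 + $3 }'
}

# Percentage by which the second elapsed time exceeds the first,
# truncated to a whole number.
pct_slower() {
    base=$(to_cs "$1")
    new=$(to_cs "$2")
    echo $(( (new - base) * 100 / base ))
}

echo "clone: $(pct_slower 11:14.03 15:00.85)% longer"
echo "pull:  $(pct_slower 16:38.35 23:25.43)% longer"
```

With the figures quoted above this works out to roughly a third longer for
the clone (33%) and about 40% longer for the pull.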
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 5:22 ` Andrea Arcangeli
@ 2004-05-16 15:28 ` Steven Cole
2004-05-16 17:49 ` Rutger Nijlunsing
2004-05-16 20:38 ` Andrea Arcangeli
0 siblings, 2 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-16 15:28 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Andrew Morton, adi, scole, support,
Kernel Mailing List
On Saturday 15 May 2004 11:22 pm, Andrea Arcangeli wrote:
> On Sat, May 15, 2004 at 09:52:50PM -0700, Linus Torvalds wrote:
> >
> >
> > On Sat, 15 May 2004, Steven Cole wrote:
> > >
> > > OK, will do. I ran the bk exerciser script for over an hour with 2.6.6-current
> > > and no CONFIG_PREEMPT and no errors. The script only reported one
> > > iteration finished, while I got it to do 36 iterations over several hours earlier
> > > today (with a 2.6.3-4mdk vendor kernel)
> >
> > > Hmm.. The current BK tree contains much of the anonvma stuff, so this
> > might actually be a serious VM performance regression. That could
> > effectively be hiding whatever problem you saw.
> >
> > Andrea: have you tested under low memory and high fs load? Steven has 384M
> > > of RAM, which _will_ cause a lot of VM activity when doing a full kernel
> > BK clone + undo + pull, which is what his test script ends up doing...
>
> An easy way to verify for Steven is to give a quick spin to 2.6.5-aa5
> and see if it's slow too, that will rule out the anon-vma changes
> (for completeness: there's a minor race in 2.6.5-aa5 fixed in my current
> internal tree, I posted the fix to l-k separately, but you can ignore
> the fix for a simple test, it takes weeks to trigger anyways and you
> need threads to trigger it and I've never seen threaded version control
> systems so I doubt BK is threaded).
I'm getting the linux-2.6.5.tar.bz2 file (already got 2.6.5-aa2) via ppp,
while running the bk test script on 2.6.6-current and no PREEMPT.
That takes a while on 56k dialup. I'll leave all that running while
I go hiking.
>
> In general a "slowdown" cannot be related to anon-vma (unless it's a
> minor merging error), that's a black and white thing, it doesn't touch
> the vm heuristics and it will only speed the fast paths up plus it will
> save some tons of ram in the big systems. Practically no change should be
> measurable on a small system (unless it uses a heavy amount of cows, in
> which case it will improve things, it should never hurt). As for being
> tested, it is very well tested on the small desktops too. Probably the
> only thing to double check is that there was no minor merging error that
> could have caused this.
Andrea, I did see a significant slowdown with Andy's test script (with DMA on)
on my timed test of 2.6.6-current vs 2.6.3.
>
> > It would be good to test going back to the kernel that saw the "immediate
> > problem", and try that version without CONFIG_PREEMPT.
>
> Agreed.
>
> Thanks.
>
>
Yep, later this evening, I hope.
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 15:28 ` Steven Cole
@ 2004-05-16 17:49 ` Rutger Nijlunsing
2004-05-16 20:38 ` Andrea Arcangeli
1 sibling, 0 replies; 68+ messages in thread
From: Rutger Nijlunsing @ 2004-05-16 17:49 UTC (permalink / raw)
To: Kernel Mailing List
> > An easy way to verify for Steven is to give a quick spin to 2.6.5-aa5
> > and see if it's slow too, that will rule out the anon-vma changes
> > (for completeness: there's a minor race in 2.6.5-aa5 fixed in my current
> > internal tree, I posted the fix to l-k separately, but you can ignore
> > the fix for a simple test, it takes weeks to trigger anyways and you
> > need threads to trigger it and I've never seen threaded version control
> > systems so I doubt BK is threaded).
>
> I'm getting the linux-2.6.5.tar.bz2 file (already got 2.6.5-aa2) via ppp,
> while running the bk test script on 2.6.6-current and no PREEMPT.
> That takes a while on 56k dialup. I'll leave all that running while
> I go hiking.
ketchup should get you faster to 2.6.5 vanilla...
--
Rutger Nijlunsing ---------------------------- rutger ed tux tmfweb nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 15:28 ` Steven Cole
2004-05-16 17:49 ` Rutger Nijlunsing
@ 2004-05-16 20:38 ` Andrea Arcangeli
2004-05-16 21:19 ` Steven Cole
1 sibling, 1 reply; 68+ messages in thread
From: Andrea Arcangeli @ 2004-05-16 20:38 UTC (permalink / raw)
To: Steven Cole
Cc: Linus Torvalds, Andrew Morton, adi, scole, support,
Kernel Mailing List
On Sun, May 16, 2004 at 09:28:21AM -0600, Steven Cole wrote:
> Andrea, I did see a significant slowdown with Andy's test script (with DMA on)
> on my timed test of 2.6.6-current vs 2.6.3.
2.6.3 is quite old; as Andrew suspects, this is more likely a
vm heuristic issue if anything, it cannot be anon-vma related.
btw, if you've 2.6.3 you should download just two patches to go to
2.6.5.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 20:38 ` Andrea Arcangeli
@ 2004-05-16 21:19 ` Steven Cole
2004-05-16 21:29 ` Andrew Morton
0 siblings, 1 reply; 68+ messages in thread
From: Steven Cole @ 2004-05-16 21:19 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Linus Torvalds, Andrew Morton, adi, scole, support,
Kernel Mailing List
On Sunday 16 May 2004 02:38 pm, Andrea Arcangeli wrote:
> On Sun, May 16, 2004 at 09:28:21AM -0600, Steven Cole wrote:
> > Andrea, I did see a significant slowdown with Andy's test script (with DMA on)
> > on my timed test of 2.6.6-current vs 2.6.3.
>
> 2.6.3 is quite old, as Andrew is wondering about, this is more likely a
> vm heuristic issue if something, it cannot be anon-vma related.
>
> btw, if you've 2.6.3 you should download just two patches to go to
> 2.6.5.
>
>
Sure, but I also wanted to beat the ppp paths while I did other things.
I've been using bk to keep a current kernel, and my older source
trees were sitting on another (disconnected) disk. The 2.6.3 is
the vendor kernel.
That download succeeded, better than I've experienced for a long
time, possibly due to not having PREEMPT set. With PREEMPT and
2.6.x kernels, I had been getting this, and ppp would stop moving data.
May 13 18:09:30 spc kernel: serial8250: too much work for irq10
I did build, boot and run 2.6.5-aa5 (which had -aa4 in the EXTRAVERSION),
but the results for the bk exerciser script were similar to 2.6.6-current:
-----------------------------
2.6.6-current:
time bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
290.48user 96.76system 15:00.85elapsed 42%CPU
(cd foo; time bk pull -q)
402.74user 254.98system 23:25.43elapsed 46%CPU
------------------------------
2.6.5-aa5:
time bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
290.78user 94.29system 16:06.73elapsed 39%CPU
(cd foo; time bk pull -q)
401.82user 234.05system 23:36.32elapsed 44%CPU
------------------------------
2.6.3-4mdk (repeated run)
time bk clone -qlr40514130hBbvgP4CvwEVEu27oxm46w testing-2.6 foo
288.71user 58.47system 10:57.37elapsed 52%CPU
(cd foo; time bk pull -q)
397.94user 186.18system 17:24.73elapsed 55%CPU
Anyway, although the regression for my particular machine for this
particular load may be interesting, the good news is that I've seen
none of the failures which started this whole thread, which are relatively
easily reproducible with PREEMPT set.
Perhaps PREEMPT should be renamed to BUG_FLUSH. :)
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 21:19 ` Steven Cole
@ 2004-05-16 21:29 ` Andrew Morton
2004-05-16 22:11 ` Steven Cole
0 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2004-05-16 21:29 UTC (permalink / raw)
To: Steven Cole; +Cc: andrea, torvalds, adi, scole, support, linux-kernel
Steven Cole <elenstev@mesatop.com> wrote:
>
> Anyway, although the regression for my particular machine for this
> particular load may be interesting, the good news is that I've seen
> none of the failures which started this whole thread, which are relatively
> easily reproducible with PREEMPT set.
So... would it be correct to say that with CONFIG_PREEMPT, ppp or its
underlying driver stack
a) screws up the connection and hangs and
b) scribbles on pagecache?
Because if so, the same will probably happen on SMP.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 21:29 ` Andrew Morton
@ 2004-05-16 22:11 ` Steven Cole
2004-05-16 23:53 ` Andrea Arcangeli
2004-05-17 8:21 ` R. J. Wysocki
0 siblings, 2 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-16 22:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: andrea, torvalds, adi, scole, support, linux-kernel
On Sunday 16 May 2004 03:29 pm, Andrew Morton wrote:
> Steven Cole <elenstev@mesatop.com> wrote:
> >
> > Anyway, although the regression for my particular machine for this
> > particular load may be interesting, the good news is that I've seen
> > none of the failures which started this whole thread, which are relatively
> > easily reproducible with PREEMPT set.
>
> So... would it be correct to say that with CONFIG_PREEMPT, ppp or its
> underlying driver stack
>
> a) screws up the connection and hangs and
>
> b) scribbles on pagecache?
>
> Because if so, the same will probably happen on SMP.
>
Perhaps someone has the hardware to test this.
To summarize my experience with the past 24 hours of testing:
Without PREEMPT, everything is rock solid.
I may have a window of time later this evening to continue testing,
and I (cringes at the thought) may repeat some bk pulls with
PREEMPT set.
Later,
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 22:11 ` Steven Cole
@ 2004-05-16 23:53 ` Andrea Arcangeli
2004-05-17 2:12 ` Steven Cole
2004-05-17 8:21 ` R. J. Wysocki
1 sibling, 1 reply; 68+ messages in thread
From: Andrea Arcangeli @ 2004-05-16 23:53 UTC (permalink / raw)
To: Steven Cole; +Cc: Andrew Morton, torvalds, adi, scole, support, linux-kernel
On Sun, May 16, 2004 at 04:11:16PM -0600, Steven Cole wrote:
> On Sunday 16 May 2004 03:29 pm, Andrew Morton wrote:
> > Steven Cole <elenstev@mesatop.com> wrote:
> > >
> > > Anyway, although the regression for my particular machine for this
> > > particular load may be interesting, the good news is that I've seen
> > > none of the failures which started this whole thread, which are relatively
> > > > easily reproducible with PREEMPT set.
> >
> > So... would it be correct to say that with CONFIG_PREEMPT, ppp or its
> > underlying driver stack
> >
> > a) screws up the connection and hangs and
> >
> > b) scribbles on pagecache?
> >
> > Because if so, the same will probably happen on SMP.
> >
> Perhaps someone has the hardware to test this.
>
> To summarize my experience with the past 24 hours of testing:
> Without PREEMPT, everything is rock solid.
so we've two separate problems: the first is the ppp instability with
preempt, the second is a regression in the vm heuristics between 2.6.3
and 2.6.5.
> and I (cringes at the thought) may repeat some bk pulls with
> PREEMPT set.
I've heard other reports of preempt being unstable with some sound
stuff, so just in case: are you using sound drivers at all during that
workload?
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 23:53 ` Andrea Arcangeli
@ 2004-05-17 2:12 ` Steven Cole
0 siblings, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-17 2:12 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, torvalds, adi, scole, support, linux-kernel
On Sunday 16 May 2004 05:53 pm, Andrea Arcangeli wrote:
> On Sun, May 16, 2004 at 04:11:16PM -0600, Steven Cole wrote:
> > On Sunday 16 May 2004 03:29 pm, Andrew Morton wrote:
> > > Steven Cole <elenstev@mesatop.com> wrote:
> > > >
> > > > Anyway, although the regression for my particular machine for this
> > > > particular load may be interesting, the good news is that I've seen
> > > > none of the failures which started this whole thread, which are relatively
> > > > > easily reproducible with PREEMPT set.
> > >
> > > So... would it be correct to say that with CONFIG_PREEMPT, ppp or its
> > > underlying driver stack
> > >
> > > a) screws up the connection and hangs and
> > >
> > > b) scribbles on pagecache?
> > >
> > > Because if so, the same will probably happen on SMP.
> > >
> > Perhaps someone has the hardware to test this.
> >
> > To summarize my experience with the past 24 hours of testing:
> > Without PREEMPT, everything is rock solid.
>
> so we've two separate problems: the first is the ppp instability with
> preempt, the second is a regression in the vm heuristics between 2.6.3
> and 2.6.5.
Yes, that is correct.
The instability was first noticed about one month ago when doing
a bk pull from linus' repository. I've been updating my kernel via
bk almost nightly, and around the time of 2.6.6-rc1 (IIRC), I got the
Assertion `s && s->tree' failed message from bk. At first it was thought
to be related to using an older version (3.0.1) of bk, so that was updated.
A few days later, the problem recurred. Since it only happened about
15% to 20% of the time, and was easy to recover from, I didn't scream
too loudly or too often to bitmover. But then, the problem started becoming
more persistent about a week ago, so I began complaining again. I managed
to get a bitkeeper-generated file to bitmover, who discovered that a very
odd (or even in this case) number of NUL bytes existed where they
should not exist. Hence this thread.
Then during the course of testing, I noticed the significant difference
in time it took to run a test script supplied by bitkeeper for current kernels
versus an older vendor kernel. Hence your being cc'ed.
>
> > and I (cringes at the thought) may repeat some bk pulls with
> > PREEMPT set.
>
> I've heard other reports of preempt being unstable with some sound
> stuff, so just in case: are you using sound drivers at all during that
> workload?
>
>
Yes, mea culpa. CONFIG_SND_ENS1371=y.
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 1:23 ` Steven Cole
2004-05-16 2:18 ` Linus Torvalds
2004-05-16 3:20 ` Andrew Morton
@ 2004-05-17 2:28 ` Larry McVoy
2004-05-17 2:42 ` Linus Torvalds
2 siblings, 1 reply; 68+ messages in thread
From: Larry McVoy @ 2004-05-17 2:28 UTC (permalink / raw)
To: Steven Cole; +Cc: Andrew Morton, adi, scole, support, torvalds, linux-kernel
> renumber: can't read SCCS info in "SCCS/s.ChangeSet".
Be aware that how BK does I/O is with write() on the way out but with
mmap on the way in. The process which forked renumber has just written
the file and the renumber process is reading it with mmap.
If there are still any problems with mixing read/write and mmap then that
may be a problem but I would have expected to see things start going
wrong on a page boundary and the one core dump I saw was page aligned
at the tail but not at the head, it started in the middle of the page.
I've told my team to drop this unless someone can show that it happens
on other kernels, this smells like a kernel bug to me, if it were a BK
bug we should have been getting hundreds of complaints by now. We can
jump back on it if need be, let us know if you think it is a BK problem
after all.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 2:28 ` Larry McVoy
@ 2004-05-17 2:42 ` Linus Torvalds
2004-05-17 3:36 ` Steven Cole
2004-05-17 14:11 ` Larry McVoy
0 siblings, 2 replies; 68+ messages in thread
From: Linus Torvalds @ 2004-05-17 2:42 UTC (permalink / raw)
To: Larry McVoy; +Cc: Steven Cole, Andrew Morton, adi, scole, support, linux-kernel
On Sun, 16 May 2004, Larry McVoy wrote:
>
> Be aware that how BK does I/O is with write() on the way out but with
> mmap on the way in. The process which forked renumber has just written
> the file and the renumber process is reading it with mmap.
>
> If there are still any problems with mixing read/write and mmap then that
> may be a problem but I would have expected to see things start going
> wrong on a page boundary and the one core dump I saw was page aligned
> at the tail but not at the head, it started in the middle of the page.
The kernel should have no problems with mixed read/write and mmap usage,
although user space obviously needs to synchronize the accesses on its own
some way. There is no implicit synchronization otherwise, and the mmap
user can see a partial write at any stage of the write.
Some architectures may have cache coherency issues that makes this
"interesting", but that's not the case on x86 (or indeed anything else
remotely sane - virtual caches are just stupid in this day and age).
> I've told my team to drop this unless someone can show that it happens
> on other kernels, this smells like a kernel bug to me, if it were a BK
> bug we should have been getting hundreds of complaints by now. We can
> jump back on it if need be, let us know if you think it is a BK problem
> after all.
Yeah, I agree. The only other possibility I see is that BK just doesn't
synchronize, and expects writes to be atomically visible to other
processes. They aren't. Preemption might just make this a whole lot more
visible, but on the other hand, so should SMP, so this sounds unlikely.
Linus
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 2:42 ` Linus Torvalds
@ 2004-05-17 3:36 ` Steven Cole
2004-05-17 5:17 ` Linus Torvalds
2004-05-17 14:11 ` Larry McVoy
1 sibling, 1 reply; 68+ messages in thread
From: Steven Cole @ 2004-05-17 3:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry McVoy, Andrew Morton, adi, scole, support, linux-kernel
On Sunday 16 May 2004 08:42 pm, Linus Torvalds wrote:
>
> On Sun, 16 May 2004, Larry McVoy wrote:
> >
> > Be aware that how BK does I/O is with write() on the way out but with
> > mmap on the way in. The process which forked renumber has just written
> > the file and the renumber process is reading it with mmap.
> >
> > If there are still any problems with mixing read/write and mmap then that
> > may be a problem but I would have expected to see things start going
> > wrong on a page boundary and the one core dump I saw was page aligned
> > at the tail but not at the head, it started in the middle of the page.
>
> The kernel should have no problems with mixed read/write and mmap usage,
> although user space obviously needs to synchronize the accesses on its own
> some way. There is no implicit synchronization otherwise, and the mmap
> user can see a partial write at any stage of the write.
>
> Some architectures may have cache coherency issues that makes this
> "interesting", but that's not the case on x86 (or indeed anything else
> remotely sane - virtual caches are just stupid in this day and age).
>
> > I've told my team to drop this unless someone can show that it happens
> > on other kernels, this smells like a kernel bug to me, if it were a BK
> > bug we should have been getting hundreds of complaints by now. We can
> > jump back on it if need be, let us know if you think it is a BK problem
> > after all.
>
> Yeah, I agree. The only other possibility I see is that BK just doesn't
> synchronize, and expects writes to be atomically visible to other
> processes. They aren't. Preemption might just make this a whole lot more
> visible, but on the other hand, so should SMP, so this sounds unlikely.
>
> Linus
>
>
Larry, Linus,
I beat on this for the last day without PREEMPT and saw no failures at all.
Several kernels, rock solid all.
Rebooted with a current (as of a couple hours ago) kernel and PREEMPT=y,
and after about the third pull into a repository (I have several), splaaat!
Here are the symptoms. Same message from bk as usual:
---------------------------------------------------------------------------
takepatch: saved entire patch in PENDING/2004-05-16.01
---------------------------------------------------------------------------
Applying 15 revisions to ChangeSet renumber: can't read SCCS info in "RESYNC/SCCS/s.ChangeSet".
bk: takepatch.c:1343: applyCsetPatch: Assertion `s && s->tree' failed.
10586 bytes uncompressed to 52074, 4.92X expansion
One of Larry's guys, Rick Smith, sent me a little program (the source
is earlier in this thread) to check for null. I called its executable
saga (see subject line).
[steven@spc SCCS]$ saga <s.ChangeSet
Found null start 0x1550b01 end 0x1551000 len 0x4ff line 535587
Found null start 0x2030b01 end 0x2031000 len 0x4ff line 639039
Found null start 0x2330b01 end 0x2331000 len 0x4ff line 663611
That was in the testing-2.6/RESYNC/SCCS directory of course.
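For readers who do not have the checker at hand, here is a minimal sketch of a scanner that produces output of the same shape. It is not Rick Smith's program: the function name `report_nul_runs` and the exact format string are assumptions modelled on the three lines above, with "end" treated as exclusive so that len equals end minus start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not the BitMover tool): scan buf[0..len) and
 * report every maximal run of NUL bytes. "end" is exclusive, so the
 * run length is end - start. Pass out == NULL to count silently.
 * Returns the number of runs found.
 */
static size_t report_nul_runs(const unsigned char *buf, size_t len, FILE *out)
{
    size_t runs = 0;
    size_t line = 1;   /* 1-based line number, advanced on each '\n' */
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        if (buf[i] == '\n')
            line++;
        if (buf[i] != 0) {
            i++;
            continue;
        }

        size_t start = i;
        while (i < len && buf[i] == 0)
            i++;

        if (out)
            fprintf(out,
                    "Found null start 0x%zx end 0x%zx len 0x%zx line %zu\n",
                    start, i, i - start, line);
        runs++;
    }
    return runs;
}
```

Reading the reported numbers with that convention: each run ends on a multiple of 0x1000 (a 4096-byte page boundary) and starts at offset 0xb01 within its page, so the length is 0x1000 - 0xb01 = 0x4ff, or 1279 bytes.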
OK, no more CONFIG_PREEMPT for me. And, ppp failed earlier with:
serial8250: too much work for irq10. That did not happen without
CONFIG_PREEMPT.
I reconnected to my ISP, bk pulled into my main testing repository,
and that's when I got the splaaat.
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 3:36 ` Steven Cole
@ 2004-05-17 5:17 ` Linus Torvalds
2004-05-17 6:11 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 68+ messages in thread
From: Linus Torvalds @ 2004-05-17 5:17 UTC (permalink / raw)
To: Steven Cole
Cc: Larry McVoy, Andrew Morton, William Lee Irwin III, hugh, adi,
scole, support, Kernel Mailing List
On Sun, 16 May 2004, Steven Cole wrote:
>
> I beat on this for the last day without PREEMPT and saw no failures at all.
> Several kernels, rock solid all.
> Rebooted with a current (as of a couple hours ago) kernel and PREEMPT=y,
> and after about the third pull into a repository (I have several), splaaat!
Ok. Good. It's PREEMPT that triggers it. However, I doubt it is
necessarily a preempt bug, I suspect that preempt just opens a window that
is really small even on SMP, and makes it much wider. Wide enough to be
seen.
> One of Larry's guys, Rick Smith, sent me a little program (the source
> is earlier in this thread) to check for null. I called its executable
> saga (see subject line).
>
> [steven@spc SCCS]$ saga <s.ChangeSet
> Found null start 0x1550b01 end 0x1551000 len 0x4ff line 535587
> Found null start 0x2030b01 end 0x2031000 len 0x4ff line 639039
> Found null start 0x2330b01 end 0x2331000 len 0x4ff line 663611
The fact that it's always zeroes, and it's a strange number but it always
ends up being page-aligned at the _end_ makes me strongly suspect that we
have one of the "don't write back data past i_size" things wrong.
There are several cases where we zero the "end of page" before writing
things back - basically everything past the size of the inode is supposed
to be written back as zero. But your symptoms sure sound like we might be
reading the size of the inode without holding the proper locks.
Under normal UP, no races exist, and even on SMP, the window is likely
that another CPU has to be _just_ updating something in between the read
of i_size and the clearing just a few instructions later. What preempt
does is likely that getting an interrupt at the right time inside that
window just makes the window _huge_.
Andrew, the obvious culprit would be the memset() in fs/buffer.c
(block_write_full_page() to be precise):
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
imagine that the "write()" function updates i_size late - after having
written out the new contents to the page, and _after_ having unlocked the
page, and now we get a writeback at the wrong time, and we decide to clear
out the end of the page because we think it's past i_size.
Andrew, what do you think?
I think this race does exist, since generic_file_aio_write_nolock()
literally _does_ update i_size only after it has written all the pages, so
I don't see why a "block_write_full_page()" couldn't come in there between
and zero them out again at the _old_ i_size boundary.
Do you see anything wrong in my analysis? I think the fix would be to make
sure to update i_size as we go around writing each page, before we unlock
the page.
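The suspected ordering can be modelled in plain userspace C. This is only a sketch of the hypothesis, not kernel code and not a confirmed diagnosis: `MODEL_PAGE_SIZE`, `model_writeback_zero_tail()` and `model_lost_bytes()` are invented names, and the model simply assumes that writeback samples the old file size after new data has already been copied into the page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size, as on x86 */

/*
 * Userspace model of the tail-zeroing step: everything in the page past
 * the file size that writeback sampled is cleared. A size that falls
 * exactly on a page boundary leaves the page untouched.
 */
static void model_writeback_zero_tail(unsigned char *page, size_t i_size_seen)
{
    size_t offset = i_size_seen & (MODEL_PAGE_SIZE - 1);

    if (offset)
        memset(page + offset, 0, MODEL_PAGE_SIZE - offset);
}

/*
 * Model of the hypothesised race: data has been copied into the page up
 * to new_size, but writeback still sees old_size. Returns how many bytes
 * of the freshly written data end up as NULs.
 */
static size_t model_lost_bytes(size_t old_size, size_t new_size)
{
    unsigned char page[MODEL_PAGE_SIZE];
    size_t lost = 0;

    assert(old_size <= new_size && new_size <= MODEL_PAGE_SIZE);

    memset(page, 'x', new_size);                            /* new file data */
    memset(page + new_size, 0, MODEL_PAGE_SIZE - new_size); /* beyond EOF */

    model_writeback_zero_tail(page, old_size);              /* stale size */

    for (size_t i = 0; i < new_size; i++)
        if (page[i] == 0)
            lost++;
    return lost;
}
```

With an old size ending at page offset 0xb01 and a write that fills the rest of the page, the model loses 0x4ff bytes, all of them at the tail of the page, which is the shape of the corruption in the report.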
The DIRECT_IO path does this completely wrong and needs to be taught to do
it page-for-page or something, while the generic_commit_write() path looks
like it should be fairly trivially fixable by just moving the
i_size_write() to _before_ the __block_commit_write() call (which will
unlock the page).
Who else knows this code? Maybe I'm missing something. wli? Hugh?
Alternatively, maybe we could remove the "memset()" entirely from the
block_write_full_page() thing (which is asynchronous and not such a good
place to do this), and instead move the whole thing to some nice
synchronous place where we can make sure that we hold the inode semaphore
or something. Like the last close of a shared-writable mmap - since the
only way we can get non-zero after i_size is with a writable mmap.
And no, I can't guarantee this is the bug, but it does seem a bit
suspicious.
Linus
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 5:17 ` Linus Torvalds
@ 2004-05-17 6:11 ` Andrew Morton
2004-05-17 14:07 ` Larry McVoy
2004-05-17 14:12 ` Linus Torvalds
2004-05-17 7:25 ` Andrew Morton
2004-05-17 14:14 ` Larry McVoy
2 siblings, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-17 6:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: elenstev, lm, wli, hugh, adi, scole, support, linux-kernel
Linus Torvalds <torvalds@osdl.org> wrote:
>
> Andrew, the obvious culprit would be the memset() in fs/buffer.c
> (block_write_full_page() to be precise):
>
> memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
>
> imagine that the "write()" function updates i_size late - after having
> written out the new contents to the page, and _after_ having unlocked the
> page, and now we get a writeback at the wrong time, and we decide to clear
> out the end of the page because we think it's past i_size.
>
> Andrew, what do you think?
Interesting. Playing with i_size like that in writepage() _is_ scary. My
immediate reaction is that if this race was real, it's so gross that we
would have spotted it before now in either 2.4 or 2.5->2.6.
Easy test: Steve, could you remove that memset from
block_write_full_page(), see if it changes anything?
It's not very important - it's there because if an application
(incorrectly) writes to mapped data outside EOF we're supposed to drop
their data and write zeroes instead.
> I think this race does exist, since generic_file_aio_write_nolock()
> literally _does_ update i_size only after it has written all the pages, so
> I don't see why a "block_write_full_page()" couldn't come in there between
> and zero them out again at the _old_ i_size boundary.
i_size is updated in generic_commit_write(), on a per-page basis, or I'm
missing something? I sure hope so.
Let's go through the scenarios.
On entry to block_write_full_page(), i_size is in the middle of this page
somewhere. We're worried that i_size can change, and that this will cause
block_write_full_page() to incorrectly zero out the tail of the page.
Well we can stop right there, because the only way someone can get some
more non-zero user data into this page before we memset and write it is by
locking the page beforehand, and block_write_full_page() has the page lock.
(Or they can write stuff into it via mmap, but writing to the page outside
i_size is an application bug).
Other ways in which i_size can change under block_write_full_page()'s feet
are:
- Someone did a truncate.
No problem - the page is about to be invalidated and chopped off the
file anyway.
- Someone did an extending truncate into another page.
OK, i_size will increase but we're still supposed to write zeroes into
the rest of this page outside the previous i_size.
- Someone extended the file into another page with lseek+write or pwrite.
Same argumentation as with extending truncate.
- Someone did an extending truncate to another i_size which lands
*within* this page.
Writing zeroes is still OK: nobody can get into this page to write new
user data anyway - it's locked.
Either all that, or I missed something ;) If Steve can try that test it
would be interesting. Even if removing the memset does make the corruption
go away, this might not be a kernel bug - it could be that the application
is incorrectly relying on mmapped writes outside i_size making it to disk.
As for O_DIRECT: I need to think about that a bit more. We hold i_sem and
have done an fdatasync prior to entering generic_file_aio_write_nolock() so
there should be no dirty pagecache at this stage anyway. The VM may decide
to dirty some pagecache and then block_write_full_page() could come in and
look at i_size and race against generic_file_aio_write_nolock()'s O_DIRECT
i_size_write(). But I doubt if bk is using direct-IO in combination with
MAP_SHARED...
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 5:17 ` Linus Torvalds
2004-05-17 6:11 ` Andrew Morton
@ 2004-05-17 7:25 ` Andrew Morton
2004-05-17 7:46 ` Andrew Morton
2004-05-17 14:05 ` Larry McVoy
2004-05-17 14:14 ` Larry McVoy
2 siblings, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-17 7:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: elenstev, lm, wli, hugh, adi, scole, support, linux-kernel
Linus Torvalds <torvalds@osdl.org> wrote:
>
> Andrew, the obvious culprit would be the memset() in fs/buffer.c
> (block_write_full_page()
There is one race.
If an application does mmap(MAP_SHARED) of, say, a 2048 byte file and then
extends it:
p = mmap(..., fd, ...);
ftruncate(fd, 4096);
p[3000] = 1;
A racing block_write_full_page() could fail to notice the extended i_size
and would decide to zap those 2048 bytes anyway.
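Spelled out as a complete function, the sequence looks roughly like the sketch below. The scratch path, the function name `extend_under_shared_mapping` and the single-shot structure are assumptions for illustration; one run like this will almost certainly not hit the race, which would need writeback of the page to overlap the ftruncate(). On a correct kernel the byte stored at offset 3000 must read back as 1.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a 2048-byte file, map its first page MAP_SHARED, extend the
 * file to 4096 bytes, then store through the mapping beyond the old EOF.
 * Returns the byte read back from offset 3000, or -1 on a setup error.
 */
static int extend_under_shared_mapping(void)
{
    char path[] = "/tmp/mmap-extend-XXXXXX";   /* hypothetical scratch file */
    char init[2048];
    unsigned char got = 0;
    unsigned char *p = MAP_FAILED;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                 /* the file goes away when fd is closed */

    memset(init, 'a', sizeof init);
    if (write(fd, init, sizeof init) != (ssize_t)sizeof init)
        goto out;

    /* The mapping covers a whole page although the file is only 2048 bytes. */
    p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out;
    assert(p != NULL);

    if (ftruncate(fd, 4096) != 0) /* i_size moves from 2048 to 4096 */
        goto out;
    p[3000] = 1;                  /* now inside the file, past the old EOF */

    if (msync(p, 4096, MS_SYNC) != 0)
        goto out;
    if (pread(fd, &got, 1, 3000) != 1)
        goto out;
    result = got;

out:
    if (p != MAP_FAILED)
        munmap(p, 4096);
    close(fd);
    return result;
}
```

To have any chance of observing the problem, a loop like this would have to run under memory pressure or alongside forced writeback, and the result would then be 0 instead of 1.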
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 7:25 ` Andrew Morton
@ 2004-05-17 7:46 ` Andrew Morton
2004-05-17 8:39 ` Vladimir Saveliev
2004-05-17 11:58 ` Steven Cole
2004-05-17 14:05 ` Larry McVoy
1 sibling, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-17 7:46 UTC (permalink / raw)
To: torvalds, elenstev, lm, wli, hugh, adi, scole, support,
linux-kernel
Andrew Morton <akpm@osdl.org> wrote:
>
> If an application does mmap(MAP_SHARED) of, say, a 2048 byte file and then
> extends it:
>
> p = mmap(..., fd, ...);
> ftruncate(fd, 4096);
> p[3000] = 1;
>
> A racing block_write_full_page() could fail to notice the extended i_size
> and would decide to zap those 2048 bytes anyway.
This should plug it.
diff -puN mm/memory.c~ftruncate-vs-block_write_full_page mm/memory.c
--- 25/mm/memory.c~ftruncate-vs-block_write_full_page 2004-05-17 00:33:07.060231368 -0700
+++ 25-akpm/mm/memory.c 2004-05-17 00:41:00.924193096 -0700
@@ -1208,6 +1208,8 @@ int vmtruncate(struct inode * inode, lof
{
struct address_space *mapping = inode->i_mapping;
unsigned long limit;
+ loff_t i_size;
+ struct page *page;
if (inode->i_size < offset)
goto do_expand;
@@ -1222,8 +1224,22 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out;
- i_size_write(inode, offset);
+ /*
+ * If there is a pagecache page at the current i_size we need to lock
+ * it while modifying i_size to synchronise against
+ * block_write_full_page()'s sampling of i_size. Otherwise
+ * block_write_full_page may decide to memset part of this page after
+ * the application extended the file size.
+ */
+ i_size = inode->i_size; /* don't need i_size_read() due to i_sem */
+ page = NULL;
+ if (i_size & (PAGE_CACHE_SIZE - 1))
+ page = find_lock_page(inode->i_mapping,
+ i_size >> PAGE_CACHE_SHIFT);
+ i_size_write(inode, offset);
+ if (page)
+ unlock_page(page);
out_truncate:
if (inode->i_op && inode->i_op->truncate)
inode->i_op->truncate(inode);
_
The same could happen with a pwrite() in place of ftruncate:
fd = open("2048-byte-file");
p = mmap(..., MAP_SHARED, fd, ...);
pwrite(fd, buf, 1, 4096);
p[3000] = 1;
But I doubt that bk does extending writes() against a file which is
concurrently being modified via MAP_SHARED.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 22:11 ` Steven Cole
2004-05-16 23:53 ` Andrea Arcangeli
@ 2004-05-17 8:21 ` R. J. Wysocki
1 sibling, 0 replies; 68+ messages in thread
From: R. J. Wysocki @ 2004-05-17 8:21 UTC (permalink / raw)
To: Steven Cole, Andrew Morton
Cc: andrea, torvalds, adi, scole, support, linux-kernel
On Monday 17 of May 2004 00:11, Steven Cole wrote:
> On Sunday 16 May 2004 03:29 pm, Andrew Morton wrote:
> > Steven Cole <elenstev@mesatop.com> wrote:
> > > Anyway, although the regression for my particular machine for this
> > > particular load may be interesting, the good news is that I've seen
> > > none of the failures which started this whole thread, which are
> > > relatively easily reproducible with PREEMPT set.
> >
> > So... would it be correct to say that with CONFIG_PREEMPT, ppp or its
> > underlying driver stack
> >
> > a) screws up the connection and hangs and
> >
> > b) scribbles on pagecache?
> >
> > Because if so, the same will probably happen on SMP.
>
> Perhaps someone has the hardware to test this.
Well, this may be OT (I'm sorry, if so), but I ran pppd yesterday on 2.6.6-mm2
with no major problems on an SMP box (AMD64). The only problem I had with it
is that the pppd died unexpectedly 6-7 minutes after the connection had been
established, but this might happen for many reasons. My kernel had been
built without CONFIG_PREEMPT, though.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 7:46 ` Andrew Morton
@ 2004-05-17 8:39 ` Vladimir Saveliev
2004-05-17 8:44 ` Andrew Morton
2004-05-17 11:58 ` Steven Cole
1 sibling, 1 reply; 68+ messages in thread
From: Vladimir Saveliev @ 2004-05-17 8:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
Hello
On Mon, 2004-05-17 at 11:46, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > If an application does mmap(MAP_SHARED) of, say, a 2048 byte file and then
> > extends it:
> >
> > p = mmap(..., fd, ...);
> > ftruncate(fd, 4096);
> > p[3000] = 1;
> >
> > A racing block_write_full_page() could fail to notice the extended i_size
> > and would decide to zap those 2048 bytes anyway.
>
> This should plug it.
>
> diff -puN mm/memory.c~ftruncate-vs-block_write_full_page mm/memory.c
> --- 25/mm/memory.c~ftruncate-vs-block_write_full_page 2004-05-17 00:33:07.060231368 -0700
> +++ 25-akpm/mm/memory.c 2004-05-17 00:41:00.924193096 -0700
> @@ -1208,6 +1208,8 @@ int vmtruncate(struct inode * inode, lof
> {
> struct address_space *mapping = inode->i_mapping;
> unsigned long limit;
> + loff_t i_size;
> + struct page *page;
>
> if (inode->i_size < offset)
> goto do_expand;
> @@ -1222,8 +1224,22 @@ do_expand:
> goto out_sig;
> if (offset > inode->i_sb->s_maxbytes)
> goto out;
> - i_size_write(inode, offset);
>
> + /*
> + * If there is a pagecache page at the current i_size we need to lock
> + * it while modifying i_size to synchronise against
> + * block_write_full_page()'s sampling of i_size. Otherwise
> + * block_write_full_page may decide to memset part of this page after
> + * the application extended the file size.
> + */
Doesn't taking i_sem (the down() calls) in do_truncate and in
generic_file_write take care of this kind of race?
> + i_size = inode->i_size; /* don't need i_size_read() due to i_sem */
> + page = NULL;
> + if (i_size & (PAGE_CACHE_SIZE - 1))
> + page = find_lock_page(inode->i_mapping,
> + i_size >> PAGE_CACHE_SHIFT);
> + i_size_write(inode, offset);
> + if (page)
> + unlock_page(page);
> out_truncate:
> if (inode->i_op && inode->i_op->truncate)
> inode->i_op->truncate(inode);
>
> _
>
> The same could happen with a pwrite() in place of ftruncate:
>
> fd = open("2048-byte-file");
> p = mmap(..., MAP_SHARED, fd, ...);
> pwrite(fd, buf, 1, 4096);
> p[3000] = 1;
>
> But I doubt that bk does extending writes() against a file which is
> concurrently being modified via MAP_SHARED.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 8:39 ` Vladimir Saveliev
@ 2004-05-17 8:44 ` Andrew Morton
0 siblings, 0 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-17 8:44 UTC (permalink / raw)
To: Vladimir Saveliev; +Cc: linux-kernel
Vladimir Saveliev <vs@namesys.com> wrote:
>
> > + /*
> > + * If there is a pagecache page at the current i_size we need to lock
> > + * it while modifying i_size to synchronise against
> > + * block_write_full_page()'s sampling of i_size. Otherwise
> > + * block_write_full_page may decide to memset part of this page after
> > + * the application extended the file size.
> > + */
>
> Doesn't taking i_sem (the down() calls) in do_truncate and in
> generic_file_write take care of this kind of race?
Nope, the only lock which block_write_full_page() can be guaranteed to hold
is the page lock.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 7:46 ` Andrew Morton
2004-05-17 8:39 ` Vladimir Saveliev
@ 2004-05-17 11:58 ` Steven Cole
1 sibling, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-17 11:58 UTC (permalink / raw)
To: Andrew Morton; +Cc: torvalds, lm, wli, hugh, adi, scole, support, linux-kernel
On Monday 17 May 2004 01:46 am, Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> >
> > If an application does mmap(MAP_SHARED) of, say, a 2048 byte file and then
> > extends it:
> >
> > p = mmap(..., fd, ...);
> > ftruncate(fd, 4096);
> > p[3000] = 1;
> >
> > A racing block_write_full_page() could fail to notice the extended i_size
> > and would decide to zap those 2048 bytes anyway.
>
> This should plug it.
I'll test this tonight when I get back to this home machine.
Thanks, and thanks to Bitmover for providing the resources to help chase this bug.
Steven
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 7:25 ` Andrew Morton
2004-05-17 7:46 ` Andrew Morton
@ 2004-05-17 14:05 ` Larry McVoy
1 sibling, 0 replies; 68+ messages in thread
From: Larry McVoy @ 2004-05-17 14:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Linus Torvalds, elenstev, lm, wli, hugh, adi, scole, support,
linux-kernel
On Mon, May 17, 2004 at 12:25:06AM -0700, Andrew Morton wrote:
> Linus Torvalds <torvalds@osdl.org> wrote:
> >
> > Andrew, the obvious culprit would be the memset() in fs/buffer.c
> > (block_write_full_page()
>
> There is one race.
>
> If an application does mmap(MAP_SHARED) of, say, a 2048 byte file and then
> extends it:
>
> p = mmap(..., fd, ...);
> ftruncate(fd, 4096);
> p[3000] = 1;
>
> A racing block_write_full_page() could fail to notice the extended i_size
> and would decide to zap those 2048 bytes anyway.
This isn't our problem, we only read with mmap(). We write with stdio.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 6:11 ` Andrew Morton
@ 2004-05-17 14:07 ` Larry McVoy
2004-05-17 14:12 ` Linus Torvalds
1 sibling, 0 replies; 68+ messages in thread
From: Larry McVoy @ 2004-05-17 14:07 UTC (permalink / raw)
To: Andrew Morton
Cc: Linus Torvalds, elenstev, lm, wli, hugh, adi, scole, support,
linux-kernel
> i_size_write(). But I doubt if bk is using direct-IO in combination with
> MAP_SHARED...
BK doesn't use direct io.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 2:42 ` Linus Torvalds
2004-05-17 3:36 ` Steven Cole
@ 2004-05-17 14:11 ` Larry McVoy
1 sibling, 0 replies; 68+ messages in thread
From: Larry McVoy @ 2004-05-17 14:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry McVoy, Steven Cole, Andrew Morton, adi, scole, support,
linux-kernel
On Sun, May 16, 2004 at 07:42:16PM -0700, Linus Torvalds wrote:
> On Sun, 16 May 2004, Larry McVoy wrote:
> > Be aware that how BK does I/O is with write() on the way out but with
> > mmap on the way in. The process which forked renumber has just written
> > the file and the renumber process is reading it with mmap.
> >
> > If there are still any problems with mixing read/write and mmap then that
> > may be a problem but I would have expected to see things start going
> > wrong on a page boundary and the one core dump I saw was page aligned
> > at the tail but not at the head, it started in the middle of the page.
>
> The kernel should have no problems with mixed read/write and mmap usage,
> although user space obviously needs to synchronize the accesses on its own
> some way. There is no implicit synchronization otherwise, and the mmap
> user can see a partial write at any stage of the write.
You can strace BK and see what it does but I'll save you the trouble.
We never hold a mapping open to a file being written because we never
rewrite a file in place (that's a really bad thing for an SCM to do).
What we do is to write the file to SCCS/x.<filename> and then when it is
written we rename it to SCCS/s.<filename>. Any process which wants to
map it is either going to get the old s.<filename> or the new s.<filename>
but there is no chance that we are extending the file while someone has it
mapped. Famous last words and all that notwithstanding, that's my belief.
So unless I'm more dimwitted than normal we don't have any synchronization
problems by design.
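
A minimal sketch of the write-then-rename scheme described above, assuming
plain stdio and POSIX rename(). The helper name write_then_rename() and its
error handling are hypothetical and are not taken from BK's source; the
point is only that the final name never refers to a half-written file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not BK's code: write `len` bytes to `tmp_path`
 * with stdio, close it, then rename it over `final_path`.
 * Returns 0 on success, -1 on failure.
 */
static int write_then_rename(const char *tmp_path, const char *final_path,
			     const void *data, size_t len)
{
	FILE *f;

	assert(tmp_path && final_path && data);

	f = fopen(tmp_path, "wb");
	if (!f)
		return -1;
	if (fwrite(data, 1, len, f) != len) {
		fclose(f);
		remove(tmp_path);
		return -1;
	}
	/* fclose() flushes the stdio buffer; the tmp file is complete after this. */
	if (fclose(f) != 0) {
		remove(tmp_path);
		return -1;
	}
	/* Readers opening final_path see either the old file or the new one. */
	if (rename(tmp_path, final_path) != 0) {
		remove(tmp_path);
		return -1;
	}
	return 0;
}
```

A reader that mmap()s the final name only after the rename can never have
the file mapped while it is being extended, which is the property claimed
above.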
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 6:11 ` Andrew Morton
2004-05-17 14:07 ` Larry McVoy
@ 2004-05-17 14:12 ` Linus Torvalds
1 sibling, 0 replies; 68+ messages in thread
From: Linus Torvalds @ 2004-05-17 14:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: elenstev, lm, wli, hugh, adi, scole, support, linux-kernel
On Sun, 16 May 2004, Andrew Morton wrote:
>
> i_size is updated in generic_commit_write(), on a per-page basis, or I'm
> missing something? I sure hope so.
It's updated AFTER we drop the page lock. It's not enough to do it
page-per-page, you have to do it protected by the only thing that
block_write_full_page() sees, namely the page lock.
So what can happen is
	copy data from user space to page
	commit_write()
	unlock_page()
	**preemption happens**
		fsync
		block_write_full_page()
			doesn't see the new i_size, clears the data
	**preempt back**
	update i_size.
end result: zeroes where the process wrote stuff.
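
The interleaving above can be replayed deterministically in user space.
This is a toy model under stated assumptions, not kernel code: one 4096-byte
page, a plain integer standing in for i_size, and hypothetical model_*()
functions standing in for the copy, the writeback memset and the size
update. No real locking or preemption is involved; the "race" is just the
order in which the steps are called.

```c
#include <assert.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size */

/* Toy stand-in for one page-cache page plus the inode size. */
struct model_file {
	unsigned char page[MODEL_PAGE_SIZE];
	unsigned long i_size;	/* file size as writeback currently sees it */
};

/* Writer, step 1: copy data into the page (i_size not yet updated). */
static void model_copy_to_page(struct model_file *f, unsigned long pos,
			       const void *buf, unsigned long len)
{
	assert(pos + len <= MODEL_PAGE_SIZE);
	memcpy(f->page + pos, buf, len);
}

/* Writer, step 2: publish the new size. */
static void model_update_i_size(struct model_file *f, unsigned long new_size)
{
	if (new_size > f->i_size)
		f->i_size = new_size;
}

/* Writeback: zero everything in the page past whatever i_size it sees. */
static void model_writepage(struct model_file *f)
{
	unsigned long offset = f->i_size % MODEL_PAGE_SIZE;

	if (offset)
		memset(f->page + offset, 0, MODEL_PAGE_SIZE - offset);
}

/*
 * Replay one write with a writeback in the middle. Returns how many of the
 * newly written bytes were zeroed. If update_size_first is 0, writeback
 * runs before i_size is updated (the ordering described above).
 */
static unsigned long model_run(int update_size_first)
{
	static const char data[] = "new data";
	unsigned long len = sizeof data - 1, pos = 0xb01, lost = 0, i;
	struct model_file f;

	memset(&f, 0, sizeof f);
	memset(f.page, 'x', pos);	/* existing file contents */
	f.i_size = pos;

	model_copy_to_page(&f, pos, data, len);
	if (update_size_first)
		model_update_i_size(&f, pos + len);
	model_writepage(&f);		/* the fsync that sneaks in */
	model_update_i_size(&f, pos + len);

	for (i = 0; i < len; i++)
		if (f.page[pos + i] != (unsigned char)data[i])
			lost++;
	return lost;
}
```

In this model, model_run(0) loses every newly written byte and model_run(1)
loses none, which is the whole argument for updating i_size while the page
is still locked.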
> As for O_DIRECT: I need to think about that a bit more. We hold i_sem and
> have done an fdatasync prior to entering generic_file_aio_write_nolock() so
> there should be no dirty pagecache at this stage anyway.
That just hides the race.
> But I doubt if bk is using direct-IO in combination with MAP_SHARED...
Absolutely. I think the bug happens for the regular case, simply because
nobody is even _using_ direct-IO. But in direct-IO, the race is about a
million times bigger, because we won't actually update i_size until much
much later.
Linus
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 5:17 ` Linus Torvalds
2004-05-17 6:11 ` Andrew Morton
2004-05-17 7:25 ` Andrew Morton
@ 2004-05-17 14:14 ` Larry McVoy
2004-05-17 14:32 ` Linus Torvalds
2 siblings, 1 reply; 68+ messages in thread
From: Larry McVoy @ 2004-05-17 14:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Steven Cole, Larry McVoy, Andrew Morton, William Lee Irwin III,
hugh, adi, scole, support, Kernel Mailing List
On Sun, May 16, 2004 at 10:17:58PM -0700, Linus Torvalds wrote:
> > Found null start 0x1550b01 end 0x1551000 len 0x4ff line 535587
> > Found null start 0x2030b01 end 0x2031000 len 0x4ff line 639039
> > Found null start 0x2330b01 end 0x2331000 len 0x4ff line 663611
>
> The fact that it's always zeroes, and it's a strange number but it always
> ends up being page-aligned at the _end_ makes me strongly suspect that we
> have one of the "don't write back data past i_size" things wrong.
Isn't it weird that it is starting at 0xb01 and has the same length at
three different offsets? That's a definite pattern and might be a clue.
And note that the weird starting offset plus the length is a page size.
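
The arithmetic can be checked mechanically. A small sketch, assuming a
4096-byte page; the function names are made up for illustration:

```c
#include <assert.h>

#define PAGE_SZ 0x1000ul	/* assumed 4 KiB page */

/* Offset within its page at which a NUL run starts. */
static unsigned long run_page_offset(unsigned long start)
{
	return start % PAGE_SZ;
}

/*
 * True if the run [start, end) begins in the middle of a page and stops
 * exactly at the end of that same page.
 */
static int run_fills_page_tail(unsigned long start, unsigned long end)
{
	unsigned long off = start % PAGE_SZ;

	assert(end > start);
	return off != 0 && end % PAGE_SZ == 0 && end - start == PAGE_SZ - off;
}
```

For the three runs reported above, run_page_offset() gives 0xb01 each time
and 0xb01 + 0x4ff = 0x1000, so run_fills_page_tail() holds for all of them.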
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 14:14 ` Larry McVoy
@ 2004-05-17 14:32 ` Linus Torvalds
2004-05-17 14:52 ` Larry McVoy
0 siblings, 1 reply; 68+ messages in thread
From: Linus Torvalds @ 2004-05-17 14:32 UTC (permalink / raw)
To: Larry McVoy
Cc: Steven Cole, Andrew Morton, William Lee Irwin III, hugh, adi,
scole, support, Kernel Mailing List
On Mon, 17 May 2004, Larry McVoy wrote:
>
> On Sun, May 16, 2004 at 10:17:58PM -0700, Linus Torvalds wrote:
> > > Found null start 0x1550b01 end 0x1551000 len 0x4ff line 535587
> > > Found null start 0x2030b01 end 0x2031000 len 0x4ff line 639039
> > > Found null start 0x2330b01 end 0x2331000 len 0x4ff line 663611
> >
> > The fact that it's always zeroes, and it's a strange number but it always
> > ends up being page-aligned at the _end_ makes me strongly suspect that we
> > have one of the "don't write back data past i_size" things wrong.
>
> Isn't it weird that it is starting at 0xb01 and has the same length at
> three different offsets? That's a definite pattern and might be a clue.
Note that in the previous case, it was 1352 bytes of NUL (according to the
subject), now it's 1279. So it's only consistently the same offset for one
particular run, and changes between cases.
I agree that it isn't just random, though:
> And note that the weird starting offset plus the length is a page size.
My claim (which may be bogus, but isn't), is that you wrote the file with
a buffered interface like stdio (or your own buffers), and that the buffer
size is likely a nice round number like 8kB or something. I think that's
what stdio uses by default.
And at some point earlier in the process you did an fflush(), or somebody
else had written a header of n*PAGE_SIZE + 0x4ff bytes, or something like
that. Since this was the ChangeSet file, I suspect that the "header" is
the checkin-comment section at the beginning, and the "second phase" is
the actual key list thing. You know how you write the ChangeSet file
better than I do.
So what happens is that for _that_ run of BK, the ChangeSet file will
first have an i_size that is always at an even page offset (when it is
writing the first phase, buffered), and then in the second phase i_size
will always be "fixed offset + n*BUFFER_SIZE".
And in this case the fixed offset happened to be 0x4ff. Last time it was
something else.
The bug I think triggered only happens when i_size is not on a page
boundary, so the first phase will never see a zeroed area. And the second
phase will always see a zeroed area that starts at the same offset.
So this would explain it, except I just noticed that the unlock_page()
already _is_ way down after we update the i_size. Oh well.
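
Whatever the locking turns out to be, the i_size arithmetic above is easy
to sanity-check. A sketch, assuming a 4096-byte page and an 8192-byte stdio
buffer (the exact buffer size is a guess), and a hypothetical header length
whose in-page offset is 0xb01:

```c
#include <assert.h>

#define PAGE_SZ      4096ul
#define STDIO_BUF_SZ 8192ul	/* assumed; "8kB or something" */

/* i_size after the header flush plus n full stdio-buffer writes. */
static unsigned long i_size_after(unsigned long header_len, unsigned long n)
{
	/* The argument only works if the buffer is a whole number of pages. */
	assert(STDIO_BUF_SZ % PAGE_SZ == 0);
	return header_len + n * STDIO_BUF_SZ;
}

/* Bytes between i_size and the end of its page; 0 if page-aligned. */
static unsigned long tail_to_page_end(unsigned long i_size)
{
	unsigned long off = i_size % PAGE_SZ;

	return off ? PAGE_SZ - off : 0;
}
```

With a header ending at in-page offset 0xb01, tail_to_page_end() is 0x4ff
(1279) after every buffer write, matching this run. A different header
length gives a different but equally constant tail, which fits the 1352
bytes seen earlier.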
Linus
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 14:32 ` Linus Torvalds
@ 2004-05-17 14:52 ` Larry McVoy
2004-05-17 15:02 ` Linus Torvalds
0 siblings, 1 reply; 68+ messages in thread
From: Larry McVoy @ 2004-05-17 14:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry McVoy, Steven Cole, Andrew Morton, William Lee Irwin III,
hugh, adi, scole, support, Kernel Mailing List
On Mon, May 17, 2004 at 07:32:44AM -0700, Linus Torvalds wrote:
> My claim (which may be bogus, but isn't), is that you wrote the file with
> a buffered interface like stdio (or your own buffers), and that the buffer
> size is likely a nice round number like 8kB or something. I think that's
> what stdio uses by default.
Yup, that's what we do.
> And at some point earlier in the process you did an fflush(), or somebody
> else had written a header of n*PAGE_SIZE + 0x4ff bytes, or something like
> that. Since this was the ChangeSet file, I suspect that the "header" is
> the checkin-comment section at the beginning, and the "second phase" is
> the actual key list thing. You know how you write the ChangeSet file
> better than I do.
I don't think we flush along the way but let me look. Whoops, you're right,
we do. Right where you thought too. But that doesn't explain there being
3 blocks of nulls (there should NEVER be a null in the s.ChangeSet file, we
don't compress that, it's always ascii).
But the bigger problem is that you are missing the point that I mentioned
elsewhere, we are writing to a tmp file, the tmp file is NOT mmapped.
We mmap only after that file is closed and renamed. We don't map the
tmp file.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 14:52 ` Larry McVoy
@ 2004-05-17 15:02 ` Linus Torvalds
2004-05-17 15:05 ` Larry McVoy
2004-05-17 15:23 ` Chris Mason
0 siblings, 2 replies; 68+ messages in thread
From: Linus Torvalds @ 2004-05-17 15:02 UTC (permalink / raw)
To: Larry McVoy
Cc: Steven Cole, Andrew Morton, William Lee Irwin III, hugh, adi,
scole, support, Kernel Mailing List
On Mon, 17 May 2004, Larry McVoy wrote:
>
> > And at some point earlier in the process you did an fflush(), or somebody
> > else had written a header of n*PAGE_SIZE + 0x4ff bytes, or something like
> > that. Since this was the ChangeSet file, I suspect that the "header" is
> > the checkin-comment section at the beginning, and the "second phase" is
> > the actual key list thing. You know how you write the ChangeSet file
> > better than I do.
>
> I don't think we flush along the way but let me look. Whoops, you're right,
> we do. Right where you thought too. But that doesn't explain there being
> 3 blocks of nulls (there should NEVER be a null in the s.ChangeSet file, we
> don't compress that, it's always ascii).
No, no, I'm not claiming that _you_ are writing the NUL bytes. I'm
claiming the kernel has a bug that triggers with non-page-aligned starting
offsets of writes (because we clear the bytes after "i_size", and we don't
synchronize those clears sufficiently), and concurrent flushes. There's
probably some other trigger needed too (CONFIG_PREEMPT being just the
thing that uncovers the race).
> But the bigger problem is that you are missing the point that I mentioned
> elsewhere, we are writing to a tmp file, the tmp file is NOT mmapped.
No, the mmap thing was Andrew's theory. My theory is that regular
"write()" calls can trigger it through the "commit_write()" function.
Of course, my theory also depended on a page unlock happening in a place
where it didn't actually happen, so the exact details of my theory are
crap. I'll need to re-think that part.
Linus
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 15:02 ` Linus Torvalds
@ 2004-05-17 15:05 ` Larry McVoy
2004-05-17 15:23 ` Chris Mason
1 sibling, 0 replies; 68+ messages in thread
From: Larry McVoy @ 2004-05-17 15:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry McVoy, Steven Cole, Andrew Morton, William Lee Irwin III,
hugh, adi, scole, support, Kernel Mailing List
On Mon, May 17, 2004 at 08:02:48AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 17 May 2004, Larry McVoy wrote:
> >
> > > And at some point earlier in the process you did an fflush(), or somebody
> > > else had written a header of n*PAGE_SIZE + 0x4ff bytes, or something like
> > > that. Since this was the ChangeSet file, I suspect that the "header" is
> > > the checkin-comment section at the beginning, and the "second phase" is
> > > the actual key list thing. You know how you write the ChangeSet file
> > > better than I do.
> >
> > I don't think we flush along the way but let me look. Whoops, you're right,
> > we do. Right where you thought too. But that doesn't explain there being
> > 3 blocks of nulls (there should NEVER be a null in the s.ChangeSet file, we
> > don't compress that, it's always ascii).
>
> No, no, I'm not claiming that _you_ are writing the NUL bytes. I'm
Yes, I know that. But you had a theory that depended on flush and other
than the flush at the header/data boundary I don't see one until we are
done and fclose() the file. So if you were counting on 3 fflush() calls
I don't think that is happening (an ltrace would tell us, tell me if you
want to know for sure and I'll check).
> > But the bigger problem is that you are missing the point that I mentioned
> > elsewhere, we are writing to a tmp file, the tmp file is NOT mmapped.
>
> No, the mmap thing was Andrew's theory. My theory is that regular
> "write()" calls can trigger it through the "commit_write()" function.
OK.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 15:02 ` Linus Torvalds
2004-05-17 15:05 ` Larry McVoy
@ 2004-05-17 15:23 ` Chris Mason
2004-05-17 15:49 ` Steven Cole
2004-05-17 20:24 ` Chris Mason
1 sibling, 2 replies; 68+ messages in thread
From: Chris Mason @ 2004-05-17 15:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry McVoy, Steven Cole, Andrew Morton, William Lee Irwin III,
hugh, adi, scole, support, Kernel Mailing List
On Mon, 2004-05-17 at 11:02, Linus Torvalds wrote:
> On Mon, 17 May 2004, Larry McVoy wrote:
> >
> > > And at some point earlier in the process you did an fflush(), or somebody
> > > else had written a header of n*PAGE_SIZE + 0x4ff bytes, or something like
> > > that. Since this was the ChangeSet file, I suspect that the "header" is
> > > the checkin-comment section at the beginning, and the "second phase" is
> > > the actual key list thing. You know how you write the ChangeSet file
> > > better than I do.
> >
> > I don't think we flush along the way but let me look. Whoops, you're right,
> > we do. Right where you thought too. But that doesn't explain there being
> > 3 blocks of nulls (there should NEVER be a null in the s.ChangeSet file, we
> > don't compress that, it's always ascii).
>
> No, no, I'm not claiming that _you_ are writing the NUL bytes. I'm
> claiming the kernel has a bug that triggers with non-page-aligned starting
> offsets of writes (because we clear the bytes after "i_size", and we don't
> synchronize those clears sufficiently), and concurrent flushes. There's
> probably some other trigger needed too (CONFIG_PREEMPT being just the
> thing that uncovers the race).
>
> > But the bigger problem is that you are missing the point that I mentioned
> > elsewhere, we are writing to a tmp file, the tmp file is NOT mmapped.
>
> No, the mmap thing was Andrew's theory. My theory is that regular
> "write()" calls can trigger it through the "commit_write()" function.
>
> Of course, my theory also depended on a page unlock happening in a place
> where it didn't actually happen, so the exact details of my theory are
> crap. I'll need to re-think that part.
You've described it correctly for reiserfs though, we unlock the page
too soon. I'll fix the page locking for reiserfs_file_write. Steven,
we need to figure out why you're seeing this on ext3.
The two filesystems don't share much code for the normal write path, and
I don't see how you can trigger this on ext3 without truncate jumping
into the fun.
-chris
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
@ 2004-05-17 15:35 Albert Cahalan
0 siblings, 0 replies; 68+ messages in thread
From: Albert Cahalan @ 2004-05-17 15:35 UTC (permalink / raw)
To: linux-kernel mailing list
Cc: lm, Andrew Morton OSDL, adi, scole, Linus Torvalds
Larry McVoy writes:
> But the bigger problem is that you are missing the point
> that I mentioned elsewhere, we are writing to a tmp file,
> the tmp file is NOT mmapped. We mmap only after that file
> is closed and renamed. We don't map the tmp file.
Recent glibc versions will sometimes use mmap() for stdio.
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 15:23 ` Chris Mason
@ 2004-05-17 15:49 ` Steven Cole
2004-05-17 20:24 ` Chris Mason
1 sibling, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-17 15:49 UTC (permalink / raw)
To: Chris Mason
Cc: Steven Cole, Andrew Morton, hugh, William Lee Irwin III,
Larry McVoy, support, Linus Torvalds, adi, Kernel Mailing List
On May 17, 2004, at 9:23 AM, Chris Mason wrote:
> On Mon, 2004-05-17 at 11:02, Linus Torvalds wrote:
>> Of course, my theory also depended on a page unlock happening in a place
>> where it didn't actually happen, so the exact details of my theory are
>> crap. I'll need to re-think that part.
>
> You've described it correctly for reiserfs though, we unlock the page
> too soon. I'll fix the page locking for reiserfs_file_write. Steven,
> we need to figure out why you're seeing this on ext3.
I'll have to wait until tonight since I've only been able to trigger this
using dialup, and I don't have that here at work. The failures were only
seen with PREEMPT of course, and very early in the testing I ran the
test on ext3 too, to exonerate reiserfs as a primary cause. However, I
did not examine or preserve the RESYNC/SCCS/s.ChangeSet file on ext3
since I didn't know what the issues were at that early stage of testing.
So, I don't really know any details of the failure on ext3, apart from
the most superficial symptoms.
>
> The two filesystems don't share much code for the normal write path, and
> I don't see how you can trigger this on ext3 without truncate jumping
> into the fun.
>
> -chris
The "Assertion `s && s->tree' failed" happened fairly quickly (on the
3rd bk pull) on reiserfs with PREEMPT enabled. I'll let you know
how ext3 behaves later tonight.
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 15:23 ` Chris Mason
2004-05-17 15:49 ` Steven Cole
@ 2004-05-17 20:24 ` Chris Mason
2004-05-17 21:08 ` Steven Cole
1 sibling, 1 reply; 68+ messages in thread
From: Chris Mason @ 2004-05-17 20:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Larry McVoy, Steven Cole, Andrew Morton, William Lee Irwin III,
hugh, adi, scole, support, Kernel Mailing List
On Mon, 2004-05-17 at 11:23, Chris Mason wrote:
> You've described it correctly for reiserfs though, we unlock the page
> too soon. I'll fix the page locking for reiserfs_file_write. Steven,
> we need to figure out why you're seeing this on ext3.
Steven, could you give this a try as well? It is against 2.6.6-mm3, but
should work against vanilla too:
reiserfs_file_write unlocks the pages it operated on before updating
i_size. This can lead to races with writepage, who checks i_size when
deciding how much of the file to zero out.
This patch also replaces SetPageReferenced with mark_page_accessed() in
reiserfs_file_write
Index: linux.mm/fs/reiserfs/file.c
===================================================================
--- linux.mm.orig/fs/reiserfs/file.c 2004-05-17 13:42:02.000000000 -0400
+++ linux.mm/fs/reiserfs/file.c 2004-05-17 16:22:35.135105528 -0400
@@ -10,6 +10,7 @@
#include <linux/smp_lock.h>
#include <asm/uaccess.h>
#include <linux/pagemap.h>
+#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
@@ -678,10 +679,6 @@
// we only remember error status to report it on
// exit.
write_bytes-=count;
- SetPageReferenced(page);
- unlock_page(page); // We unlock the page as it was locked by earlier call
- // to grab_cache_page
- page_cache_release(page);
}
/* now that we've gotten all the ordered buffers marked dirty,
* we can safely update i_size and close any running transaction
@@ -718,6 +715,17 @@
reiserfs_write_unlock(inode->i_sb);
}
th->t_trans_id = 0;
+
+ /*
+ * we have to unlock the pages after updating i_size, otherwise
+ * we race with writepage
+ */
+ for ( i = 0; i < num_pages ; i++) {
+ struct page *page=prepared_pages[i];
+ unlock_page(page);
+ mark_page_accessed(page);
+ page_cache_release(page);
+ }
return retval;
}
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 20:24 ` Chris Mason
@ 2004-05-17 21:08 ` Steven Cole
2004-05-17 21:29 ` Andrew Morton
0 siblings, 1 reply; 68+ messages in thread
From: Steven Cole @ 2004-05-17 21:08 UTC (permalink / raw)
To: Chris Mason
Cc: Linus Torvalds, Larry McVoy, Andrew Morton, William Lee Irwin III,
hugh, adi, support, Kernel Mailing List
On Mon, 2004-05-17 at 14:24, Chris Mason wrote:
> On Mon, 2004-05-17 at 11:23, Chris Mason wrote:
>
> > You've described it correctly for reiserfs though, we unlock the page
> > too soon. I'll fix the page locking for reiserfs_file_write. Steven,
> > we need to figure out why you're seeing this on ext3.
>
> Steven, could you give this a try as well? It is against 2.6.6-mm3, but
> should work against vanilla too:
>
> reiserfs_file_write unlocks the pages it operated on before updating
> i_size. This can lead to races with writepage, who checks i_size when
> deciding how much of the file to zero out.
>
> This patch also replaces SetPageReferenced with mark_page_accessed() in
> reiserfs_file_write
>
> Index: linux.mm/fs/reiserfs/file.c
OK, my plan is to do this:
1) Apply your patch to 2.6.6-current, build with PREEMPT
2) Test bk pull via ppp on reiserfs until and if it breaks.
3) Test bk pull via ppp on ext3 and take a look at the s.ChangeSet file
if/when the failure occurs.
4) Apply akpm's patch here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=108478018304305&w=2
5) Repeat 2,3
Here I'm defining 2.6.6-current as what was current at around midnight
last night. I'll keep that source tree as a constant.
I'll post the results either late tonight or tomorrow.
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 21:08 ` Steven Cole
@ 2004-05-17 21:29 ` Andrew Morton
2004-05-17 22:15 ` Steven Cole
2004-05-17 23:52 ` Steven Cole
0 siblings, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-17 21:29 UTC (permalink / raw)
To: Steven Cole; +Cc: mason, torvalds, lm, wli, hugh, adi, support, linux-kernel
Steven Cole <elenstev@mesatop.com> wrote:
>
> 1) Apply your patch to 2.6.6-current, build with PREEMPT
> 2) Test bk pull via ppp on reiserfs until and if it breaks.
> 3) Test bk pull via ppp on ext3 and take a look at the s.ChangeSet file
> if/when the failure occurs.
> 4) Apply akpm's patch here:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=108478018304305&w=2
> 5) Repeat 2,3
Nope. Please just see if this makes the problem go away:
--- 25/fs/buffer.c~a Mon May 17 14:28:51 2004
+++ 25-akpm/fs/buffer.c Mon May 17 14:29:02 2004
@@ -2723,7 +2723,6 @@ int block_write_full_page(struct page *p
* writes to that region are not written out to the file."
*/
kaddr = kmap_atomic(page, KM_USER0);
- memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
return __block_write_full_page(inode, page, get_block, wbc);
_
If this patch is confirmed to fix things up, then and only then should you
bother testing the vmtruncate patch.
Thanks.
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 21:29 ` Andrew Morton
@ 2004-05-17 22:15 ` Steven Cole
2004-05-17 23:52 ` Steven Cole
1 sibling, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-17 22:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: mason, torvalds, lm, wli, hugh, adi, support, linux-kernel
On Monday 17 May 2004 03:29 pm, Andrew Morton wrote:
> Steven Cole <elenstev@mesatop.com> wrote:
> >
> > 1) Apply your patch to 2.6.6-current, build with PREEMPT
> > 2) Test bk pull via ppp on reiserfs until and if it breaks.
> > 3) Test bk pull via ppp on ext3 and take a look at the s.ChangeSet file
> > if/when the failure occurs.
> > 4) Apply akpm's patch here:
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=108478018304305&w=2
> > 5) Repeat 2,3
>
> Nope. Please just see if this makes the problem go away:
>
> --- 25/fs/buffer.c~a Mon May 17 14:28:51 2004
> +++ 25-akpm/fs/buffer.c Mon May 17 14:29:02 2004
> @@ -2723,7 +2723,6 @@ int block_write_full_page(struct page *p
> * writes to that region are not written out to the file."
> */
> kaddr = kmap_atomic(page, KM_USER0);
> - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
> flush_dcache_page(page);
> kunmap_atomic(kaddr, KM_USER0);
> return __block_write_full_page(inode, page, get_block, wbc);
>
> _
>
> If this patch is confirmed to fix things up, then and only then should you
> bother testing the vmtruncate patch.
>
> Thanks.
>
>
Thank you very much Andrew. Building now.
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 21:29 ` Andrew Morton
2004-05-17 22:15 ` Steven Cole
@ 2004-05-17 23:52 ` Steven Cole
2004-05-18 0:03 ` Chris Mason
2004-05-18 0:13 ` Andrew Morton
1 sibling, 2 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-17 23:52 UTC (permalink / raw)
To: Andrew Morton; +Cc: mason, torvalds, lm, wli, hugh, adi, support, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2762 bytes --]
On Monday 17 May 2004 03:29 pm, Andrew Morton wrote:
> Steven Cole <elenstev@mesatop.com> wrote:
> >
> > 1) Apply your patch to 2.6.6-current, build with PREEMPT
> > 2) Test bk pull via ppp on reiserfs until and if it breaks.
> > 3) Test bk pull via ppp on ext3 and take a look at the s.ChangeSet file
> > if/when the failure occurs.
> > 4) Apply akpm's patch here:
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=108478018304305&w=2
> > 5) Repeat 2,3
>
> Nope. Please just see if this makes the problem go away:
>
> --- 25/fs/buffer.c~a Mon May 17 14:28:51 2004
> +++ 25-akpm/fs/buffer.c Mon May 17 14:29:02 2004
> @@ -2723,7 +2723,6 @@ int block_write_full_page(struct page *p
> * writes to that region are not written out to the file."
> */
> kaddr = kmap_atomic(page, KM_USER0);
> - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
> flush_dcache_page(page);
> kunmap_atomic(kaddr, KM_USER0);
> return __block_write_full_page(inode, page, get_block, wbc);
>
> _
>
> If this patch is confirmed to fix things up, then and only then should you
> bother testing the vmtruncate patch.
>
> Thanks.
>
>
OK, applied your one-liner above with PREEMPT.
Pull bk://linux.bkbits.net/linux-2.5
-> file://home/steven/BK/save-2.6
---------------------- Receiving the following csets -----------------------
1.1727 1.1726 1.1725 1.1626.1.10 1.1626.1.9 1.1626.1.8 1.1626.1.7
1.1612.11.1 1.1371.746.12 1.1371.746.11 1.1371.746.10 1.1371.746.9
1.1371.746.8 1.1371.746.7 1.1371.746.6 1.1371.746.5 1.1371.746.4
1.1371.746.3 1.1371.746.2 1.1371.746.1
----------------------------------------------------------------------------
ChangeSet: 20 deltas
[snipped list of files]
---------------------------------------------------------------------------
takepatch: saved entire patch in PENDING/2004-05-17.01
---------------------------------------------------------------------------
Applying 20 revisions to ChangeSet renumber: can't read SCCS info in "RESYNC/SCCS/s.ChangeSet" .
bk: takepatch.c:1343: applyCsetPatch: Assertion `s && s->tree' failed.
11760 bytes uncompressed to 57721, 4.91X expansion
[steven@spc save-2.6]$ exit
Script done, file is test1
[steven@spc save-2.6]$ saga <RESYNC/SCCS/s.ChangeSet
Found null start 0xfb259a end 0xfb3000 len 0xa66 line 478846
The above was on reiserfs and happened on the very first pull.
Attaching the source of saga.c for reference.
So, what next doc? Back out that one-liner and try your vmtruncate?
Or try Chris' patch for reiserfs?
At the moment I'm testing on ext3, which survived the two pull/unpulls.
This is like watching paint dry.
I'll do some more bk unpull and bk pull cycles until this breaks on ext3.
Steven
[-- Attachment #2: saga.c --]
[-- Type: text/x-csrc, Size: 388 bytes --]
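#include <stdio.h>

/* Scan stdin and report every run of NUL bytes (start, exclusive end, length). */
int main(void)
{
	int c;
	long where = -1;	/* offset of the byte just read */
	long start = 0;		/* offset where the current NUL run began */
	int line = 0;		/* newlines seen so far */
	int null = 0;		/* nonzero while inside a NUL run */

	while ((c = getchar()) != EOF) {
		where++;
		if (c == '\n') line++;
		if (c && null) {
			/* First non-NUL byte after a run ends the run. */
			fprintf(stderr,
			    "Found null start 0x%lx end 0x%lx len 0x%lx line %d\n",
			    start, where, where - start, line);
		}
		if (c) {null = 0; continue;}
		if (null) continue;
		start = where;
		null = 1;
	}
	if (null) {
		/* The file ended inside a NUL run; report it as well. */
		fprintf(stderr,
		    "Found null start 0x%lx end 0x%lx len 0x%lx line %d\n",
		    start, where + 1, where + 1 - start, line);
	}
	return 0;
}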
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 23:52 ` Steven Cole
@ 2004-05-18 0:03 ` Chris Mason
2004-05-18 0:15 ` Andrew Morton
2004-05-18 0:13 ` Andrew Morton
1 sibling, 1 reply; 68+ messages in thread
From: Chris Mason @ 2004-05-18 0:03 UTC (permalink / raw)
To: Steven Cole
Cc: Andrew Morton, torvalds, lm, wli, hugh, adi, support,
linux-kernel
On Mon, 2004-05-17 at 19:52, Steven Cole wrote:
>
> OK, applied your one-liner above with PREEMPT.
> Found null start 0xfb259a end 0xfb3000 len 0xa66 line 478846
>
> The above was on reiserfs and happened on the very first pull.
>
His one liner won't change any reiserfs code paths. If you're testing
on ext3, now, just keep going there.
-chris
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-17 23:52 ` Steven Cole
2004-05-18 0:03 ` Chris Mason
@ 2004-05-18 0:13 ` Andrew Morton
2004-05-18 0:45 ` Steven Cole
2004-05-18 1:34 ` Larry McVoy
1 sibling, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-18 0:13 UTC (permalink / raw)
To: Steven Cole; +Cc: mason, torvalds, lm, wli, hugh, adi, support, linux-kernel
Steven Cole <elenstev@mesatop.com> wrote:
>
> >
> OK, applied your one-liner above with PREEMPT.
>
> ...
> Found null start 0xfb259a end 0xfb3000 len 0xa66 line 478846
>
> The above was on reiserfs and happened on the very first pull.
>
> Attaching the source of saga.c for reference.
ok, thanks.
> So, what next doc? Back out that one-liner and try your vmtruncate?
No, it won't help in that case.
> Or try Chris' patch for reiserfs?
>
> At the moment I'm testing on ext3, which survived the two pull/unpulls.
> This is like watching paint dry.
>
> I'll do some more bk unpull and bk pull cycles until this breaks on ext3.
I guess it would be interesting to run it on a filesystem which has 2k or
even 1k blocksize. If the corruption then terminates on a 2k- or
1k-boundary then that will rule out a few culprits.
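
One way to read the result of such a test is to ask which power-of-two
boundary the end of the NUL run falls on. A sketch with a hypothetical
helper; the interpretation in the note after it is an assumption, not
something established in this thread:

```c
#include <assert.h>

/*
 * Largest power of two, capped at `cap`, that divides `offset`.
 * `cap` must itself be a power of two. Returns 1 for odd offsets.
 */
static unsigned long largest_alignment(unsigned long offset, unsigned long cap)
{
	unsigned long a = 1;

	assert(cap != 0 && (cap & (cap - 1)) == 0);
	while (a < cap && offset % (a * 2) == 0)
		a *= 2;
	return a;
}
```

The run reported so far ends at 0xfb3000, for which
largest_alignment(0xfb3000, 4096) is 4096. On a 1k-blocksize filesystem, an
end offset that is only 1024-aligned would presumably point at per-block
code, while one that is still 4096-aligned would point at per-page code.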
I'd really like to see this happen on some other machine though. It'd be
funny if you have a dud disk drive or something.
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-18 0:03 ` Chris Mason
@ 2004-05-18 0:15 ` Andrew Morton
0 siblings, 0 replies; 68+ messages in thread
From: Andrew Morton @ 2004-05-18 0:15 UTC (permalink / raw)
To: Chris Mason; +Cc: elenstev, torvalds, lm, wli, hugh, adi, support, linux-kernel
Chris Mason <mason@suse.com> wrote:
>
> On Mon, 2004-05-17 at 19:52, Steven Cole wrote:
> >
> > OK, applied your one-liner above with PREEMPT.
> > Found null start 0xfb259a end 0xfb3000 len 0xa66 line 478846
> >
> > The above was on reiserfs and happened on the very first pull.
> >
> His one liner won't change any reiserfs code paths. If you're testing
> on ext3, now, just keep going there.
>
Good point.
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-18 0:13 ` Andrew Morton
@ 2004-05-18 0:45 ` Steven Cole
2004-05-18 1:34 ` Larry McVoy
1 sibling, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-18 0:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: mason, torvalds, lm, wli, hugh, adi, support, linux-kernel
On Monday 17 May 2004 06:13 pm, Andrew Morton wrote:
> Steven Cole <elenstev@mesatop.com> wrote:
> >
> > >
> > OK, applied your one-liner above with PREEMPT.
> >
> > ...
> > Found null start 0xfb259a end 0xfb3000 len 0xa66 line 478846
> >
> > The above was on reiserfs and happened on the very first pull.
> >
> > Attaching the source of saga.c for reference.
>
> ok, thanks.
>
> > So, what next doc? Back out that one-liner and try your vmtruncate?
>
> No, it won't help in that case.
>
> > Or try Chris' patch for reiserfs?
> >
> > At the moment I'm testing on ext3, which survived the two pull/unpulls.
> > This is like watching paint dry.
> >
> > I'll do some more bk unpull and bk pull cycles until this breaks on ext3.
>
> I guess it would be interesting to run it on a filesystem which has 2k or
> even 1k blocksize. If the corruption then terminates on a 2k- or
> 1k-boundary then that will rule out a few culprits.
>
> I'd really like to see this happen on some other machine though. It'd be
> funny if you have a dud disk drive or something.
>
>
I have a backup disk as /dev/hdb, which is usually not mounted.
[root@spc steven]# df -T
Filesystem Type Size Used Avail Use% Mounted on
/dev/hda6 ext3 293M 88M 190M 32% /
/dev/hda9 reiserfs 26G 8.3G 18G 32% /home
/dev/hda1 ntfs 3.0G 2.4G 651M 79% /mnt/win_c
/dev/hda5 vfat 3.0G 811M 2.2G 27% /mnt/win_d
/dev/hda7 ext3 3.9G 2.1G 1.7G 56% /usr
/dev/hda8 ext3 491M 69M 397M 15% /var
/dev/hdb1 reiserfs 299M 180M 119M 61% /home/steven/red
/dev/hdb6 reiserfs 3.9G 1.8G 2.2G 46% /home/steven/blue
/dev/hdb7 reiserfs 299M 175M 124M 59% /home/steven/green
/dev/hdb8 reiserfs 197M 38M 159M 19% /home/steven/yellow
/dev/hdb9 reiserfs 12G 5.4G 5.8G 48% /home/steven/purple
I could possibly reformat /dev/hdb6 (3.9G) for testing later.
Let me know the details on what would be most meaningful.
The ext3 fs (/usr) has survived four cycles of
bk pull bk://linux.bkbits.net/linux-2.5
bk unpull -fq
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-18 0:13 ` Andrew Morton
2004-05-18 0:45 ` Steven Cole
@ 2004-05-18 1:34 ` Larry McVoy
2004-05-18 1:42 ` Andrew Morton
1 sibling, 1 reply; 68+ messages in thread
From: Larry McVoy @ 2004-05-18 1:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Steven Cole, mason, torvalds, lm, wli, hugh, adi, support,
linux-kernel
On Mon, May 17, 2004 at 05:13:30PM -0700, Andrew Morton wrote:
> I guess it would be interesting to run it on a filesystem which has 2k or
> even 1k blocksize. If the corruption then terminates on a 2k- or
> 1k-boundary then that will rule out a few culprits.
Does anyone have a theory that accounts for the fact that the zeroed
section is always tail aligned and seems to be the same length? The
data seems to be
[ good page ] [ GGGB ] [ more good pages ] [ GGGB ] etc.
where the GGG is the first 2817 bytes and the B is the last 1279 (decimal)
bytes. That's just too weird to be random, right?
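
The arithmetic behind that pattern can be checked with a short sketch. It assumes a 4096-byte page, which is consistent with 2817 + 1279 = 4096; the helper name `split_page` is hypothetical and not taken from any tool in this thread.

```python
PAGE_SIZE = 4096  # assumption: i386 page size, consistent with 2817 + 1279


def split_page(start: int, end: int, page_size: int = PAGE_SIZE):
    """Describe a NUL run [start, end) relative to the page containing `start`.

    Returns (good_bytes, nul_bytes, tail_aligned), where tail_aligned is True
    when the run stops exactly at the end of that page.
    """
    page_start = start - (start % page_size)
    good_bytes = start - page_start
    nul_bytes = end - start
    tail_aligned = end == page_start + page_size
    return good_bytes, nul_bytes, tail_aligned


if __name__ == "__main__":
    # The GGGB layout described above: 2817 good bytes, then NULs to end of page.
    print(split_page(2817, 4096))
    # The run reported earlier in the thread: start 0xfb259a, end 0xfb3000.
    print(split_page(0xFB259A, 0xFB3000))
```

Applied to the run reported earlier in the thread (start 0xfb259a, end 0xfb3000), the same calculation gives 1434 good bytes followed by 2662 NULs, again ending exactly at the page boundary.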
> I'd really like to see this happen on some other machine though. It'd be
> funny if you have a dud disk drive or something.
We can easily rule that out. Steven, do a
dd if=/dev/zero of=USE_SOME_SPACE bs=1048576 count=500
which will eat up 500 MB and should eat up any bad blocks. I _really_
doubt it is a bad disk.
Steven, if you have a copy of lmbench then use lmdd from that like so
lmdd of=XXX opat=1
and that will write non-zero data to the disk, you then remove that file
and if we are getting random crud from the disk then we won't have nulls.
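
For anyone without lmbench at hand, the idea can be approximated as below. This is a rough stand-in under stated assumptions: `write_nonzero_fill` does not reproduce lmdd's actual output pattern, `find_nul_runs` is not the saga.c scanner from this thread, and both names are made up for illustration.

```python
import os
import tempfile


def write_nonzero_fill(path, size, chunk=1 << 20):
    """Fill `path` with `size` bytes of a repeating pattern that contains no NULs."""
    pattern = bytes(range(1, 256))  # 0x01..0xff, so any NUL read back is suspect
    block = (pattern * (chunk // len(pattern) + 1))[:chunk]
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(block[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data out to disk


def find_nul_runs(path, min_len=1, chunk=1 << 20):
    """Return [(start, end), ...] for each run of NUL bytes; `end` is exclusive."""
    runs = []
    run_start = None  # offset where the current run began, carried across chunks
    offset = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            for i, byte in enumerate(data):
                if byte == 0:
                    if run_start is None:
                        run_start = offset + i
                elif run_start is not None:
                    if offset + i - run_start >= min_len:
                        runs.append((run_start, offset + i))
                    run_start = None
            offset += len(data)
    if run_start is not None and offset - run_start >= min_len:
        runs.append((run_start, offset))
    return runs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "fill")
        write_nonzero_fill(path, 3 * 4096)
        print(find_nul_runs(path))  # an intact fill file has no NUL runs
```

The fill file would be written and removed before repeating the pull, and the scanner run on the affected file afterwards. If the bad region still reads back as NULs rather than as the fill pattern, stale disk contents are an unlikely source.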
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-18 1:34 ` Larry McVoy
@ 2004-05-18 1:42 ` Andrew Morton
2004-05-18 1:56 ` Steven Cole
0 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2004-05-18 1:42 UTC (permalink / raw)
To: Larry McVoy
Cc: elenstev, mason, torvalds, lm, wli, hugh, adi, support,
linux-kernel
Larry McVoy <lm@bitmover.com> wrote:
>
> > I'd really like to see this happen on some other machine though. It'd be
> > funny if you have a dud disk drive or something.
>
> We can easily rule that out. Steven, do a
>
> dd if=/dev/zero of=USE_SOME_SPACE bs=1048576 count=500
>
> which will eat up 500 MB and should eat up any bad blocks. I _really_
> doubt it is a bad disk.
Yes, me too. The sensitivity to CONFIG_PREEMPT makes that unlikely.
Two things I'm not clear on:
a) Has it been established that CONFIG_PREEMPT causes ppp to fail?
b) Has the file corruption been observed when PPP was not in use at all?
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-16 2:18 ` Linus Torvalds
2004-05-16 3:44 ` Linus Torvalds
@ 2004-05-18 1:47 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 68+ messages in thread
From: Benjamin Herrenschmidt @ 2004-05-18 1:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Steven Cole, Andrew Morton, adi, scole, support,
Linux Kernel list
On Sun, 2004-05-16 at 12:18, Linus Torvalds wrote:
> On Sat, 15 May 2004, Steven Cole wrote:
> >
> > In the spirit of 'rounding up the usual suspects', I'll unset CONFIG_PREEMPT
> > and try again.
>
> Thanks. If that doesn't do it, can you start binary-searching on kernel
> versions? I run with preempt myself (well, not on my current G5 desktop,
> but otherwise), so it _should_ be stable, but you may have a driver or
> something else that doesn't like preempt.
Heh ;) Well... PREEMPT for ppc64 is planned :)
> Or it could be any number of other config options. Do you have anything
> else interesting enabled?
>
> Linus
--
Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-18 1:42 ` Andrew Morton
@ 2004-05-18 1:56 ` Steven Cole
0 siblings, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-18 1:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Larry McVoy, mason, torvalds, wli, hugh, adi, support,
linux-kernel
On Monday 17 May 2004 07:42 pm, Andrew Morton wrote:
> Larry McVoy <lm@bitmover.com> wrote:
> >
> > > I'd really like to see this happen on some other machine though. It'd be
> > > funny if you have a dud disk drive or something.
> >
> > We can easily rule that out. Steven, do a
> >
> > dd if=/dev/zero of=USE_SOME_SPACE bs=1048576 count=500
I'll get to Larry's test a little later. Pizza's about to come out of the oven.
> >
> > which will eat up 500 MB and should eat up any bad blocks. I _really_
> > doubt it is a bad disk.
>
> Yes, me too. The sensitivity to CONFIG_PREEMPT makes that unlikely.
>
> Two things I'm not clear on:
>
> a) Has it been established that CONFIG_PREEMPT causes ppp to fail?
At the risk of a "post hoc ergo propter hoc" fallacy, I'd say yes. I have not
seen ppp fail without PREEMPT. No PREEMPT, no ppp problems.
>
> b) Has the file corruption been observed when PPP was not in use at all?
No, at least I don't have any specific memory of corruption without ppp.
I made a fresh 3.9G reiserfs fs on my second disk, and cloned a kernel
tree to it. I can use it for testing, but it's currently in sync with linux.bkbits.net/linux-2.5.
Later,
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
[not found] ` <200405172319.38853.elenstev@mesatop.com>
@ 2004-05-18 12:42 ` Chris Mason
2004-05-18 14:29 ` Steven Cole
2004-05-18 14:38 ` Linus Torvalds
1 sibling, 1 reply; 68+ messages in thread
From: Chris Mason @ 2004-05-18 12:42 UTC (permalink / raw)
To: Steven Cole
Cc: Linus Torvalds, Andrew Morton, Larry McVoy, wli, hugh, adi,
support, linux-kernel
On Tue, 2004-05-18 at 01:19, Steven Cole wrote:
> 2nd reply:
> I've made four successful rather large bk pulls with Chris' patch.
> Two were into two repos on my /home reiserfs, and I did
> a pull, unpull, and pull again on the new reiserfs on the second disk.
> No problems, and with PREEMPT of course.
> The last two pulls even survived a ppp failure occurring during resolve.
Good news, thank you.
> So, I take it that I should revert that one-liner if I want to get any failure data?
> With it, ext3 was pretty solid for this testing.
Yes, please test ext3 again without Andrew's one liner.
-chris
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-18 12:42 ` Chris Mason
@ 2004-05-18 14:29 ` Steven Cole
0 siblings, 0 replies; 68+ messages in thread
From: Steven Cole @ 2004-05-18 14:29 UTC (permalink / raw)
To: Chris Mason
Cc: Steven Cole, Andrew Morton, hugh, Larry McVoy, linux-kernel,
support, Linus Torvalds, adi, wli
On May 18, 2004, at 6:42 AM, Chris Mason wrote:
> On Tue, 2004-05-18 at 01:19, Steven Cole wrote:
>
>> 2nd reply:
>> I've made four successful rather large bk pulls with Chris' patch.
>> Two were into two repos on my /home reiserfs, and I did
>> a pull, unpull, and pull again on the new reiserfs on the second disk.
>> No problems, and with PREEMPT of course.
>> The last two pulls even survived a ppp failure occurring during
>> resolve.
>
> Good news, thank you.
>
>> So, I take it that I should revert that one-liner if I want to get
>> any failure data?
>> With it, ext3 was pretty solid for this testing.
>
> Yes, please test ext3 again without Andrew's one liner.
>
> -chris
>
With Andrew's one-liner backed out (and your patch for reiserfs
left in), I made one successful test on my /usr ext3 fs (which
was created with default block size) and one attempt on my
new 1k block ext3 fs on my second disk. Unfortunately, that
new fs is too small for this testing, and I got a not enough
space error while doing the first test. I plan on reformatting
the 3.9G reiserfs on my second disk to ext3 with 1k blocks
to try again tonight.
I attempted to run a script to do successive bk pulls/unpulls on
the larger ext3 fs overnight, but ppp kept failing, so I gave up
after turning into a tired pumpkin.
Steven
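
A pull/unpull loop of the kind described above might look roughly like this. It is a sketch, not the script that was actually run: the two `bk` command lines are the ones quoted earlier in the thread, while `run_cycles`, its parameters, and the assumption that `bk` exits non-zero when a pull fails (for example after a ppp drop) are illustrative.

```python
import subprocess

# Command lines as quoted earlier in the thread.
PULL = ["bk", "pull", "bk://linux.bkbits.net/linux-2.5"]
UNPULL = ["bk", "unpull", "-fq"]


def run_cycles(cycles, repo_dir, run=subprocess.run):
    """Run up to `cycles` pull/unpull rounds in `repo_dir`.

    Stops at the first command that exits non-zero and returns the number
    of rounds that completed cleanly. `run` is injectable so the loop can be
    exercised without BitKeeper installed.
    """
    completed = 0
    for _ in range(cycles):
        for cmd in (PULL, UNPULL):
            result = run(cmd, cwd=repo_dir)
            if result.returncode != 0:
                return completed
        completed += 1
    return completed
```

A corruption check on the repository files could be slotted in between the pull and the unpull, so that a bad round is caught before the unpull discards the evidence.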
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
[not found] ` <200405172319.38853.elenstev@mesatop.com>
2004-05-18 12:42 ` Chris Mason
@ 2004-05-18 14:38 ` Linus Torvalds
2004-05-19 10:53 ` Steven Cole
1 sibling, 1 reply; 68+ messages in thread
From: Linus Torvalds @ 2004-05-18 14:38 UTC (permalink / raw)
To: Steven Cole
Cc: Andrew Morton, Larry McVoy, mason, wli, hugh, adi, support,
linux-kernel
On Mon, 17 May 2004, Steven Cole wrote:
>
> No problems, and with PREEMPT of course.
Ok. Good. It's a small data-set, but the bug made sense, and so did the
fix.
> > If you see a failure on ext3, please try to analyze the corruption pattern
> > again. It might be something different.
>
> So, I take it that I should revert that one-liner if I want to get any failure data?
> With it, ext3 was pretty solid for this testing.
Yes. That one-liner is bogus. It was a good way to test a hypothesis for
the common case of a filesystem that uses the block_write_full_page thing
(and reiser is one of the few that doesn't), but it wasn't the real fix.
The reiser patch was the real fix for the problem on reiser, but ext3
should have been ok already. It uses (through a lot of other functions)
generic_file_aio_write_nolock() as the real write engine, and that one
calls "commit_write()" with the page lock held.
Linus
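
Why holding the page lock across commit_write() matters can be illustrated with a toy model. This is not kernel code, and it is an assumption that it matches the reiserfs bug in detail: it only shows how a writeback path that zeroes the page past the file size it observes can wipe freshly copied data if the new size is published too late. All names here are hypothetical.

```python
PAGE_SIZE = 4096


def zero_tail(page, i_size_seen):
    """Writeback-style cleanup: zero everything in the page past the observed file size."""
    offset = i_size_seen % PAGE_SIZE
    if offset:
        page[offset:] = bytes(PAGE_SIZE - offset)


def append_unlocked(page, inode, data):
    """Toy writer that copies data, then lets writeback run before publishing the new size."""
    old = inode["size"]
    page[old:old + len(data)] = data
    zero_tail(page, inode["size"])   # writeback slips in and sees the stale size
    inode["size"] = old + len(data)


def append_locked(page, inode, data):
    """Toy writer that copies data and publishes the new size as one step."""
    old = inode["size"]
    page[old:old + len(data)] = data
    inode["size"] = old + len(data)  # size is updated while the "lock" is still held
    zero_tail(page, inode["size"])   # writeback runs afterwards and sees the new size


def demo(append):
    """Append 1279 bytes to a 2817-byte single-page file; return NULs left in the tail."""
    page = bytearray(b"G" * 2817 + bytes(PAGE_SIZE - 2817))
    inode = {"size": 2817}
    append(page, inode, b"D" * 1279)
    return page[2817:].count(0)


if __name__ == "__main__":
    print("unlocked:", demo(append_unlocked))
    print("locked:  ", demo(append_locked))
```

In the unlocked ordering the tail of the page ends up zeroed even though the file size says the data is there, which is the tail-aligned shape reported in this thread. In the locked ordering the data survives.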
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-18 14:38 ` Linus Torvalds
@ 2004-05-19 10:53 ` Steven Cole
2004-05-19 12:10 ` Chris Mason
0 siblings, 1 reply; 68+ messages in thread
From: Steven Cole @ 2004-05-19 10:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Larry McVoy, mason, wli, hugh, adi, support,
linux-kernel
On Tuesday 18 May 2004 08:38 am, Linus Torvalds wrote:
>
> On Mon, 17 May 2004, Steven Cole wrote:
> >
> > No problems, and with PREEMPT of course.
>
> Ok. Good. It's a small data-set, but the bug made sense, and so did the
> fix.
Perhaps a final note on this: I did more testing on reiserfs overnight with
Chris' patch, and it survived eleven pulls and unpulls with no failures.
>
> > > If you see a failure on ext3, please try to analyze the corruption pattern
> > > again. It might be something different.
> >
> > So, I take it that I should revert that one-liner if I want to get any failure data?
> > With it, ext3 was pretty solid for this testing.
>
> Yes. That one-liner is bogus. It was a good way to test a hypothesis for
> the common case of a filesystem that uses the block_write_full_page thing
> (and reiser is one of the few that doesn't), but it wasn't the real fix.
> The reiser patch was the real fix for the problem on reiser, but ext3
> should have been ok already. It uses (through a lot of other functions)
> generic_file_aio_write_nolock() as the real write engine, and that one
> calls "commit_write()" with the page lock held.
>
> Linus
I also tested ext3 more extensively (10 pulls/unpulls) and could not repeat
the alleged failure on ext3. That was with akpm's one-liner backed out.
Steven
* Re: 1352 NUL bytes at the end of a page? (was Re: Assertion `s && s->tree' failed: The saga continues.)
2004-05-19 10:53 ` Steven Cole
@ 2004-05-19 12:10 ` Chris Mason
0 siblings, 0 replies; 68+ messages in thread
From: Chris Mason @ 2004-05-19 12:10 UTC (permalink / raw)
To: Steven Cole
Cc: Linus Torvalds, Andrew Morton, Larry McVoy, wli, hugh, adi,
support, linux-kernel
On Wed, 2004-05-19 at 06:53, Steven Cole wrote:
> On Tuesday 18 May 2004 08:38 am, Linus Torvalds wrote:
> >
> > On Mon, 17 May 2004, Steven Cole wrote:
> > >
> > > No problems, and with PREEMPT of course.
> >
> > Ok. Good. It's a small data-set, but the bug made sense, and so did the
> > fix.
>
> Perhaps a final note on this: I did more testing on reiserfs overnight with
> Chris' patch, and it survived eleven pulls and unpulls with no failures.
Good to hear. We probably still need Andrew's truncate fix, this just
isn't the right workload to show it. Andrew, that reiserfs fix survived
testing here, could you please include it?
-chris