From: Harry Kalogirou
Subject: Re: webserver stalls [was Re: bug in (linux) slattach]
Date: 22 Oct 2002 16:56:42 +0300
Sender: linux-8086-owner@vger.kernel.org
Message-ID: <1035285405.1634.123.camel@cool>
To: jb1@btstream.com
Cc: Linux-8086

> On 21 Oct 2002, Harry Kalogirou wrote:
>
> > Mmm.. weird.. I have probably tired you out with all this, but can you
> > try and see if the failures are really random? A good aid for this is
> > the -p parameter of ping.
>
> 100 pings (200 packets) each of patterns 00, 55, aa, and ff had zero to
> five errors, too few to account for the 100 percent failure rate of
> certain webpage files. 55 had the most errors and was the only one with an
> error in the pattern data. Most of the other errors were something about
> the time-of-day going back; 00 had one extremely long response time
> (1074131 ms).
>
>
> I think I can now prove that there's at least one IP Header sum-with-carry
> that results in a reproducible checksum error. I discovered that if the
> ELKS IP address were 192.168.1.135, all my test files could be read; large
> files required a few tries, but I was even able to read one 4369 (0x1111)
> bytes long! The unique property of the packets that never got ACK'ed is
> that their checksum-field contains 0xF6FF instead of the correct value
> 0xF5FF (the complement of 0x0A00).
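
The 0xF5FF figure can be checked independently. The following is a minimal
sketch, not ELKS code: it applies the standard ones' complement header
checksum (as described in RFC 1071) to the two TCP headers quoted in this
message, with the checksum field zeroed. The function name
ip_header_checksum is made up for illustration.

```python
import struct


def ip_header_checksum(header: bytes) -> int:
    """Ones' complement checksum of a header whose checksum field is zero.

    Illustrative helper (not taken from ELKS): sum the 16-bit big-endian
    words, fold any carry back into the low 16 bits, then complement.
    """
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    while total >> 16:  # end-around carry
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# The two 20-byte headers from the message, checksum field set to 0000.
stalled = bytes.fromhex("4500003f0000000040060000c0a80164c0a80205")
working = bytes.fromhex("4500003e0000000040060000c0a80164c0a80205")

print(f"stalled packet: {ip_header_checksum(stalled):04x}")  # f5ff
print(f"99-byte file:   {ip_header_checksum(working):04x}")  # f600
```

For the stalled packet the unfolded sum is 0x209FE, which folds to 0x0A00,
so the expected checksum is 0xF5FF and not the 0xF6FF seen on the wire.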
>
> Each of the webpage files that stall produces a defective packet with this
> IP Header (the first twenty bytes of the packet):
>   4500 003f 0000 0000 4006 f6ff c0a8 0164 c0a8 0205
> The corresponding packet in the 99-byte file is one byte shorter (003e
> instead of 003f), consequently having a different IP Header Checksum
> (f600 instead of the erroneous f6ff):
>   4500 003e 0000 0000 4006 f600 c0a8 0164 c0a8 0205
>
> Ping uses Protocol 01 instead of Protocol 06, so by changing the ELKS IP
> address from 192.168.1.100 to 192.168.1.105 I was able to produce the
> identical erroneous IP Header Checksum with the command:
>   ping -s 35 192.168.1.105
> resulting in the IP header:
>   4500 003f 0000 0000 4001 f6ff c0a8 0169 c0a8 0205
>
> To demonstrate that the problem is not the total packet size I added 1 to
> the packetsize and subtracted 1 from the ELKS IP address:
>   ping -s 36 192.168.1.104
> resulting in the IP Header:
>   4500 0040 0000 0000 4001 f6ff c0a8 0168 c0a8 0205
>
> Just for symmetry, I produced the same checksum as that for the 99-byte
> webpage file, but the same length as the 100- and 266-byte webpage files
> with:
>   ping -s 35 192.168.1.104
> resulting in the IP Header:
>   4500 003f 0000 0000 4001 f600 c0a8 0168 c0a8 0205
>
> In all cases, the pings with the defective checksum had 100% loss, while
> those with the good checksum succeeded. I didn't try manipulating the
> source IP address (c0a8 0205 = 192.168.2.5). If you can manipulate the
> packetsize and ELKS IP address so that the sum-with-carry of this header
> sans checksum-field is 0x09C1 you should be able to reproduce my results;
> otherwise it's probably a quirk in Red Hat 7.0 Linux (or you're using a
> different version of some critical ELKS file).

I'll check and get back to you...

> Note: I think bad packets consume memory.
> After several unsuccessful transfers I started seeing "Cannot fork" on the
> ELKS box when I issued commands ... eventually I'd have to reboot it. It
> might be a good idea to purge them after a minute or two.

These are the web servers that wait for the data to be transmitted; when
they exit, the memory will be freed.

> Does anything other than the system time depend upon the CMOS clock? It
> obviously hasn't been read on any of the four machines on which I tried
> ELKS (yes, they all *have* standard, working CMOS clocks).

I don't think so.

Harry
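
For anyone trying to reproduce the report, the quoted recipe (vary the ping
payload size and the ELKS address until the header sum hits the bad value)
can be sketched as a small search. This is only an illustration under
stated assumptions: the header layout copies the dumps in the message
(ID 0, flags 0, TTL 0x40, protocol 01, peer 192.168.2.5), the target is the
folded sum 0x0A00 whose complement the message gives as 0xF5FF (the quoted
0x09C1 plus the length word 0x003F comes to the same 0x0A00), and the names
folded_sum, ping_header and failing_combinations are hypothetical.

```python
import ipaddress
import struct

PEER_IP = "192.168.2.5"  # c0a8 0205 in the quoted dumps
TARGET_SUM = 0x0A00      # folded sum the report associates with failures


def folded_sum(header: bytes) -> int:
    """16-bit ones' complement sum of the header words (not complemented)."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def ping_header(payload: int, elks_ip: str, peer_ip: str = PEER_IP) -> bytes:
    """IP header shaped like the dumps in the message, checksum field zero.

    Assumes 20 bytes of IP header plus 8 bytes of ICMP header in front of
    the payload that `ping -s` requests.
    """
    total_len = 20 + 8 + payload
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, total_len,  # version/IHL, TOS, total length
        0, 0,                # ID, flags/fragment offset
        0x40, 1, 0,          # TTL, protocol (ICMP), checksum placeholder
        ipaddress.IPv4Address(elks_ip).packed,
        ipaddress.IPv4Address(peer_ip).packed,
    )


def failing_combinations(sizes, last_octets):
    """Payload sizes and 192.168.1.x addresses whose header sum is TARGET_SUM."""
    hits = []
    for size in sizes:
        for octet in last_octets:
            ip = f"192.168.1.{octet}"
            if folded_sum(ping_header(size, ip)) == TARGET_SUM:
                hits.append((size, ip))
    return hits


hits = failing_combinations(range(32, 40), range(100, 110))
for size, ip in hits:
    print(f"ping -s {size} {ip}")
```

The list includes the two failing commands from the message (ping -s 35
192.168.1.105 and ping -s 36 192.168.1.104) and leaves out the working one
(ping -s 35 192.168.1.104). Whether the other combinations also fail on an
ELKS box is something only a test run can show.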