From mboxrd@z Thu Jan  1 00:00:00 1970
From: Harry Kalogirou <harkal@gmx.net>
Subject: Re: webserver stalls [was Re: bug in (linux) slattach]
Date: 22 Oct 2002 16:56:42 +0300
Sender: linux-8086-owner@vger.kernel.org
Message-ID: <1035285405.1634.123.camel@cool>
References: <Pine.LNX.4.33.0210220137370.6509-100000@olympus.btstream.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="=-nGVqBN98kKfTPU3t5rPu"
Return-path: <linux-8086-owner@vger.kernel.org>
In-Reply-To: <Pine.LNX.4.33.0210220137370.6509-100000@olympus.btstream.com>
List-Id: <linux-8086.vger.kernel.org>
To: jb1@btstream.com
Cc: Linux-8086 <linux-8086@vger.kernel.org>


--=-nGVqBN98kKfTPU3t5rPu
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

> On 21 Oct 2002, Harry Kalogirou wrote:
>
> > Mmm.. weird.. I probably got you tired with all this, but can you try
> > and see if the failures are really random? A good aid for this is the
> > -p parameter of ping.
>
> 100 pings (200 packets) each of patterns 00, 55, aa, and ff had zero to
> five errors, too few to account for the 100 percent failure rate of
> certain webpage files. 55 had the most errors and was the only one with an
> error in the pattern data. Most of the other errors were something about
> the time-of-day going back; 00 had one extremely long response time
> (1074131 ms).
>
>
> I think I can now prove that there's at least one IP Header sum-with-carry
> that results in a reproducible checksum error. I discovered that if the
> ELKS IP address were 192.168.1.135, all my test files could be read; large
> files required a few tries, but I was even able to read one 4369 (0x1111)
> bytes long! The unique property of the packets that never got ACK'ed is
> that their checksum-field contains 0xF6FF instead of the correct value
> 0xF5FF (the complement of 0x0A00).
>
> Each of the webpage files that stall produces a defective packet with this
> IP Header (the first twenty bytes of the packet):
> 	4500 003f 0000 0000 4006 f6ff c0a8 0164 c0a8 0205
> The corresponding packet in the 99-byte file is one byte shorter (003e
> instead of 003f), consequently having a different IP Header Checksum
> (f600 instead of the erroneous f6ff):
> 	4500 003e 0000 0000 4006 f600 c0a8 0164 c0a8 0205
>
> Ping uses Protocol 01 instead of Protocol 06, so by changing the ELKS IP
> address from 192.168.1.100 to 192.168.1.105 I was able to produce the
> identical erroneous IP Header Checksum with the command:
> 	ping -s 35 192.168.1.105
> resulting in the IP header:
> 	4500 003f 0000 0000 4001 f6ff c0a8 0169 c0a8 0205
>
> To demonstrate that the problem is not the total packet size I added 1 to
> the packetsize and subtracted 1 from the ELKS IP address:
> 	ping -s 36 192.168.1.104
> resulting in the IP Header:
> 	4500 0040 0000 0000 4001 f6ff c0a8 0168 c0a8 0205
>
> Just for symmetry, I produced the same checksum as that for the 99-byte
> webpage file, but the same length as the 100- and 266-byte webpage files
> with:
> 	ping -s 35 192.168.1.104
> resulting in the IP Header:
> 	4500 003f 0000 0000 4001 f600 c0a8 0168 c0a8 0205
>
> In all cases, the pings with the defective checksum had 100% loss, while
> those with the good checksum succeeded. I didn't try manipulating the
> source IP address (c0a8 0205 = 192.168.2.5). If you can manipulate the
> packetsize and ELKS IP address so that the sum-with-carry of this header
> sans checksum-field is 0x0A00 you should be able to reproduce my results;
> otherwise it's probably a quirk in Red Hat 7.0 Linux (or you're using a
> different version of some critical ELKS file).
>
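
As a cross-check on the headers quoted above, here is a minimal Python
sketch of the standard Internet checksum (one's-complement sum of 16-bit
words with end-around carry, complemented at the end) applied to headers
rebuilt from the reported fields with the checksum field zeroed. The
function names are made up for illustration and are not anything from
ELKS; the sketch only recomputes the expected values and says nothing
about where the ELKS code goes wrong.

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit big-endian words, folding any carry back into the low 16 bits."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def build_header(total_length: int, protocol: int, elks_last_octet: int) -> bytes:
    """20-byte IPv4 header laid out like the dumps in the report, checksum zeroed."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0x00, total_length,                 # version/IHL, TOS, total length
        0x0000, 0x0000,                           # identification, flags/fragment
        0x40, protocol, 0x0000,                   # TTL, protocol, checksum (zero)
        bytes([192, 168, 1, elks_last_octet]),    # first address field: ELKS box
        bytes([192, 168, 2, 5]),                  # second address field: other host
    )

def expected_checksum(total_length: int, protocol: int, elks_last_octet: int) -> int:
    header = build_header(total_length, protocol, elks_last_octet)
    return ones_complement_sum(header) ^ 0xFFFF

# (description, total length, protocol, last octet of the ELKS address)
cases = [
    ("web page, stalls",         0x3F,        6, 100),
    ("web page, 99-byte file",   0x3E,        6, 100),
    ("ping -s 35 192.168.1.105", 20 + 8 + 35, 1, 105),
    ("ping -s 36 192.168.1.104", 20 + 8 + 36, 1, 104),
    ("ping -s 35 192.168.1.104", 20 + 8 + 35, 1, 104),
]

for name, length, proto, octet in cases:
    print(f"{name:26s} expected checksum {expected_checksum(length, proto, octet):04x}")
```

Every case that was reported as failing has an expected checksum of f5ff,
and both cases that worked have f600.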

I'll check and get back to you...
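
For picking test cases, something like the following Python sketch could
list the ping payload sizes and ELKS last-octet values whose header sums
to 0x0A00, the value whose complement (0xF5FF) is the correct checksum in
the failing cases. It assumes the header layout from the dumps above (ID
and fragment fields zero, TTL 0x40, ICMP, peer 192.168.2.5) and a 20-byte
IP header plus 8-byte ICMP header in front of the ping payload. The names
are hypothetical.

```python
def header_sum(total_length: int, protocol: int, elks_last_octet: int) -> int:
    """One's-complement sum of the header words, checksum field left out."""
    words = [
        0x4500, total_length, 0x0000, 0x0000,
        0x4000 | protocol,                # TTL 0x40 and protocol byte
        0xC0A8, 0x0100 | elks_last_octet, # 192.168.1.x (ELKS box)
        0xC0A8, 0x0205,                   # 192.168.2.5 (other host)
    ]
    total = sum(words)
    while total >> 16:                    # end-around carry
        total = (total & 0xFFFF) + (total >> 16)
    return total

def suspect_pings(target_sum=0x0A00, sizes=range(0, 65), octets=range(1, 255)):
    """Return (ping -s size, last octet) pairs whose header sum hits target_sum."""
    hits = []
    for size in sizes:
        total_length = 20 + 8 + size      # IP header + ICMP header + payload
        for octet in octets:
            if header_sum(total_length, 1, octet) == target_sum:
                hits.append((size, octet))
    return hits

for size, octet in suspect_pings(sizes=range(35, 37)):
    print(f"ping -s {size} 192.168.1.{octet}")
```

With these assumptions the two failing commands from the report,
ping -s 35 192.168.1.105 and ping -s 36 192.168.1.104, both show up in
the list, while ping -s 35 192.168.1.104 does not.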

> Note: I think bad packets consume memory. After several unsuccessful
> transfers I started seeing "Cannot fork" on the ELKS box when I issued
> commands ... eventually I'd have to reboot it. It might be a good idea to
> purge them after a minute or two.

Those are the web server processes still waiting for their data to be
transmitted; when they exit, the memory will be freed.

> Does anything other than the system time depend upon the CMOS clock? It
> obviously hasn't been read on any of the four machines on which I tried
> ELKS (yes, they all *have* standard, working CMOS clocks).
>

I don't think so.

Harry


--=-nGVqBN98kKfTPU3t5rPu
Content-Type: application/pgp-signature; name=signature.asc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD4DBQA9tTObXrjIZPA34x8RArjLAJi8+c1sYe/6N8F+18lUmLgHiiq6AJsEGQgD
mx0ARTsn8BuvqLfiyrHs5A==
=3EYJ
-----END PGP SIGNATURE-----

--=-nGVqBN98kKfTPU3t5rPu--