Re: webserver stalls [was Re: bug in (linux) slattach]

Linux-8086 Development Archive on lore.kernel.org

* Re: webserver stalls [was Re: bug in (linux) slattach]
       [not found] <1035036158.454.17.camel@cool>
@ 2002-10-20  9:34 ` jb1
  2002-10-20 17:06   ` Harry Kalogirou
  0 siblings, 1 reply; 14+ messages in thread
From: jb1 @ 2002-10-20  9:34 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Linux-8086

On 19 Oct 2002, Harry Kalogirou wrote:

> 
> Since I had used ELKS for a long time on my network and I hardly had
> checksum errors, had other bigger problems 8), I think that this has

From what I've seen in the mailing list archives, I think other people have
had similar problems. They probably just gave up when no one answered
their vague, sometimes irrelevant, questions.

> something to do with the serial line altering bytes that ELKS transmits.
> Can you send me the output of "stty -a -F /dev/ttySX" after you set up
> the connection? Maybe the line is not correctly set up on the Linux side
> (XOFF, XON and stuff).

After issuing "/bin/stty -F /dev/ttyS0 4800" on the Red Hat 7.0 Linux box,
"stty -a -F /dev/ttyS0" displays:
speed 4800 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W;
lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
-ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon 
-ixoff
-iuclc -ixany -imaxbel
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 
vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop 
-echoprt
echoctl echoke

> If the above proves the setup of the serial line to be ok, then ELKS has
> a problem. Maybe then the problem is at the assembly optimized checksum
> routines I wrote. Disabling that by undefing USE_ASM in ip.c will show
> that.

Both the C and ASM routines in ip_calc_chksum() in elksnet/ktcp/ip.c from
elksnet-0.1.1.tar.gz look like they should have worked correctly for the
packet with the bad IP Header checksum. The C routine has a lurking bug;
it doesn't account for a possible carry in
	return ~((sum & 0xffff) + ((sum >> 16) & 0xffff));
but even if USE_ASM were undefined it wouldn't have affected that packet.
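
To make the lurking carry concrete, here is a minimal C sketch. The names
fold_once and fold_twice are made up for illustration and are not taken from
elksnet. fold_once mirrors the return expression quoted above; fold_twice
folds a second time so that a carry out of the first fold is added back in.
This is only one way the carry might be handled, not the change that went
into CVS.

```c
#include <stdint.h>

/* Mirrors the quoted expression: a single fold of the 32-bit accumulator.
 * A carry out of the 16-bit addition is silently lost by the final cast. */
static uint16_t fold_once(uint32_t sum)
{
    return (uint16_t)~((sum & 0xffff) + ((sum >> 16) & 0xffff));
}

/* Hypothetical fix: fold twice so a carry produced by the first fold
 * is added back into the low 16 bits before complementing. */
static uint16_t fold_twice(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For an accumulator of 0x0001ffff the two functions disagree (0xffff versus
0xfffe). For the big-endian word sum of the bad packet's header, 0x000209fe,
both give 0xf5ff, which is consistent with the C routine not being
responsible for that packet.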

Even if my serial port handshaking is incorrectly set up, the difference
between the packet that fails and the one that succeeds is trivial. The
former is 63 bytes long with the data "Content-Length: 100^M^J^M^J"; the
latter is 62 bytes long with the data "Content-Length: 99^M^J^M^J" and
(after the ACK by the Linux box) is followed by a *successful* 139-byte
packet containing the entire 99-byte webpage file. Also, the erroneous
checksum is exactly the same and in exactly the same packet even for the
266-byte original file tcpdump'ed several days earlier ("Content-Length:  
266^M^J^M^J"). The Linux box's /proc/cpuinfo says its AMD-K6 is running at 
360.800 MHz; I wouldn't be surprised if it could run 4800 baud with no 
handshaking at all, and I didn't notice any XON or XOFF characters mixed 
in with the data.

I wonder if there's any significance in the fact that the problem occurs
precisely at the boundary between data obviously generated by the server,
itself, and the contents of the webpage file.


* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-20  9:34 ` webserver stalls [was Re: bug in (linux) slattach] jb1
@ 2002-10-20 17:06   ` Harry Kalogirou
  2002-10-21  9:44     ` jb1
  0 siblings, 1 reply; 14+ messages in thread
From: Harry Kalogirou @ 2002-10-20 17:06 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> On 19 Oct 2002, Harry Kalogirou wrote:
> 
> > 
> > Since I had used ELKS for a long time on my network and I hardly had
> > checksum errors, had other bigger problems 8), I think that this has
> 
> From what I've seen in the mailing list archives, I think other people have
> had similar problems. They probably just gave up when no one answered
> their vague, sometimes irrelevant, questions.
> 
> > something to do with the serial line altering bytes that ELKS transmits.
> > Can you send me the output of "stty -a -F /dev/ttySX" after you set up
> > the connection? Maybe the line is not correctly set up on the Linux side
> > (XOFF, XON and stuff).
> 
> After issuing "/bin/stty -F /dev/ttyS0 4800" on the Red Hat 7.0 Linux box,
> "stty -a -F /dev/ttyS0" displays:
> speed 4800 baud; rows 0; columns 0; line = 0;
> intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
> eol2 = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W;
> lnext = ^V; flush = ^O; min = 1; time = 0;
> -parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
> -ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon 
> -ixoff
> -iuclc -ixany -imaxbel
> opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 
> vt0 ff0
> isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop 
> -echoprt
> echoctl echoke
>

As I suspected, a misconfigured line. Configure yours like this:

intr = <undef>; quit = <undef>; erase = <undef>; kill = <undef>; eof =
<undef>;
eol = <undef>; eol2 = <undef>; start = <undef>; stop = <undef>; susp =
<undef>;
rprnt = <undef>; werase = <undef>; lnext = <undef>; flush = <undef>;
min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon
-ixoff
-iuclc -ixany -imaxbel
-opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0
bs0 vt0
ff0
-isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop
-echoprt -echoctl -echoke
 

Basically, the above configuration is done by the -L parameter of
slattach, except for -crtscts. What I do, just to be sure, is:

# slattach -p [c]slip -L -e /dev/ttyS0
# stty -F /dev/ttyS0 -crtscts
# slattach -p [c]slip -s 4800 -m /dev/ttyS0 &
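
For anyone who wants the same line settings from a program rather than from
stty, here is a rough C sketch using the POSIX termios calls. The function
names make_slip_raw and apply_slip_raw are hypothetical, the 4800 baud rate
is simply the one used in this thread, and CRTSCTS is a non-POSIX extension
that is only touched where the platform defines it. It does not claim to be
what slattach -L does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <termios.h>

/* Put an already-fetched termios structure into the raw 8N1 mode listed
 * above: no signals, no echo, no CR/NL translation, no XON/XOFF. */
static void make_slip_raw(struct termios *t)
{
    assert(t != NULL);
    t->c_iflag &= ~(tcflag_t)(BRKINT | PARMRK | INPCK | ISTRIP |
                              INLCR | IGNCR | ICRNL | IXON | IXOFF);
    t->c_iflag |= IGNBRK | IGNPAR;
    t->c_oflag &= ~(tcflag_t)OPOST;
    t->c_lflag &= ~(tcflag_t)(ISIG | ICANON | IEXTEN |
                              ECHO | ECHOE | ECHOK | ECHONL);
    t->c_cflag &= ~(tcflag_t)(CSIZE | PARENB | PARODD | CSTOPB);
    t->c_cflag |= CS8 | CREAD | CLOCAL | HUPCL;
#ifdef CRTSCTS
    t->c_cflag &= ~(tcflag_t)CRTSCTS;   /* -crtscts, where available */
#endif
    t->c_cc[VMIN] = 1;                  /* min = 1 */
    t->c_cc[VTIME] = 0;                 /* time = 0 */
    cfsetispeed(t, B4800);
    cfsetospeed(t, B4800);
}

/* Read the current settings of an open tty, make them raw, write them back.
 * Returns 0 on success and -1 on failure, like tcsetattr(). */
static int apply_slip_raw(int fd)
{
    struct termios t;
    if (tcgetattr(fd, &t) != 0)
        return -1;
    make_slip_raw(&t);
    return tcsetattr(fd, TCSANOW, &t);
}
```

apply_slip_raw would be called on a descriptor obtained by opening the
serial device; "stty -a -F" on the same device should then show icrnl, ixon,
opost, isig, icanon and echo all switched off.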


Harry





* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-20 17:06   ` Harry Kalogirou
@ 2002-10-21  9:44     ` jb1
  2002-10-21  9:55       ` Harry Kalogirou
  0 siblings, 1 reply; 14+ messages in thread
From: jb1 @ 2002-10-21  9:44 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Linux-8086

On 20 Oct 2002, Harry Kalogirou wrote:

> > On 19 Oct 2002, Harry Kalogirou wrote:
...
> As I suspected, a misconfigured line. Configure yours like this:
> 
> intr = <undef>; quit = <undef>; erase = <undef>; kill = <undef>; eof =
> <undef>;
> eol = <undef>; eol2 = <undef>; start = <undef>; stop = <undef>; susp =
> <undef>;
> rprnt = <undef>; werase = <undef>; lnext = <undef>; flush = <undef>;
> min = 1; time = 0;
> -parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
> ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon
> -ixoff
> -iuclc -ixany -imaxbel
> -opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0
> bs0 vt0
> ff0
> -isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop
> -echoprt -echoctl -echoke
>  
> 
> Basically, the above configuration is done by the -L parameter of
> slattach, except for -crtscts. What I do, just to be sure, is:
> 
> # slattach -p [c]slip -L -e /dev/ttyS0
> # stty -F /dev/ttyS0 -crtscts
> # slattach -p [c]slip -s 4800 -m /dev/ttyS0 &

I did as you suggested (with one exception), confirmed that the settings 
were exactly like yours, and found *no* difference; the 99-byte webpage 
file works, the 100-byte webpage files don't. The exception was:
	stty 4800 -F /dev/ttyS0 -crtscts
because my slattach program doesn't seem to change the baud rate. Also, 
I'm still getting frequent, seemingly random errors when I ping the ELKS box.



* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-21  9:44     ` jb1
@ 2002-10-21  9:55       ` Harry Kalogirou
  2002-10-22 10:16         ` jb1
  0 siblings, 1 reply; 14+ messages in thread
From: Harry Kalogirou @ 2002-10-21  9:55 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086


> On 20 Oct 2002, Harry Kalogirou wrote:
> 
> > > On 19 Oct 2002, Harry Kalogirou wrote:
> ...
> > As I suspected, a misconfigured line. Configure yours like this:
> > 
> > intr = <undef>; quit = <undef>; erase = <undef>; kill = <undef>; eof =
> > <undef>;
> > eol = <undef>; eol2 = <undef>; start = <undef>; stop = <undef>; susp =
> > <undef>;
> > rprnt = <undef>; werase = <undef>; lnext = <undef>; flush = <undef>;
> > min = 1; time = 0;
> > -parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
> > ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon
> > -ixoff
> > -iuclc -ixany -imaxbel
> > -opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0
> > bs0 vt0
> > ff0
> > -isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop
> > -echoprt -echoctl -echoke
> >  
> > 
> > Basically, the above configuration is done by the -L parameter of
> > slattach, except for -crtscts. What I do, just to be sure, is:
> > 
> > # slattach -p [c]slip -L -e /dev/ttyS0
> > # stty -F /dev/ttyS0 -crtscts
> > # slattach -p [c]slip -s 4800 -m /dev/ttyS0 &
> 
> I did as you suggested (with one exception), confirmed that the settings 
> were exactly like yours, and found *no* difference; the 99-byte webpage 
> file works, the 100-byte webpage files don't. The exception was:
> 	stty 4800 -F /dev/ttyS0 -crtscts
> because my slattach program doesn't seem to change the baud rate. Also, 
> I'm still getting frequent, seemingly random errors when I ping the ELKS box.

Mmm.. weird.. I've probably tired you with all this, but can you try to
see whether the failures are really random? A good aid for this is the -p
parameter of ping.

I'm just convinced that there is a problem after the packets leave ELKS.
Harry





* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-21  9:55       ` Harry Kalogirou
@ 2002-10-22 10:16         ` jb1
  2002-10-22 13:56           ` Harry Kalogirou
  2002-10-22 13:57           ` [SOLVED] " Harry Kalogirou
  0 siblings, 2 replies; 14+ messages in thread
From: jb1 @ 2002-10-22 10:16 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Linux-8086

On 21 Oct 2002, Harry Kalogirou wrote:

> Mmm.. weird.. I've probably tired you with all this, but can you try to
> see whether the failures are really random? A good aid for this is the -p
> parameter of ping.

100 pings (200 packets) each of patterns 00, 55, aa, and ff had zero to 
five errors, too few to account for the 100 percent failure rate of 
certain webpage files. 55 had the most errors and was the only one with an 
error in the pattern data. Most of the other errors were something about 
the time-of-day going back; 00 had one extremely long response time 
(1074131 ms).

I think I can now prove that there's at least one IP Header sum-with-carry
that results in a reproducible checksum error. I discovered that if the
ELKS IP address were 192.168.1.135, all my test files could be read; large
files required a few tries, but I was even able to read one 4369 (0x1111)
bytes long! The unique property of the packets that never got ACK'ed is 
that their checksum-field contains 0xF6FF instead of the correct value 
0xF5FF (the complement of 0x0A00).

Each of the webpage files that stall produces a defective packet with this 
IP Header (the first twenty bytes of the packet):
	4500 003f 0000 0000 4006 f6ff c0a8 0164 c0a8 0205
The corresponding packet in the 99-byte file is one byte shorter (003e 
instead of 003f), consequently having a different IP Header Checksum 
(f600 instead of the erroneous f6ff):
	4500 003e 0000 0000 4006 f600 c0a8 0164 c0a8 0205

Ping uses Protocol 01 instead of Protocol 06, so by changing the ELKS IP 
address from 192.168.1.100 to 192.168.1.105 I was able to produce the 
identical erroneous IP Header Checksum with the command:
	ping -s 35 192.168.1.105
resulting in the IP Header:
	4500 003f 0000 0000 4001 f6ff c0a8 0169 c0a8 0205

To demonstrate that the problem is not the total packet size I added 1 to 
the packetsize and subtracted 1 from the ELKS IP address:
	ping -s 36 192.168.1.104
resulting in the IP Header:
	4500 0040 0000 0000 4001 f6ff c0a8 0168 c0a8 0205

Just for symmetry, I produced the same checksum as that for the 99-byte 
webpage file, but the same length as the 100- and 266-byte webpage files 
with:
	ping -s 35 192.168.1.104
resulting in the IP Header:
	4500 003f 0000 0000 4001 f600 c0a8 0168 c0a8 0205

In all cases, the pings with the defective checksum had 100% loss, while 
those with the good checksum succeeded. I didn't try manipulating the 
source IP address (c0a8 0205 = 192.168.2.5). If you can manipulate the 
packetsize and ELKS IP address so that the sum-with-carry of this header 
sans checksum-field is 0x09C1 you should be able to reproduce my results; 
otherwise it's probably a quirk in Red Hat 7.0 Linux (or you're using a 
different version of some critical ELKS file).
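
The observed values can be reproduced with a small C sketch that emulates a
chain of 16-bit ADC instructions over the header as it sits in 8086 memory
(little-endian words). Everything here is an illustration under assumptions:
the names stalled_hdr, good_hdr, adc_chain_checksum and wire_order are
hypothetical, and the code is not the ELKS routine. It only shows that
dropping the carry left by the last addition turns the correct f5ff into the
f6ff seen on the wire, while the 99-byte file's header is unaffected.

```c
#include <stddef.h>
#include <stdint.h>

/* Header of the stalled packet, checksum field (bytes 10-11) zeroed. */
static const uint8_t stalled_hdr[20] = {
    0x45, 0x00, 0x00, 0x3f, 0x00, 0x00, 0x00, 0x00, 0x40, 0x06,
    0x00, 0x00, 0xc0, 0xa8, 0x01, 0x64, 0xc0, 0xa8, 0x02, 0x05
};

/* Header for the 99-byte file: total length 0x003e instead of 0x003f. */
static const uint8_t good_hdr[20] = {
    0x45, 0x00, 0x00, 0x3e, 0x00, 0x00, 0x00, 0x00, 0x40, 0x06,
    0x00, 0x00, 0xc0, 0xa8, 0x01, 0x64, 0xc0, 0xa8, 0x02, 0x05
};

/* Emulates repeated "adc ax,[di]" over little-endian words. When
 * add_final_carry is zero, the carry left by the last ADC is dropped. */
static uint16_t adc_chain_checksum(const uint8_t *p, size_t len,
                                   int add_final_carry)
{
    uint16_t ax = 0;
    unsigned carry = 0;
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint16_t word = (uint16_t)(p[i] | (p[i + 1] << 8));
        uint32_t t = (uint32_t)ax + word + carry;
        ax = (uint16_t)t;          /* low 16 bits stay in AX */
        carry = t >> 16;           /* bit 16 becomes the carry flag */
    }
    if (add_final_carry)
        ax = (uint16_t)(ax + carry);   /* the "adc ax,0" step */
    return (uint16_t)~ax;
}

/* Swap bytes so the value reads the way it appears in the hex dumps. */
static uint16_t wire_order(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

With the final carry dropped, stalled_hdr yields f6ff in wire order; with it
added, f5ff. good_hdr yields f600 either way, because its last addition
produces no carry.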

SOURCE PACKAGES:
        elks-0.1.1.tar.gz, elkscmd_20020501.tar.gz, elksnet-0.1.1.tar.gz,
        Dev86src-0.16.0.tar.gz
CVS PATCHES:
        (none)
COMPILED UNDER:
        Red Hat 7.0 Linux, kernel 2.2.16-22

Note: I think bad packets consume memory. After several unsuccessful 
transfers I started seeing "Cannot fork" on the ELKS box when I issued 
commands ... eventually I'd have to reboot it. It might be a good idea to 
purge them after a minute or two.

Does anything other than the system time depend upon the CMOS clock? It 
obviously hasn't been read on any of the four machines on which I tried 
ELKS (yes, they all *have* standard, working CMOS clocks).

By the way, I received two copies of this message in addition to the copy 
sent from the mailing list.


* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 10:16         ` jb1
@ 2002-10-22 13:56           ` Harry Kalogirou
  2002-10-22 13:57           ` [SOLVED] " Harry Kalogirou
  1 sibling, 0 replies; 14+ messages in thread
From: Harry Kalogirou @ 2002-10-22 13:56 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086


> On 21 Oct 2002, Harry Kalogirou wrote:
> 
> > Mmm.. weird.. I've probably tired you with all this, but can you try to
> > see whether the failures are really random? A good aid for this is the -p
> > parameter of ping.
> 
> 100 pings (200 packets) each of patterns 00, 55, aa, and ff had zero to 
> five errors, too few to account for the 100 percent failure rate of 
> certain webpage files. 55 had the most errors and was the only one with an 
> error in the pattern data. Most of the other errors were something about 
> the time-of-day going back; 00 had one extremely long response time 
> (1074131 ms).
> 
> 
> I think I can now prove that there's at least one IP Header sum-with-carry
> that results in a reproducible checksum error. I discovered that if the
> ELKS IP address were 192.168.1.135, all my test files could be read; large
> files required a few tries, but I was even able to read one 4369 (0x1111)
> bytes long! The unique property of the packets that never got ACK'ed is 
> that their checksum-field contains 0xF6FF instead of the correct value 
> 0xF5FF (the complement of 0x0A00).
> 
> Each of the webpage files that stall produces a defective packet with this 
> IP Header (the first twenty bytes of the packet):
> 	4500 003f 0000 0000 4006 f6ff c0a8 0164 c0a8 0205
> The corresponding packet in the 99-byte file is one byte shorter (003e 
> instead of 003f), consequently having a different IP Header Checksum 
> (f600 instead of the erroneous f6ff):
> 	4500 003e 0000 0000 4006 f600 c0a8 0164 c0a8 0205
> 
> Ping uses Protocol 01 instead of Protocol 06, so by changing the ELKS IP 
> address from 192.168.1.100 to 192.168.1.105 I was able to produce the 
> identical erroneous IP Header Checksum with the command:
> 	ping -s 35 192.168.1.105
> resulting in the IP Header:
> 	4500 003f 0000 0000 4001 f6ff c0a8 0169 c0a8 0205
> 
> To demonstrate that the problem is not the total packet size I added 1 to 
> the packetsize and subtracted 1 from the ELKS IP address:
> 	ping -s 36 192.168.1.104
> resulting in the IP Header:
> 	4500 0040 0000 0000 4001 f6ff c0a8 0168 c0a8 0205
> 
> Just for symmetry, I produced the same checksum as that for the 99-byte 
> webpage file, but the same length as the 100- and 266-byte webpage files 
> with:
> 	ping -s 35 192.168.1.104
> resulting in the IP Header:
> 	4500 003f 0000 0000 4001 f600 c0a8 0168 c0a8 0205
> 
> In all cases, the pings with the defective checksum had 100% loss, while 
> those with the good checksum succeeded. I didn't try manipulating the 
> source IP address (c0a8 0205 = 192.168.2.5). If you can manipulate the 
> packetsize and ELKS IP address so that the sum-with-carry of this header 
> sans checksum-field is 0x09C1 you should be able to reproduce my results; 
> otherwise it's probably a quirk in Red Hat 7.0 Linux (or you're using a 
> different version of some critical ELKS file).
>

I'll check and get back to you...
 
> Note: I think bad packets consume memory. After several unsuccessful 
> transfers I started seeing "Cannot fork" on the ELKS box when I issued 
> commands ... eventually I'd have to reboot it. It might be a good idea to 
> purge them after a minute or two.

These are the web servers that wait for the data to be transmitted; when
they exit, the memory will be freed.

> Does anything other than the system time depend upon the CMOS clock? It 
> obviously hasn't been read on any of the four machines on which I tried 
> ELKS (yes, they all *have* standard, working CMOS clocks).
> 

I don't think so.

Harry





* [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 10:16         ` jb1
  2002-10-22 13:56           ` Harry Kalogirou
@ 2002-10-22 13:57           ` Harry Kalogirou
  2002-10-22 16:02             ` Harry Kalogirou
  1 sibling, 1 reply; 14+ messages in thread
From: Harry Kalogirou @ 2002-10-22 13:57 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> On 21 Oct 2002, Harry Kalogirou wrote:
> 
> > Mmm.. weird.. I've probably tired you with all this, but can you try to
> > see whether the failures are really random? A good aid for this is the -p
> > parameter of ping.
> 
> 100 pings (200 packets) each of patterns 00, 55, aa, and ff had zero to 
> five errors, too few to account for the 100 percent failure rate of 
> certain webpage files. 55 had the most errors and was the only one with an 
> error in the pattern data. Most of the other errors were something about 
> the time-of-day going back; 00 had one extremely long response time 
> (1074131 ms).
> 
> 
> I think I can now prove that there's at least one IP Header sum-with-carry
> that results in a reproducible checksum error. I discovered that if the
> ELKS IP address were 192.168.1.135, all my test files could be read; large
> files required a few tries, but I was even able to read one 4369 (0x1111)
> bytes long! The unique property of the packets that never got ACK'ed is 
> that their checksum-field contains 0xF6FF instead of the correct value 
> 0xF5FF (the complement of 0x0A00).
> 
> Each of the webpage files that stall produces a defective packet with this 
> IP Header (the first twenty bytes of the packet):
> 	4500 003f 0000 0000 4006 f6ff c0a8 0164 c0a8 0205
> The corresponding packet in the 99-byte file is one byte shorter (003e 
> instead of 003f), consequently having a different IP Header Checksum 
> (f600 instead of the erroneous f6ff):
> 	4500 003e 0000 0000 4006 f600 c0a8 0164 c0a8 0205
> 
> Ping uses Protocol 01 instead of Protocol 06, so by changing the ELKS IP 
> address from 192.168.1.100 to 192.168.1.105 I was able to produce the 
> identical erroneous IP Header Checksum with the command:
> 	ping -s 35 192.168.1.105
> resulting in the IP Header:
> 	4500 003f 0000 0000 4001 f6ff c0a8 0169 c0a8 0205
> 
> To demonstrate that the problem is not the total packet size I added 1 to 
> the packetsize and subtracted 1 from the ELKS IP address:
> 	ping -s 36 192.168.1.104
> resulting in the IP Header:
> 	4500 0040 0000 0000 4001 f6ff c0a8 0168 c0a8 0205
> 
> Just for symmetry, I produced the same checksum as that for the 99-byte 
> webpage file, but the same length as the 100- and 266-byte webpage files 
> with:
> 	ping -s 35 192.168.1.104
> resulting in the IP Header:
> 	4500 003f 0000 0000 4001 f600 c0a8 0168 c0a8 0205
> 
> In all cases, the pings with the defective checksum had 100% loss, while 
> those with the good checksum succeeded. I didn't try manipulating the 
> source IP address (c0a8 0205 = 192.168.2.5). If you can manipulate the 
> packetsize and ELKS IP address so that the sum-with-carry of this header 
> sans checksum-field is 0x09C1 you should be able to reproduce my results; 
> otherwise it's probably a quirk in Red Hat 7.0 Linux (or you're using a 
> different version of some critical ELKS file).
> 

Ok the quest is over.

After all, it was a problem in the checksum functions written in assembly!
Did you try with USE_ASM undefined? Anyway, it works now and I committed
it to the CVS.

Thank you very much for all your effort! Nice work.

Harry






* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 13:57           ` [SOLVED] " Harry Kalogirou
@ 2002-10-22 16:02             ` Harry Kalogirou
  2002-10-23  9:37               ` jb1
  2002-10-29 10:25               ` jb1
  0 siblings, 2 replies; 14+ messages in thread
From: Harry Kalogirou @ 2002-10-22 16:02 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: jb1, Linux-8086



> Ok the quest is over.
> 
> After all, it was a problem in the checksum functions written in assembly!
> Did you try with USE_ASM undefined? Anyway, it works now and I committed
> it to the CVS.
> 
> Thank you very much for all your effort! Nice work.
> 
> Harry


Actually the quest is over now... as previously I managed to commit only
half the patch to the CVS...

Harry





* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 16:02             ` Harry Kalogirou
@ 2002-10-23  9:37               ` jb1
  2002-10-23 11:42                 ` Harry Kalogirou
  2002-10-29 10:25               ` jb1
  1 sibling, 1 reply; 14+ messages in thread
From: jb1 @ 2002-10-23  9:37 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Harry Kalogirou, Linux-8086

On 22 Oct 2002, Harry Kalogirou wrote:

> Actually the quest is over now... as previously I managed to commit only
> half the patch to the CVS...

Maybe not. I found ip.c Version 1.9 by browsing the CVS repository and, as
far as I can tell, the only change was that you moved the first "dec cx";  
this will have *no* effect. The algorithm can still fail if the carry flag
happens to be set going into the routine, or if there is a carry generated
the last time "adc [di]" is executed. I suggest something like this for
_ip_calc_chksum:

        push    bp
        mov     bp,sp
        push    di

        mov     cx, 6[bp]
        sar     cx, 1
        dec     cx
        xor     ax,ax           ; clear carry flag (as well as AX)
        mov     di, 4[bp]
        mov     ax, [di]
        inc     di
        inc     di
loop1:
        adc     ax, [di]
        inc     di
        inc     di

        loop    loop1           ; a byte shorter and a clock faster
                                ;  than DEC CX/JNZ LOOP1

        adc     ax,0            ; add (just) the final carry
        not     ax

        pop     di
        pop     bp

        ret

Of course, this algorithm is valid only if the length (6[bp]) is an even
number of bytes. While this is always true for IP headers, for TCP packet 
checksums there would have to be a final test of the length's low bit and 
appropriate handling of an additional odd byte.
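
As a reference point for the odd-length case, here is a minimal C sketch of
the Internet checksum in the style of RFC 1071. The function name
inet_checksum is hypothetical and this is not the routine ELKS uses for TCP;
it only shows the extra step: when the length is odd, the last byte is
treated as the high byte of a word whose low byte is zero.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over big-endian 16-bit words.
 * Handles an odd trailing byte by padding it with a zero low byte. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i = 0;

    for (; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);

    if (len & 1)                        /* test the length's low bit */
        sum += (uint32_t)data[i] << 8;  /* odd byte is the high byte */

    while (sum >> 16)                   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Run over the 20 header bytes of the stalled packet (checksum field zeroed),
this gives 0xf5ff, the value the packet should have carried.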

I ran the original routine on my "defective" packet IP Header using 
MSDOS' "debug"  (with the carry initially clear and the data 
byte-swapped in memory) and got the correct checksum. Were there any other 
updated files I should have downloaded from the CVS repository?


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-23  9:37               ` jb1
@ 2002-10-23 11:42                 ` Harry Kalogirou
  2002-10-24  8:55                   ` jb1
  0 siblings, 1 reply; 14+ messages in thread
From: Harry Kalogirou @ 2002-10-23 11:42 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> Maybe not. I found ip.c Version 1.9 by browsing the CVS repository and, as
> far as I can tell, the only change was that you moved the first "dec cx";  
> this will have *no* effect. The algorithm can still fail if the carry flag
> happens to be set going into the routine, or if there is a carry generated
> the last time "adc [di]" is executed. I suggest something like this for
> _ip_calc_chksum:
> 
>         push    bp
>         mov     bp,sp
>         push    di
> 
>         mov     cx, 6[bp]
>         sar     cx, 1
>         dec     cx
>         xor     ax,ax           ; clear carry flag (as well as AX)
>         mov     di, 4[bp]
>         mov     ax, [di]
>         inc     di
>         inc     di
> loop1:
>         adc     ax, [di]
>         inc     di
>         inc     di
> 
>         loop    loop1           ; a byte shorter and a clock faster
>                                 ;  than DEC CX/JNZ LOOP1
> 
>         adc     ax,0            ; add (just) the final carry
>         not     ax
> 
>         pop     di
>         pop     bp
> 
>         ret
>

You can't be more right 8). I just thought I could get away without
opening the 8086 instruction manual, and I just made bad assumptions
about when the carry flag is cleared. 

The CVS now contains all your bugfixes (clear carry before entering the loop,
adding last carry), the use of "loop" and I also unrolled the
loop once. A code review would be gladly appreciated.
 
> Of course, this algorithm is valid only if the length (6[bp]) is an even
> number of bytes. While this is always true for IP headers, for TCP packet 
> checksums there would have to be a final test of the length's low bit and 
> appropriate handling of an additional odd byte.

TCP uses another routine.

Harry






* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-23 11:42                 ` Harry Kalogirou
@ 2002-10-24  8:55                   ` jb1
  0 siblings, 0 replies; 14+ messages in thread
From: jb1 @ 2002-10-24  8:55 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Linux-8086

On 23 Oct 2002, Harry Kalogirou wrote:

> The CVS now contains all your bugfixes (clear carry before entering the loop,
> adding last carry), the use of "loop" and I also unrolled the
> loop once. A code review would be gladly appreciated.

The file ip.c Version 1.10 from the CVS repository looks good. I haven't 
tried it yet, but a "toy" version of _ip_calc_chksum runs correctly in 
DEBUG under MSDOS.

There's a trivial change I'd suggest: "SAR CX,1" to "SHR CX,1". SAR
retains the high bit's value (for signed arithmetic), whereas SHR shifts a
zero into the high bit. Since the IP Internet Header Length from which the
length is derived can be no more than 15, and both instructions are two
bytes and take two clocks, this is just defensive programming against a
spurious call with a length greater than 32767. Also, the copyright date 
is still last year's.
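
The difference between the two shifts can be emulated in C on a 16-bit
value. The helper names shr1 and sar1 are hypothetical; they model the effect
of "SHR CX,1" and "SAR CX,1" on the register contents only, not the flags.

```c
#include <stdint.h>

/* SHR CX,1: logical shift right, a zero enters the high bit. */
static uint16_t shr1(uint16_t cx)
{
    return (uint16_t)(cx >> 1);
}

/* SAR CX,1: arithmetic shift right, the high (sign) bit is replicated. */
static uint16_t sar1(uint16_t cx)
{
    return (uint16_t)((cx >> 1) | (cx & 0x8000u));
}
```

For a 20-byte header both give a word count of 10. For a spurious length of
0x8002, shr1 gives 0x4001 words while sar1 gives 0xc001, which the loop
would treat as a far larger count.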


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 16:02             ` Harry Kalogirou
  2002-10-23  9:37               ` jb1
@ 2002-10-29 10:25               ` jb1
  2002-10-29 12:37                 ` Harry Kalogirou
  1 sibling, 1 reply; 14+ messages in thread
From: jb1 @ 2002-10-29 10:25 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Linux-8086

On 22 Oct 2002, Harry Kalogirou wrote:

> > Ok the quest is over.

Not yet. I think _tcp_chksumraw in tcp_output.c needs the same fixes as 
those you applied to _tcp_chksum. Without them I still got partial files 
with telnet/get.

There's *still* something wrong, but it shows up most frequently when I 
urlget from one ELKS box to another (yes, they have different IP 
addresses). Rarely, all goes as it should; more often, the entire file 
comes in a reasonable time, but I never get the command prompt; often, 
nothing comes in and I never get the command prompt. Once, nothing seemed 
to happen for about 10 minutes, but when I checked the machines about 10 
minutes later, the file had come in but there was no command prompt. I had 
enabled a second getty on that machine, so I was able to log in and run 
netstat on both machines while urlget was hung. Here are the results 
(about an hour later):

On the client ("urlget") machine (192.168.1.100) --
1 ESTABLISHED 4000ms 1025       0.0.0.0  2
2 ESTABLISHED 2400ms 1024 192.168.1.144 80
3      LISTEN 4000ms   80       0.0.0.0  0

On the server ("sender") (192.168.1.144) --
1 ESTABLISHED 4000ms 1024       0.0.0.0  2
2      LISTEN 4000ms   80       0.0.0.0  0

Obviously, the server has discarded the connection, but the client machine
thinks it's still connected.

I'm also not sure the client port number (the one that's 1024 or greater) 
is handled properly. Each time I connect from a Linux box the port number 
is incremented, but once I observed that a first, successful, connection 
from ELKS was from port 1024, and the next, hanging, attempt was *also* 
port 1024. Connections from Linux usually, but not always, work; 
connections from ELKS rarely work.

Diagnosing this stuff is very time-consuming because "kill" doesn't seem 
to do anything, so I must reboot both machines. Since ELKS "telnet" 
doesn't do anything but connect (and logs me out when it terminates!) I 
can't compare telnet from Linux and ELKS. I can only compare "urlget" from 
both systems, and since there's no tcpdump for ELKS I can't even determine 
if a failure is actually due to urlget.


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-29 10:25               ` jb1
@ 2002-10-29 12:37                 ` Harry Kalogirou
  0 siblings, 0 replies; 14+ messages in thread
From: Harry Kalogirou @ 2002-10-29 12:37 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> On 22 Oct 2002, Harry Kalogirou wrote:
> 
> > > Ok the quest is over.
> 
> Not yet. I think _tcp_chksumraw in tcp_output.c needs the same fixes as 
> those you applied to _tcp_chksum. Without them I still got partial files 
> with telnet/get.

It is fixed.
 
> There's *still* something wrong, but it shows up most frequently when I 
> urlget from one ELKS box to another (yes, they have different IP 
> addresses). Rarely, all goes as it should; more often, the entire file 
> comes in a reasonable time, but I never get the command prompt; often, 
> nothing comes in and I never get the command prompt. Once, nothing seemed 
> to happen for about 10 minutes, but when I checked the machines about 10 
> minutes later, the file had come in but there was no command prompt. I had 
> enabled a second getty on that machine, so I was able to log in and run 
> netstat on both machines while urlget was hung. Here are the results 
> (about an hour later):
> 
> On the client ("urlget") machine (192.168.1.100) --
> 1 ESTABLISHED 4000ms 1025       0.0.0.0  2
> 2 ESTABLISHED 2400ms 1024 192.168.1.144 80
> 3      LISTEN 4000ms   80       0.0.0.0  0
> 
> 
> On the server ("sender") (192.168.1.144) --
> 1 ESTABLISHED 4000ms 1024       0.0.0.0  2
> 2      LISTEN 4000ms   80       0.0.0.0  0
> 
> Obviously, the server has discarded the connection, but the client machine
> thinks it's still connected.
> 
> I'm also not sure the client port number (the one that's 1024 or greater) 
> is handled properly. Each time I connect from a Linux box the port number 
> is incremented, but once I observed that a first, successful, connection 
> from ELKS was from port 1024, and the next, hanging, attempt was *also* 
> port 1024. Connections from Linux usually, but not always, work; 
> connections from ELKS rarely work.

ELKS reuses the last used port if it is not still in use. I don't think
that this is a problem. 

> Diagnosing this stuff is very time-consuming because "kill" doesn't seem 
> to do anything, so I must reboot both machines. Since ELKS "telnet" 

The kernel in the CVS will probably handle this more gracefully and
actually kill the process.

> doesn't do anything but connect (and logs me out when it terminates!) I 
> can't compare telnet from Linux and ELKS. I can only compare "urlget" from 
> both systems, and since there's no tcpdump for ELKS I can't even determine 
> if a failure is actually due to urlget.

You mean that you do "telnet bla.bla 80" and after you connect you can't
do "get /"?

Harry






* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
       [not found] <Pine.LNX.4.33.0210300110270.32451-100000@olympus.btstream.com>
@ 2002-10-30 10:31 ` Harry Kalogirou
  0 siblings, 0 replies; 14+ messages in thread
From: Harry Kalogirou @ 2002-10-30 10:31 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> On 29 Oct 2002, Harry Kalogirou wrote:
> 
> > > Diagnosing this stuff is very time-consuming because "kill" doesn't seem 
> > > to do anything, so I must reboot both machines. Since ELKS "telnet" 
> > 
> > The kernel in the CVS will probably handle this more gracefully and
> > actually kill the process.
> 
> I don't understand your reply. When I issue "kill <process id>" from the 
> command line the process is still reported by "ps" and there's no evidence 
> that the process has actually been killed. I think kill.c calls the 
> kernel function.
> 

Of course "kill" calls the kernel, and the kernel kills the process.
So try the latest kernel from the CVS and you might get better results.

Harry




