public inbox for linux-8086@vger.kernel.org
* [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 10:16 jb1
@ 2002-10-22 13:57 ` Harry Kalogirou
  2002-10-22 16:02   ` Harry Kalogirou
  0 siblings, 1 reply; 8+ messages in thread
From: Harry Kalogirou @ 2002-10-22 13:57 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> On 21 Oct 2002, Harry Kalogirou wrote:
> 
> > Mmm.. weird.. I probably got you tired with all this but can you try and
> > see if the failures are really random? A good aid for this is the -p
> > parameter of ping.
> 
> 100 pings (200 packets) each of patterns 00, 55, aa, and ff had zero to 
> five errors, too few to account for the 100 percent failure rate of 
> certain webpage files. 55 had the most errors and was the only one with an 
> error in the pattern data. Most of the other errors were something about 
> the time-of-day going back; 00 had one extremely long response time 
> (1074131 ms).
> 
> 
> I think I can now prove that there's at least one IP Header sum-with-carry
> that results in a reproducible checksum error. I discovered that if the
> ELKS IP address were 192.168.1.135, all my test files could be read; large
> files required a few tries, but I was even able to read one 4369 (0x1111)
> bytes long! The unique property of the packets that never got ACK'ed is 
> that their checksum-field contains 0xF6FF instead of the correct value 
> 0xF5FF (the complement of 0x0A00).
> 
> Each of the webpage files that stall produces a defective packet with this 
> IP Header (the first twenty bytes of the packet):
> 	4500 003f 0000 0000 4006 f6ff c0a8 0164 c0a8 0205
> The corresponding packet in the 99-byte file is one byte shorter (003e 
> instead of 003f), consequently having a different IP Header Checksum 
> (f600 instead of the erroneous f6ff):
> 	4500 003e 0000 0000 4006 f600 c0a8 0164 c0a8 0205
> 
> Ping uses Protocol 01 instead of Protocol 06, so by changing the ELKS IP 
> address from 192.168.1.100 to 192.168.1.105 I was able to produce the 
> identical erroneous IP Header Checksum with the command:
> 	ping -s 35 192.168.1.105
> resulting in the IP header:
> 	4500 003f 0000 0000 4001 f6ff c0a8 0169 c0a8 0205
> 
> To demonstrate that the problem is not the total packet size I added 1 to 
> the packetsize and subtracted 1 from the ELKS IP address:
> 	ping -s 36 192.168.1.104
> resulting in the IP Header:
> 	4500 0040 0000 0000 4001 f6ff c0a8 0168 c0a8 0205
> 
> Just for symmetry, I produced the same checksum as that for the 99-byte 
> webpage file, but the same length as the 100- and 266 byte webpage files 
> with:
> 	ping -s 35 192.168.1.104
> resulting in the IP Header:
> 	4500 003f 0000 0000 4001 f600 c0a8 0168 c0a8 0205
> 
> In all cases, the pings with the defective checksum had 100% loss, while 
> those with the good checksum succeeded. I didn't try manipulating the 
> source IP address (c0a8 0205 = 192.168.2.5). If you can manipulate the 
> packetsize and ELKS IP address so that the sum-with-carry of this header 
> sans checksum-field is 0x09C1 you should be able to reproduce my results; 
> otherwise it's probably a quirk in Red Hat 7.0 Linux (or you're using a 
> different version of some critical ELKS file).
> 
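The header arithmetic quoted above can be checked with a short C sketch. This is not ELKS code: the function name ip_header_checksum and the array names are made up for illustration. It sums the ten 16-bit header words in network order with the checksum field zeroed, folds the carries back into the low 16 bits, and complements the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement checksum over 16-bit words given in network order.
 * The checksum field of the header must be zero in the input.
 * Hypothetical helper for illustration, not taken from ELKS. */
static uint16_t ip_header_checksum(const uint16_t *words, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += words[i];
    while (sum >> 16)                        /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Headers from the message above, checksum field (6th word) zeroed. */
static const uint16_t hdr_stalling[10] = {   /* webpage files that stall */
    0x4500, 0x003f, 0x0000, 0x0000, 0x4006, 0x0000,
    0xc0a8, 0x0164, 0xc0a8, 0x0205
};
static const uint16_t hdr_99_byte[10] = {    /* 99-byte file, one byte shorter */
    0x4500, 0x003e, 0x0000, 0x0000, 0x4006, 0x0000,
    0xc0a8, 0x0164, 0xc0a8, 0x0205
};
static const uint16_t hdr_ping_36_104[10] = { /* ping -s 36 192.168.1.104 */
    0x4500, 0x0040, 0x0000, 0x0000, 0x4001, 0x0000,
    0xc0a8, 0x0168, 0xc0a8, 0x0205
};
static const uint16_t hdr_ping_35_104[10] = { /* ping -s 35 192.168.1.104 */
    0x4500, 0x003f, 0x0000, 0x0000, 0x4001, 0x0000,
    0xc0a8, 0x0168, 0xc0a8, 0x0205
};
```

Under these assumptions the stalling header and the "ping -s 36" header both come out as 0xF5FF (the value the report calls correct), while the 99-byte header and the "ping -s 35 192.168.1.104" header come out as 0xF600.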

Ok the quest is over.

After all, it was a problem with the checksum functions written in assembly!
Did you try with USE_ASM undefined? Anyway, it works now and I committed
it to the CVS.

Thank you very much for all your effort! Nice work.

Harry


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 13:57 ` [SOLVED] " Harry Kalogirou
@ 2002-10-22 16:02   ` Harry Kalogirou
  2002-10-23  9:37     ` jb1
  2002-10-29 10:25     ` jb1
  0 siblings, 2 replies; 8+ messages in thread
From: Harry Kalogirou @ 2002-10-22 16:02 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: jb1, Linux-8086

> Ok the quest is over.
> 
> After all, it was a problem with the checksum functions written in assembly!
> Did you try with USE_ASM undefined? Anyway, it works now and I committed
> it to the CVS.
> 
> Thank you very much for all your effort! Nice work.
> 
> Harry


Actually the quest is over now... as previously I managed to commit half
the patch to the CVS...

Harry


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 16:02   ` Harry Kalogirou
@ 2002-10-23  9:37     ` jb1
  2002-10-23 11:42       ` Harry Kalogirou
  2002-10-29 10:25     ` jb1
  1 sibling, 1 reply; 8+ messages in thread
From: jb1 @ 2002-10-23  9:37 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Harry Kalogirou, Linux-8086

On 22 Oct 2002, Harry Kalogirou wrote:

> Actually the quest is over now... as previously I managed to commit half
> the patch to the CVS...

Maybe not. I found ip.c Version 1.9 by browsing the CVS repository and, as
far as I can tell, the only change was that you moved the first "dec cx";  
this will have *no* effect. The algorithm can still fail if the carry flag
happens to be set going into the routine, or if there is a carry generated
the last time "adc [di]" is executed. I suggest something like this for
_ip_calc_chksum:

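        push    bp
        mov     bp,sp
        push    di

        mov     cx, 6[bp]       ; length of the data in bytes
        sar     cx, 1           ; convert to a count of 16-bit words
        dec     cx              ; first word is loaded below, not added
        xor     ax,ax           ; clear carry flag (as well as AX)
        mov     di, 4[bp]       ; pointer to the data
        mov     ax, [di]        ; start the sum with the first word
        inc     di
        inc     di
loop1:
        adc     ax, [di]        ; add next word plus any pending carry
        inc     di              ; INC leaves the carry flag untouched
        inc     di

        loop    loop1           ; a byte shorter and a clock faster
                                ;  than DEC CX/JNZ LOOP1

        adc     ax,0            ; add (just) the final carry
        not     ax              ; ones' complement of the sum

        pop     di
        pop     bp

        ret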

Of course, this algorithm is valid only if the length (6[bp]) is an even
number of bytes. While this is always true for IP headers, for TCP packet 
checksums there would have to be a final test of the length's low bit and 
appropriate handling of an additional odd byte.
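
For the odd-length case mentioned above, a minimal C sketch of the usual approach follows. The convention assumed here is that a trailing odd byte is treated as the high byte of a final word whose low byte is zero. The name inet_checksum and the sample arrays are hypothetical, and this is not the routine ELKS uses for TCP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones' complement checksum over a byte buffer of any length.
 * Bytes are paired into big-endian 16-bit words; an odd trailing byte
 * is padded on the right with a zero byte. Illustrative helper only. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                              /* test the length's low bit */
        sum += (uint32_t)data[len - 1] << 8;  /* odd byte, zero-padded */
    while (sum >> 16)                         /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Sample data: three bytes, and the same three bytes padded with zero. */
static const uint8_t sample_odd[3]  = { 0x12, 0x34, 0x56 };
static const uint8_t sample_even[4] = { 0x12, 0x34, 0x56, 0x00 };
```

With this sketch the three-byte buffer and its zero-padded four-byte version give the same checksum, which is the property the extra low-bit test is meant to guarantee.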


I ran the original routine on my "defective" packet IP Header using 
MSDOS' "debug"  (with the carry initially clear and the data 
byte-swapped in memory) and got the correct checksum. Were there any other 
updated files I should have downloaded from the CVS repository?


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-23  9:37     ` jb1
@ 2002-10-23 11:42       ` Harry Kalogirou
  2002-10-24  8:55         ` jb1
  0 siblings, 1 reply; 8+ messages in thread
From: Harry Kalogirou @ 2002-10-23 11:42 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> Maybe not. I found ip.c Version 1.9 by browsing the CVS repository and, as
> far as I can tell, the only change was that you moved the first "dec cx";  
> this will have *no* effect. The algorithm can still fail if the carry flag
> happens to be set going into the routine, or if there is a carry generated
> the last time "adc [di]" is executed. I suggest something like this for
> _ip_calc_chksum:
> 
>         push    bp
>         mov     bp,sp
>         push    di
> 
>         mov     cx, 6[bp]
>         sar     cx, 1
>         dec     cx
>         xor     ax,ax           ; clear carry flag (as well as AX)
>         mov     di, 4[bp]
>         mov     ax, [di]
>         inc     di
>         inc     di
> loop1:
>         adc     ax, [di]
>         inc     di
>         inc     di
> 
>         loop    loop1;          ; a byte shorter and a clock faster
>                                 ;  than DEC CX/JNZ LOOP1
> 
>         adc     ax,0            ; add (just) the final carry
>         not     ax
> 
>         pop di
>         pop bp
> 
>         ret
>

You can't be more right 8). I just thought I could get away without
opening the 8086 instruction manual, and I just made bad assumptions
about when the carry flag is cleared. 

The CVS now contains all your bugfixes (clear carry before entering the loop,
adding last carry), the use of "loop" and I also unrolled the
loop once. A code review would be gladly appreciated.
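
The two fixes listed above (clearing the carry before the loop, adding the last carry) can be illustrated with a small C emulation of the ADC chain. This is a sketch, not the ELKS routine: adc_chain, to_wire and the parameters carry_in and add_final_carry are hypothetical names, and the original buggy assembly is not shown in this thread. The words are the stalling IP header as a little-endian 8086 reads them from memory, with the checksum field zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulate "adc ax,[di]" over n words, then optionally "adc ax,0",
 * then "not ax". carry_in models the carry flag on entry. */
static uint16_t adc_chain(const uint16_t *w, size_t n,
                          int carry_in, int add_final_carry)
{
    uint16_t ax = 0;
    unsigned cf = carry_in ? 1u : 0u;

    for (size_t i = 0; i < n; i++) {
        uint32_t t = (uint32_t)ax + w[i] + cf;  /* adc ax,[di] */
        ax = (uint16_t)t;
        cf = (unsigned)(t >> 16);               /* carry out of bit 15 */
    }
    if (add_final_carry)
        ax = (uint16_t)(ax + cf);               /* adc ax,0 */
    return (uint16_t)~ax;                       /* not ax */
}

/* Swap bytes to show the value as it appears on the wire. */
static uint16_t to_wire(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* 4500 003f 0000 0000 4006 0000 c0a8 0164 c0a8 0205, byte-swapped. */
static const uint16_t hdr_le[10] = {
    0x0045, 0x3f00, 0x0000, 0x0000, 0x0640, 0x0000,
    0xa8c0, 0x6401, 0xa8c0, 0x0502
};
```

In this emulation, a clear carry on entry plus the final "adc ax,0" gives F5FF on the wire. Dropping the final carry gives F6FF, the erroneous value reported for the stalling packets, and entering with the carry set gives F4FF.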
 
> Of course, this algorithm is valid only if the length (6[bp]) is an even
> number of bytes. While this is always true for IP headers, for TCP packet 
> checksums there would have to be a final test of the length's low bit and 
> appropriate handling of an additional odd byte.

TCP uses another routine.

Harry


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-23 11:42       ` Harry Kalogirou
@ 2002-10-24  8:55         ` jb1
  0 siblings, 0 replies; 8+ messages in thread
From: jb1 @ 2002-10-24  8:55 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Linux-8086

On 23 Oct 2002, Harry Kalogirou wrote:

> The CVS now contains all your bugfixes (clear carry before entering the loop,
> adding last carry), the use of "loop" and I also unrolled the
> loop once. A code review would be gladly appreciated.

The file ip.c Version 1.10 from the CVS repository looks good. I haven't 
tried it yet, but a "toy" version of _ip_calc_chksum runs correctly in 
DEBUG under MSDOS.

There's a trivial change I'd suggest: "SAR CX,1" to "SHR CX,1". SAR
retains the high bit's value (for signed arithmetic), whereas SHR shifts a
zero into the high bit. Since the IP Internet Header Length from which the
length is derived can be no more than 15, and both instructions are two
bytes and take two clocks, this is just defensive programming against a
spurious call with a length greater than 32767. Also, the copyright date 
is still last year's.
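
The difference between the two shifts can be shown with a small C sketch that models each instruction on a 16-bit value. The names shr1 and sar1 are made up for illustration; sar1 copies bit 15 explicitly rather than relying on how a C compiler shifts negative numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "shr cx,1": logical shift, a zero enters bit 15. */
static uint16_t shr1(uint16_t cx)
{
    return (uint16_t)(cx >> 1);
}

/* Model of "sar cx,1": arithmetic shift, bit 15 keeps its old value. */
static uint16_t sar1(uint16_t cx)
{
    return (uint16_t)((cx >> 1) | (cx & 0x8000u));
}
```

For any length up to 32767 the two agree (a 20-byte header gives 10 words either way). For a spurious length such as 0x8002, shr1 yields 0x4001 words while sar1 yields 0xC001, which is the case the suggested change guards against.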


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 16:02   ` Harry Kalogirou
  2002-10-23  9:37     ` jb1
@ 2002-10-29 10:25     ` jb1
  2002-10-29 12:37       ` Harry Kalogirou
  1 sibling, 1 reply; 8+ messages in thread
From: jb1 @ 2002-10-29 10:25 UTC (permalink / raw)
  To: Harry Kalogirou; +Cc: Linux-8086

On 22 Oct 2002, Harry Kalogirou wrote:

> > Ok the quest is over.

Not yet. I think _tcp_chksumraw in tcp_output.c needs the same fixes as 
those you applied to _tcp_chksum. Without them I still got partial files 
with telnet/get.

There's *still* something wrong, but it shows up most frequently when I 
urlget from one ELKS box to another (yes, they have different IP 
addresses). Rarely, all goes as it should; more often, the entire file 
comes in a reasonable time, but I never get the command prompt; often, 
nothing comes in and I never get the command prompt. Once, nothing seemed 
to happen for about 10 minutes, but when I checked the machines about 10 
minutes later, the file had come in but there was no command prompt. I had 
enabled a second getty on that machine, so I was able to log in and run 
netstat on both machines while urlget was hung. Here are the results 
(about an hour later):

On the client ("urlget") machine (192.168.1.100) --
1 ESTABLISHED 4000ms 1025       0.0.0.0  2
2 ESTABLISHED 2400ms 1024 192.168.1.144 80
3      LISTEN 4000MS   80       0.0.0.0  0


On the server ("sender") (192.168.1.144) --
1 ESTABLISHED 4000ms 1024       0.0.0.0  2
2      LISTEN 4000ms   80       0.0.0.0  0

Obviously, the server has discarded the connection, but the client machine
thinks it's still connected.

I'm also not sure the client port number (the one that's 1024 or greater) 
is handled properly. Each time I connect from a Linux box the port number 
is incremented, but once I observed that a first, successful, connection 
from ELKS was from port 1024, and the next, hanging, attempt was *also* 
port 1024. Connections from Linux usually, but not always, work; 
connections from ELKS rarely work.

Diagnosing this stuff is very time-consuming because "kill" doesn't seem 
to do anything, so I must reboot both machines. Since ELKS "telnet" 
doesn't do anything but connect (and logs me out when it terminates!) I 
can't compare telnet from Linux and ELKS. I can only compare "urlget" from 
both systems, and since there's no tcpdump for ELKS I can't even determine 
if a failure is actually due to urlget.


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-29 10:25     ` jb1
@ 2002-10-29 12:37       ` Harry Kalogirou
  0 siblings, 0 replies; 8+ messages in thread
From: Harry Kalogirou @ 2002-10-29 12:37 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> On 22 Oct 2002, Harry Kalogirou wrote:
> 
> > > Ok the quest is over.
> 
> Not yet. I think _tcp_chksumraw in tcp_output.c needs the same fixes as 
> those you applied to _tcp_chksum. Without them I still got partial files 
> with telnet/get.

It is fixed.
 
> There's *still* something wrong, but it shows up most frequently when I 
> urlget from one ELKS box to another (yes, they have different IP 
> addresses). Rarely, all goes as it should; more often, the entire file 
> comes in a reasonable time, but I never get the command prompt; often, 
> nothing comes in and I never get the command prompt. Once, nothing seemed 
> to happen for about 10 minutes, but when I checked the machines about 10 
> minutes later, the file had come in but there was no command prompt. I had 
> enabled a second getty on that machine, so I was able to log in and run 
> netstat on both machines while urlget was hung. Here are the results 
> (about an hour later):
> 
> On the client ("urlget") machine (192.168.1.100) --
> 1 ESTABLISHED 4000ms 1025       0.0.0.0  2
> 2 ESTABLISHED 2400ms 1024 192.168.1.144 80
> 3      LISTEN 4000MS   80       0.0.0.0  0
> 
> 
> On the server ("sender") (192.168.1.144) --
> 1 ESTABLISHED 4000ms 1024       0.0.0.0  2
> 2      LISTEN 4000ms   80       0.0.0.0  0
> 
> Obviously, the server has discarded the connection, but the client machine
> thinks it's still connected.
> 
> I'm also not sure the client port number (the one that's 1024 or greater) 
> is handled properly. Each time I connect from a Linux box the port number 
> is incremented, but once I observed that a first, successful, connection 
> from ELKS was from port 1024, and the next, hanging, attempt was *also* 
> port 1024. Connections from Linux usually, but not always, work; 
> connections from ELKS rarely work.

ELKS reuses the last used port if it is not still in use. I don't think
that this is a problem. 

> Diagnosing this stuff is very time-consuming because "kill" doesn't seem 
> to do anything, so I must reboot both machines. Since ELKS "telnet" 

The kernel in the CVS will probably handle this more gracefully and
actually [kill] the process.

> doesn't do anything but connect (and logs me out when it terminates!) I 
> can't compare telnet from Linux and ELKS. I can only compare "urlget" from 
> both systems, and since there's no tcpdump for ELKS I can't even determine 
> if a failure is actually due to urlget.

You mean that you do "telnet bla.bla 80" and after you connect you can't
do "get /"?

Harry


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
       [not found] <Pine.LNX.4.33.0210300110270.32451-100000@olympus.btstream.com>
@ 2002-10-30 10:31 ` Harry Kalogirou
  0 siblings, 0 replies; 8+ messages in thread
From: Harry Kalogirou @ 2002-10-30 10:31 UTC (permalink / raw)
  To: jb1; +Cc: Linux-8086

> On 29 Oct 2002, Harry Kalogirou wrote:
> 
> > > Diagnosing this stuff is very time-consuming because "kill" doesn't seem 
> > > to do anything, so I must reboot both machines. Since ELKS "telnet" 
> > 
> > The kernel in the CVS will probably handle this more gracefully and
> > actually [kill] the process.
> 
> I don't understand your reply. When I issue "kill <process id>" from the 
> command line the process is still reported by "ps" and there's no evidence 
> that the process has actually been killed. I think kill.c calls the 
> kernel function.
> 

Of course "kill" calls the kernel and the kernel kills the process.
So try the latest kernel from the CVS and you might get better results.

Harry
