How do packet analyzers stack up in detecting and reporting the simplest and most fundamental indication of an anomaly, the venerable TCP retransmission? I recently looked at five tools ranging from free to $80,000. I did this because I saw something suspicious as reported by the big gun.
In my book I recommend that anytime you see suspect reporting or diagnostics, you verify them at least once the hard way, by hand, so that from then on you know whether they are accurate under those circumstances.
In this particular case, the "AppDoctor" of the high-end tool was informing me that "There are many packet retransmissions. The network may be heavily congested, or there may be an error-prone link." The value it gave me was over 5%, which is certainly of concern. That's 50+ packets out of every 1,000.
I didn't suspect at the time that the network path was congested and didn't want to chase down duplex mismatches and the like, so I wanted a second opinion and ran the trace through the free tool. It came up with all of 3 retransmissions per 1,000 packets, or roughly a third of a percent.
Why the large difference of opinion? Shouldn't packets be cut and dried, i.e. factual? Before answering the question, I'd like to point out some of the idiosyncrasies in how a number of analyzers report TCP retransmissions.
As you may have guessed, the free tool is Wireshark. The $80,000 tool, Opnet’s ACE. I also ran the trace through the latest Network Instruments Observer, WildPackets OmniPeek, and Network General Sniffer analyzers and focused on one section where the Opnet ACE diagnosed 72 retransmissions out of some 1,100 packets.
Observer reported the same high retransmission count. It tried to be extra helpful in noting that they were also of the “too fast retransmission” variety (at or below 180 ms by default) and that they were “excessive” (2% or more of the total packets by default for the critical level). That would have been a great diagnosis if it had been correct. More on this later.
The Sniffer values were a little strange, depending on whether you look at the number of Sniffer symptom objects or the packet summary decodes. The summary decodes contain three "Expert: Retransmission" notifications, yet the tally in the expert summary lists only two, possibly due to a grouping by TCP flow/conversation (i.e. two of the three retransmissions were in the same TCP flow).
So the individual packet retransmission counts for each analyzer were:
- Opnet: 72
- Wireshark: 3
- Observer: 72
- OmniPeek: 3
- Sniffer: 3
As I used to say in my classes, shall we just average the results and call it a day? Not!
The right answer, verified manually, is three, which means Wireshark, OmniPeek and Sniffer were correct in this particular scenario. That's not to say that these tools are always correct in every situation - they aren't. Again, the purpose of this exercise is to verify your data. I'm not picking on any particular tool.
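To put those counts in perspective, here is a minimal Python sketch that converts them to percentages. It assumes the section held exactly 1,100 packets (the text above only says "some 1,100"), and the function name is purely illustrative.

```python
def retransmission_rate(retransmissions: int, total_packets: int) -> float:
    """Return retransmissions as a percentage of all packets."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    return 100.0 * retransmissions / total_packets


# Assumption: "some 1,100 packets" is treated as exactly 1,100.
TOTAL_PACKETS = 1100

# Per-analyzer retransmission counts, as listed above.
counts = {"Opnet": 72, "Wireshark": 3, "Observer": 72, "OmniPeek": 3, "Sniffer": 3}

rates = {tool: retransmission_rate(n, TOTAL_PACKETS) for tool, n in counts.items()}

for tool, rate in rates.items():
    print(f"{tool:10s} {rate:5.2f}%")
```

Under that assumption, 72 retransmissions is well above the 2% level that Observer treats as "excessive" by default, while 3 retransmissions is well below it.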
The large number of false positives in Opnet and Observer was due to their misinterpretation of the TCP close connection sequence.
A graceful TCP close (i.e. not a RST or reset) is a four-packet TCP FIN sequence consisting of a FIN followed by an ACK to close one half of the connection, and then another FIN-ACK pair closes the second half of the connection to bring it to a full close (or in short FIN-ACK, FIN-ACK) as shown in the following figure (from Network General Sniffer).
Textbook TCP Close
In the trace in question, I noticed a different close sequence: FIN one direction, FIN the other direction, ACK the FIN, ACK the FIN (or in short, FIN-FIN-ACK-ACK). Also unusual was the fact that the FIN bit was set in the ACK from the server. The following shows the alternate TCP close.
Non-Textbook Close
Observer and Opnet apparently are fooled into thinking that the final TCP ACK packet is a retransmission since the FIN bit is set again (which is irrelevant as the connection is already closed) and the TCP sequence number matches the previous FIN packet. Sniffer et al. do not report them as retransmissions because they are simply the last of the four packets in a TCP FIN close sequence.
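The difference can be illustrated with a small Python sketch. This is not how any of these analyzers is actually implemented; the `Segment` type, the `count_retransmissions` function and the sequence numbers are all hypothetical. It only shows how a rule that treats every repeated FIN as a retransmission flags the FIN-FIN-ACK-ACK close, while a rule that only counts repeated payload bytes does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    src: str               # sending host (hypothetical label)
    seq: int               # TCP sequence number
    payload_len: int = 0   # bytes of application data
    fin: bool = False      # FIN flag


def count_retransmissions(segments, count_repeated_fin: bool) -> int:
    """Count segments whose (sender, seq) was already used by an earlier
    segment carrying data or a FIN. Pure ACKs are ignored."""
    seen = set()
    hits = 0
    for s in segments:
        if s.payload_len == 0 and not s.fin:
            continue  # pure ACK: consumes no sequence space
        key = (s.src, s.seq)
        if key in seen:
            # Repeated payload always counts; a repeated bare FIN counts
            # only under the strict rule.
            if s.payload_len > 0 or count_repeated_fin:
                hits += 1
        else:
            seen.add(key)
    return hits


# The non-textbook close: FIN, FIN, ACK, then an ACK with the FIN bit set again.
close = [
    Segment("client", seq=1000, fin=True),
    Segment("server", seq=5000, fin=True),
    Segment("client", seq=1001),             # ACK of the server's FIN
    Segment("server", seq=5000, fin=True),   # ACK of the client's FIN, FIN repeated
]

strict = count_retransmissions(close, count_repeated_fin=True)
lenient = count_retransmissions(close, count_repeated_fin=False)
print(strict, lenient)
```

With hundreds of short sessions in a trace, one extra hit per close under the strict rule adds up quickly, which fits the inflated counts described above.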
This particular application used hundreds of such TCP sessions and subsequent closes to transfer a relatively small amount of data, another problem in itself.
The lesson is that when in doubt, seek a second opinion from another tool or roll up your sleeves and perform manual analysis on a small test section of packet data to confirm your suspicions.
Cool Stuff,
We didn't do that bad! (I wrote the TCP analysis part of Wireshark.)
Retransmissions are somewhat tricky to analyze in a capture, in particular since in a capture file you also have to assume that there were missing segments (the capture missed some packets that were actually on the wire).
Wireshark tries to classify retransmissions into three distinct classes:
1. "normal" retransmissions
2. Fast retransmissions
3. (not really retransmissions) out-of-order packets, if the network path doesn't guarantee time-integrity and reorders packets.
I think it does an ok job. Not perfect but reasonable.
Want to try something a bit harder?
If you want to test with something a bit more challenging (I would be pleasantly surprised if the other tools can also do this),
please have a look at the capture I created for you at
http://samba.org/~sahlberg/zero-window-solaris.cap.gz
This trace shows a Solaris client sending data to a server. Eventually the server application hangs (SIGSTOP) and the advertised window fills up completely.
Packet 131 shows the final packet when the advertised window has been completely filled and the client must stop. Do the other tools mark this packet as "WindowFull" or equivalent? It would be interesting for me to know.
After this there are just a bunch of ZeroWindow probes. (Wireshark assumes a packet is a zero window probe IF the window is full and IF the segment contains exactly one byte of data and IF this byte of data is immediately to the right of the right edge of the window.)
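The three conditions above can be written as a small Python predicate. This is only a sketch of the rule as described in this comment, not Wireshark's actual code; the function and parameter names are made up for illustration.

```python
SEQ_MOD = 2**32  # TCP sequence numbers wrap at 32 bits


def window_right_edge(last_ack: int, advertised_window: int) -> int:
    """First sequence number the receiver is NOT yet willing to accept."""
    return (last_ack + advertised_window) % SEQ_MOD


def is_zero_window_probe(seg_seq: int, seg_len: int,
                         sender_next_seq: int, right_edge: int) -> bool:
    """Apply the three conditions from the text:
    1. the window is full (the sender has already sent up to the right edge),
    2. the segment carries exactly one byte of data,
    3. that byte sits immediately to the right of the right edge.
    """
    window_full = sender_next_seq == right_edge
    return window_full and seg_len == 1 and seg_seq == right_edge


# Hypothetical numbers: receiver acked 9000 and advertises a zero window.
edge = window_right_edge(last_ack=9000, advertised_window=0)
probe = is_zero_window_probe(seg_seq=9000, seg_len=1,
                             sender_next_seq=9000, right_edge=edge)
print(edge, probe)
```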
You can also have a look at
http://samba.org/~sahlberg/zero-window-linux.cap.gz
which shows the same thing but for a Linux client.
Note that the Linux stack does not technically use ZeroWindow probes but instead just issues KeepAlive packets (0 or 1 byte of random data immediately prior to the left edge of the window).
The purpose and end result are the same, but technically speaking these are not ZeroWindow probes.
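For comparison, the KeepAlive shape described here (0 or 1 byte of data, one position before the left edge of the window) can be sketched the same way in Python. Again, the function and parameter names are hypothetical and this is not taken from any analyzer's source.

```python
SEQ_MOD = 2**32  # TCP sequence numbers wrap at 32 bits


def is_keepalive(seg_seq: int, seg_len: int, window_left_edge: int) -> bool:
    """True if the segment carries 0 or 1 byte and its sequence number is
    one less than the left edge of the window (the next byte the receiver
    expects), allowing for 32-bit wraparound."""
    one_before_left_edge = (window_left_edge - 1) % SEQ_MOD
    return seg_len in (0, 1) and seg_seq == one_before_left_edge


# Hypothetical numbers: the receiver has acked everything up to 9000.
print(is_keepalive(seg_seq=8999, seg_len=1, window_left_edge=9000))
print(is_keepalive(seg_seq=9000, seg_len=1, window_left_edge=9000))
```

The second call has the shape of a zero window probe rather than a KeepAlive, since its byte sits at the edge instead of just before it.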
ronnie s
Posted by: Ronnie Sahlberg | March 24, 2008 at 11:01 PM