TCP retransmissions are one of the most common symptoms we look for as evidence of dropped packets in a network. Virtually every protocol analyzer today will alert you when it detects a retransmission.
The recovery mechanism for dropped or delayed TCP packets has changed over the years. The question is: Have analysis tools kept up?
A receiver’s TCP stack can address lost (or delayed) packets in a number of ways, including:
- Acknowledging up to and including the last TCP segment that has been contiguously received (i.e., there are no gaps in the received byte stream);
- Sending “fast” duplicate ACKs immediately upon sensing a gap; or,
- Using the selective acknowledgement (SACK) feature.
Acknowledging only the last “good” segment received (the first two techniques above) causes the sender to back up and resend from that point forward. Because the TCP windowing mechanism allows multiple packets to be sent before an ACK is required, more than one packet has typically already been sent. This means that not only is the missing segment resent, but all subsequent segments as well, even if they already reached their destination. The problem worsens as networks get faster and window sizes get larger (i.e., window scaling beyond 65,535 bytes), because more packets are outstanding at any given moment.
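To put rough numbers on this, here is a simplified Python model. It assumes, as described above, that a sender without SACK backs up to the first missing segment and resends everything after it (real stacks differ in the details). It counts segments rather than bytes, and the 1,460-byte MSS and the function names are hypothetical choices for illustration.

```python
def go_back_resends(sent: range, lost: set) -> list:
    """Segments resent when the sender backs up to the first missing one."""
    first_missing = min(lost)
    return [s for s in sent if s >= first_missing]


def sack_resends(sent: range, lost: set) -> list:
    """Segments resent when SACK blocks tell the sender what actually arrived."""
    return [s for s in sent if s in lost]


def segments_per_window(window: int, mss: int, wscale: int = 0) -> int:
    """Full-sized segments in one window; window scaling shifts the window left (RFC 7323)."""
    return (window << wscale) // mss


n = segments_per_window(65535, 1460)            # 44 segments without scaling
n_scaled = segments_per_window(65535, 1460, 4)  # 718 segments with a shift count of 4
lost = {10}                                     # one segment dropped early in the window
print(len(go_back_resends(range(n), lost)), len(sack_resends(range(n), lost)))                # 34 1
print(len(go_back_resends(range(n_scaled), lost)), len(sack_resends(range(n_scaled), lost)))  # 708 1
```

With one segment lost early in a 44-segment window, backing up resends 34 segments where SACK resends one; with a window scale shift count of 4 the same loss costs 708 resent segments versus one.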
There is a better way to recover lost TCP segments. A receiver can use SACKs to inform the sender of every block of data that has arrived, with each block expressed as a range of sequence numbers. Up to four blocks can be acknowledged in one SACK option (three when the TCP timestamp option is also present, since it uses part of the 40 bytes available for TCP options). Note that a receiver can only send SACKs if the sender has indicated that it supports them. You can check for this in the TCP SYN and TCP SYN-ACK packets with your analyzer by looking for the SACK-Permitted option in the TCP options field.
For more operational details and some complex scenarios, please refer to RFC 2018, “TCP Selective Acknowledgement Options”.
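The option layout from RFC 2018 can be sketched as a small Python parser. SACK-Permitted is option kind 4 (carried only in SYNs), and the SACK option is kind 5, holding pairs of 32-bit left and right sequence-number edges (the right edge is the sequence number just past the last byte of the block). The function names are hypothetical, and the sketch assumes you have already extracted the raw options bytes from the TCP header.

```python
import struct

# TCP option kinds used here (RFC 793 / RFC 2018)
OPT_EOL = 0             # end of option list
OPT_NOP = 1             # one-byte padding
OPT_SACK_PERMITTED = 4  # sent only in SYN / SYN-ACK
OPT_SACK = 5            # list of received blocks


def parse_sack_options(options: bytes) -> dict:
    """Return SACK-Permitted and any SACK blocks found in a raw TCP options field."""
    result = {"sack_permitted": False, "sack_blocks": []}
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == OPT_EOL:
            break
        if kind == OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option: stop parsing
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed length: stop parsing
        data = options[i + 2 : i + length]
        if kind == OPT_SACK_PERMITTED:
            result["sack_permitted"] = True
        elif kind == OPT_SACK:
            # Each block is a pair of 32-bit sequence numbers: left edge, right edge.
            for off in range(0, len(data) - 7, 8):
                left, right = struct.unpack("!II", data[off : off + 8])
                result["sack_blocks"].append((left, right))
        i += length
    return result


def max_sack_blocks(other_option_bytes: int = 0) -> int:
    """Blocks that fit in 40 bytes of options: 2 bytes of kind/length plus 8 per block."""
    return (40 - other_option_bytes - 2) // 8


syn_opts = bytes([2, 4, 0x05, 0xB4, OPT_SACK_PERMITTED, 2])  # MSS 1460 + SACK-Permitted
ack_opts = bytes([OPT_NOP, OPT_NOP, OPT_SACK, 18]) + struct.pack("!IIII", 1000, 2000, 3000, 4000)
print(parse_sack_options(syn_opts))  # {'sack_permitted': True, 'sack_blocks': []}
print(parse_sack_options(ack_opts))  # {'sack_permitted': False, 'sack_blocks': [(1000, 2000), (3000, 4000)]}
print(max_sack_blocks(), max_sack_blocks(12))  # 4 3 (12 = timestamp option plus 2 bytes of padding)
```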
How well does SACK work?
Analysis of SACK in action shows that it does cut down on the number of packets that are retransmitted. However, I’ve noticed some caveats in SACK behavior, as well as in how certain network analysis tools (specifically, their expert systems) report this behavior.
In theory, the SACK mechanism should also cut down on delay due to dropped packets. The retransmitted packets should be streamed right into the regular flow from the sender without hesitation.
In practice, I’ve noticed that this is not always the case. For example, I examined a remote file transfer between a client and server over a WAN experiencing some packet loss. Whenever the server resent a packet in response to a SACK from the client, it would often pause for up to several hundred milliseconds between the last good packet sent and the recovered packet (far longer than the WAN’s round-trip time). This was followed by a similar delay before the stream got going again.
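How can you analyze these SACK recovery delays? A good place to start is to apply a display filter that finds TCP packets with SACK information in the header. A quick and dirty way is to look for TCP headers longer than 20 bytes, the size of a TCP header with no options. As mentioned previously, SYN packets advertise the sender’s capabilities in the options field and therefore also have longer headers, so the filter should exclude them. Be aware that if the TCP timestamp option is in use, every segment has a header longer than 20 bytes; in that case it is better to filter on the SACK option itself.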
The reason I suggest using a display filter and not a capture filter is that this is a situation where you’ll want to capture all TCP packets between a client and server and then apply some post-capture analysis. If you only capture packets with SACK information, you’ll see which packets needed to be retransmitted, but you won’t be able to deduce how long the sender took to recover and actually resend them. You may also wish to trigger a capture on a SACK packet; chances are that if you see one, you’ll see more.
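Here is the header-length heuristic next to a stricter check for the SACK option itself, sketched in Python on raw TCP header bytes. The function names and the `make_header` test helper are hypothetical. In Wireshark’s display-filter syntax, the quick and dirty version would be roughly `tcp.hdr_len > 20 && tcp.flags.syn == 0`, but check the field names against your own analyzer.

```python
import struct

SYN, ACK = 0x02, 0x10
OPT_SACK = 5


def header_fields(tcp: bytes):
    """Return (header length in bytes, flags byte, options bytes) from a raw TCP header."""
    header_len = (tcp[12] >> 4) * 4  # data offset is counted in 32-bit words
    return header_len, tcp[13], tcp[20:header_len]


def option_kinds(options: bytes) -> list:
    """List the option kinds present, skipping NOP padding and stopping at EOL."""
    kinds, i = [], 0
    while i < len(options):
        kind = options[i]
        if kind == 0:
            break
        if kind == 1:
            i += 1
            continue
        if i + 1 >= len(options) or options[i + 1] < 2:
            break  # malformed option: stop parsing
        kinds.append(kind)
        i += options[i + 1]
    return kinds


def quick_and_dirty(tcp: bytes) -> bool:
    """Header longer than 20 bytes and not a SYN."""
    header_len, flags, _ = header_fields(tcp)
    return header_len > 20 and not flags & SYN


def has_sack(tcp: bytes) -> bool:
    """Non-SYN segment that actually carries a SACK option."""
    _, flags, options = header_fields(tcp)
    return not flags & SYN and OPT_SACK in option_kinds(options)


def make_header(flags: int, options: bytes = b"") -> bytes:
    """Hypothetical test helper: build a TCP header (caller pads options to 4 bytes)."""
    header_len = 20 + len(options)
    return struct.pack("!HHIIBBHHH", 50000, 445, 1, 1, (header_len // 4) << 4,
                       flags, 65535, 0, 0) + options


ts = bytes([1, 1, 8, 10]) + struct.pack("!II", 111, 222)  # NOP NOP Timestamps
sack = bytes([1, 1, OPT_SACK, 10]) + struct.pack("!II", 3000, 4000)
syn_opts = bytes([2, 4, 0x05, 0xB4, 4, 2, 1, 1])          # MSS, SACK-Permitted, NOPs

for name, hdr in [("plain ACK", make_header(ACK)),
                  ("timestamp ACK", make_header(ACK, ts)),
                  ("SACK ACK", make_header(ACK, ts + sack)),
                  ("SYN", make_header(SYN, syn_opts))]:
    print(name, quick_and_dirty(hdr), has_sack(hdr))
```

The timestamp-only ACK shows why the length heuristic can mislead: it passes the 20-byte test even though it carries no SACK information, while the SACK ACK passes both checks and the SYN passes neither.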
If the analyzer is capturing close to the source, you’ll probably also see TCP retransmissions flagged by your expert system, or duplicate TCP sequence numbers if you like doing things by hand. (Be sure to check that the IP ID differs in the repeated packet; an identical IP ID suggests the same packet was captured twice, which can happen when analyzing VLANs off a SPAN port.) One pitfall is that you will not see TCP retransmissions flagged as such if the original packets were dropped before reaching the segment from which you are capturing, because only one copy of each packet ever passes your analyzer.
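Doing this by hand amounts to the following check, sketched in Python for one direction of a single connection. It assumes you have already pulled the IP ID, sequence number, and payload length out of each packet; `Pkt` and `classify_repeats` are hypothetical names, and sequence-number wraparound is ignored. The IP ID test is only a heuristic, since not every stack changes the IP ID on every packet.

```python
from typing import List, NamedTuple, Tuple


class Pkt(NamedTuple):
    ip_id: int   # IPv4 Identification field
    seq: int     # TCP sequence number
    length: int  # TCP payload length in bytes


def classify_repeats(packets: List[Pkt]) -> Tuple[List[Pkt], List[Pkt]]:
    """Split repeated sequence numbers into likely retransmissions and likely capture duplicates."""
    first_ip_id = {}  # seq -> IP ID of the first copy seen
    retransmissions, duplicates = [], []
    for p in packets:
        if p.length == 0:
            continue  # pure ACKs legitimately repeat sequence numbers
        if p.seq not in first_ip_id:
            first_ip_id[p.seq] = p.ip_id
        elif p.ip_id == first_ip_id[p.seq]:
            duplicates.append(p)  # same IP ID: probably one packet seen twice (e.g. SPAN)
        else:
            retransmissions.append(p)  # new IP ID: the sender really sent it again
    return retransmissions, duplicates


capture = [Pkt(100, 1000, 1460), Pkt(100, 1000, 1460),  # copied twice by the SPAN port
           Pkt(101, 2460, 1460), Pkt(150, 1000, 1460)]  # seq 1000 resent later
retx, dupes = classify_repeats(capture)
print(len(retx), len(dupes))  # 1 1
```

As with an expert system, this finds nothing when the original packet was dropped upstream of the capture point, since the analyzer only ever sees one copy.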
What then? Focus on packets that contain SACK information. You will typically see a few SACKs reporting a widening range of bytes received until the missing segment or segments arrive. When the SACKs stop, you know that the segment(s) in question have been resent. Checking the sequence number of the first data packet following a “SACK burst” will confirm this: its sequence number will be lower than that of the previous transmission.
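The same reasoning can be turned into a rough measurement of the recovery delays described earlier. This Python sketch assumes a list of data packets from the sender and SACK packets from the receiver, already extracted from a capture of a single connection; `Event`, `Recovery`, and `find_recoveries` are hypothetical names. A data packet whose sequence number is lower than data already sent is treated as a retransmission, and the pauses immediately before and after it are recorded.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    time: float      # capture timestamp in seconds
    kind: str        # "data" (sender -> receiver) or "sack" (receiver -> sender)
    seq: int = 0     # TCP sequence number (data packets only)
    length: int = 0  # TCP payload length (data packets only)


class Recovery(NamedTuple):
    seq: int                    # sequence number of the resent segment
    sacks_seen: int             # SACKs observed since the previous recovery
    gap_before: float           # pause between the previous data packet and the resend
    gap_after: Optional[float]  # pause between the resend and the next data packet


def find_recoveries(events: List[Event]) -> List[Recovery]:
    recoveries: List[Recovery] = []
    highest_end: Optional[int] = None       # highest seq + length sent so far
    last_data_time: Optional[float] = None
    sacks = 0
    pending: Optional[Recovery] = None      # resend still waiting for its gap_after
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "sack":
            sacks += 1
            continue
        if pending is not None:
            recoveries.append(pending._replace(gap_after=ev.time - last_data_time))
            pending = None
        if highest_end is not None and ev.seq < highest_end:
            # Lower than data already sent: treat it as a retransmission.
            pending = Recovery(ev.seq, sacks, ev.time - last_data_time, None)
            sacks = 0
        end = ev.seq + ev.length
        highest_end = end if highest_end is None else max(highest_end, end)
        last_data_time = ev.time
    if pending is not None:
        recoveries.append(pending)  # capture ended before the stream resumed
    return recoveries


events = [Event(0.000, "data", 1000, 1000), Event(0.001, "data", 2000, 1000),  # 2000 is lost
          Event(0.002, "data", 3000, 1000), Event(0.003, "data", 4000, 1000),
          Event(0.050, "sack"), Event(0.051, "sack"), Event(0.052, "sack"),
          Event(0.300, "data", 2000, 1000),  # the recovered segment
          Event(0.550, "data", 5000, 1000)]
for r in find_recoveries(events):
    print(r.seq, r.sacks_seen, round(r.gap_before * 1000), round(r.gap_after * 1000))  # 2000 3 297 250
```

Comparing `gap_before` and `gap_after` with the path’s round-trip time (for example, keeping only recoveries where `gap_before > rtt`) picks out the cases where the sender hesitated far longer than the network delay explains.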
I will cover how a number of protocol analyzers fared in detecting this problem in my next blog post.