
October 26, 2007

Network Performance Management – Winners and Losers

Network Physics, a company in which venture capitalists invested some $55m, was essentially liquidated this week, going to OPNET for a bargain-basement $10m.

I have fond memories of Network Physics at Interop, with guys and gals running around in those loud “It’s Not the Network!” t-shirts.  The premise was that their packet-based NetSensory appliance monitors application flows on the network and distinguishes server vs. network delay.  I also worked with one of their brilliant engineers on the Apdex board of directors.

It looks like a loss for Network Physics (certainly the investors) but a win for OPNET. Near as I can tell, OPNET will ditch further development of their own performance management product, ACE Live (announced just last month), and instead go with NetSensory and rename it ACE Live.  Got it?

There are many players in the network performance management space with overlapping areas and niches.  For instance, Coradiant focuses on Web-based performance analytics.  NetQoS focuses on Application Performance Management, although they are beginning to broaden.  They also have some nice integration with Cisco WAAS (Wide Area Application Services).  Larger umbrella companies like Compuware and Computer Associates also have tools that watch application performance.  Compuware’s ApplicationVantage comes to mind (part of their Application Assurance portfolio) and CA has tools in this area focused on SLAs in the web space.

As with any acquisition, how well will the acquired technologies integrate into a much larger beast?  Track records have been mixed.  Some 25% of Cisco acquisitions don’t pan out.  Witness Network General’s failed comeback and acquisitions; now it’s NetScout’s problem.

Will larger companies lose their identity in this market, specifically application performance management/end-user satisfaction within the network performance space?   Or will smaller, but significant and fast growing players like NetQoS trump the big guys?
 
Speaking of NetQoS, how did they get so far ahead of Network Physics in the performance appliance biz?  I’d say partly that Network Physics suffered when shifting gears (like changing your college major mid-term) from shaping traffic to measuring the stuff and that NetQoS recognized early on the value of a feature rich (as in superb GUI) browser-based console coupled with a smart appliance, but I digress.

One thing I do know is that all the long, hard work over the years from Peter Sevcik (the NetForecast guy who started Apdex), pushing the importance of measuring the user experience, is coming front and center.

October 16, 2007

When is a standard not a standard?

As standards become more and more complex, they tend to include numerous options, which makes one wonder whether or not a standard is really a standard.

Take the controversial IEEE P802.11n/D2.00 Draft for instance, which was recently questioned in an interesting blog at lovemytool.com.  There are tons of options including the number of transmit and receive streams (i.e. MIMO operation), single or dual channel operation, various new data unit types, new ACKs, the list goes on.

Luckily, there are mandatory requirements as well as options. In fact, to obtain Wi-Fi Alliance 802.11n Draft 2.0 Certification, a device must implement a minimum set of mandatory capabilities specified in the IEEE draft.  Specifically, a device must implement 2 spatial streams in transmit mode, 2 spatial streams in receive mode, the A-MPDU and A-MSDU, and block ACK.  This simplifies things in that only a single 20 MHz channel in the 2.4 GHz band is required.   These minimum mandatory requirements roughly double the raw data rate over 802.11g in a single channel.  To gain the full benefit of 802.11n, a device may also implement the optional 40 MHz operational mode (that requires using the 5 GHz band in order to minimize interference with legacy b/g devices) as well as utilize additional spatial streams to boost throughput.
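The "roughly double" figure can be checked with back-of-the-envelope arithmetic. The sketch below is only a rough illustration, not anything taken from the draft text itself. It assumes the commonly cited 802.11n OFDM parameters: 52 data subcarriers in a 20 MHz channel, 108 in a 40 MHz channel, 64-QAM with rate 5/6 coding at the top modulation, and a 4.0 µs symbol (3.6 µs with the optional short guard interval). The function name is made up for the example.

```python
def ht_phy_rate_mbps(streams, channel_mhz=20, short_gi=False,
                     bits_per_subcarrier=6, coding_rate=5 / 6):
    """Raw 802.11n PHY rate in Mbps under the assumptions stated above."""
    data_subcarriers = {20: 52, 40: 108}[channel_mhz]
    symbol_us = 3.6 if short_gi else 4.0      # microseconds per OFDM symbol
    bits_per_symbol = data_subcarriers * bits_per_subcarrier * coding_rate * streams
    return bits_per_symbol / symbol_us        # bits per microsecond == Mbps


LEGACY_G_MBPS = 54  # top 802.11g rate, for comparison

two_streams_20 = ht_phy_rate_mbps(2)                                # about 130 Mbps
two_streams_40 = ht_phy_rate_mbps(2, channel_mhz=40, short_gi=True)  # about 300 Mbps
print(f"2 streams, 20 MHz: {two_streams_20:.1f} Mbps "
      f"({two_streams_20 / LEGACY_G_MBPS:.1f}x 802.11g)")
print(f"2 streams, 40 MHz, short GI: {two_streams_40:.1f} Mbps")
```

These are raw signaling rates; actual throughput will be lower once protocol overhead is counted.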

I would venture to say that a goal of the IEEE was to provide a minimal must-do set of requirements to gain at least some benefit over 802.11 b/g.  Beyond the minimal requirements, options allow vendors and customers to boost throughput depending on their environmental and legacy requirements.  This includes co-existence with 2.4 GHz devices as well as optional deployment in the less-interference prone, albeit shorter-range 5 GHz band.  The Wi-Fi Alliance will test these options as well, ensuring interoperability for vendors that choose to implement them.  In fact, the majority of 2.0-certified devices to date support at least 3 spatial streams.  In rough numbers, this gives us 150 Mbps for single channel operation, 300 Mbps for dual.  I think the dual channel controversy (see 802.11n Going Enterprise?) will be put to rest especially in light of Cisco’s recent 802.11n equipment rollout for dual channel support in the 5 GHz band – the first to be Wi-Fi Alliance certified for that option.

Thank goodness we have organizations like the Wi-Fi Alliance to keep us on the straight and narrow.

October 11, 2007

TCP Selective ACK (SACK) Packet Recovery Analysis: Part 2 - The Analyzer

My previous blog looked at some of the operational details of TCP SACK, both from a performance and analysis perspective.  I noted how packet retransmissions are reduced when the client performs selective acknowledgements but the server performance may still be impacted, lowering the overall throughput.  In this second of two parts I’ll comment on how things look from a protocol analyzer perspective.

One of the first things that should have been obvious from the discussion thus far is that analyzers should also be looking at SACK information from the client to determine whether or not retransmissions exist.  Unfortunately, most analyzers only look for retransmissions the old-fashioned way:  duplicate TCP sequence numbers.  If the packet was dropped prior to your analyzer insertion point into the network, retransmissions go undetected.

One must also be careful when seeing a TCP packet with a lower sequence number than the previous packet.  It is not necessarily an out of order packet.  This can easily be determined by looking for prior SACK information, which virtually all clients and servers in use today support by default.

Finally, it is of interest to see how efficiently a server is able to process and send the missing segment or segments.  The case I referred to in the previous blog showed server delay in segment recovery.

Let’s look at three analyzers in order by name: Observer (Network Instruments), OmniPeek (WildPackets), and Wireshark (the Ethereal replacement spearheaded by CACE).

Note:  I tested the latest shipping version of each analyzer as of this blog posting. The three I picked are all representative of today’s protocol analysis tools and all have some type of expert system.  I ran the same exact trace file (TCP packet loss as discussed in Part 1) through all three analyzers, using the .enc format.  The trace was originally captured from the client’s Ethernet, with the remote server located on the other side of a WAN.

Observer (12.1)

Observer reported no TCP events in the expert analysis summary. Yet in looking at the TCP retrans column at the far right, it did show 32 retransmissions.

[Screenshot: Observer expert summary and TCP retransmission column]

Bottom line:  With Observer, be sure to examine the retransmission column in the connection details for problematic application/TCP performance.  Be careful with the Connection Dynamics feature which identifies the packets as out of order, not as retransmissions like the expert.

OmniPeek (5.0)

OmniPeek has a unique expert event called “TCP Slow Segment Recovery” as you can see in the screen shot below.  Thus, OmniPeek does look at both the SACK information from the client followed by the respective recovered segments sent by the server.  The default delay setting is 250 milliseconds.  By using a little trick and lowering this to 0 milliseconds, we can catch all recovered segments.  Per the screenshot, the number of segments recovered is 30.

[Screenshot: OmniPeek “TCP Slow Segment Recovery” expert events]

Bottom line:  Slow segment recovery may or may not be a problem, but is a strong hint.  You need to analyze further to see if it goes with the flow or disrupts the flow.  One hint is to look at the max, min, and average throughput reported by the expert.

Wireshark (0.99.6)

As shown in the next screen shot, the Wireshark expert info correctly ‘Notes' that there are TCP Retransmissions (32 total) and 'Warns' about 'Previous Segment Lost' (30 total).  Thus, it took 32 retransmissions to recover 30 segments.

[Screenshot: Wireshark expert info showing retransmissions, previous segment lost, and duplicate ACKs]

Bottom line:  Look prior to the retransmissions to see if SACK is being used.  Also, the duplicate ACKs shown in the screen shot are not really duplicates per se.  As packets continue to stream in following a missing segment, the client continues to SACK each segment, incrementing the received block range.  Once the missing segment or segments are received, the client reverts to normal periodic ACKs.
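To see why those "duplicate" ACKs are not true duplicates, here is a small receiver-side sketch. It is a simplified model and not the behavior of any particular TCP stack: it ignores sequence number wraparound, and it lists SACK blocks in sequence order rather than most-recent-first as RFC 2018 describes. The function name `receiver_acks` is made up for the example.

```python
def receiver_acks(segments, start_seq):
    """Model a SACK-capable receiver.

    segments: (seq, length) tuples in arrival order.
    Returns one (ack_number, sack_blocks) tuple per arriving segment.
    """
    next_expected = start_seq
    held = []                      # out-of-order byte ranges as (left, right)
    acks = []
    for seq, length in segments:
        held.append((seq, seq + length))
        held.sort()
        merged = []                # coalesce touching or overlapping ranges
        for left, right in held:
            if merged and left <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], right))
            else:
                merged.append((left, right))
        # Advance the cumulative ACK over any range that is now contiguous.
        while merged and merged[0][0] <= next_expected:
            next_expected = max(next_expected, merged[0][1])
            merged.pop(0)
        held = merged
        acks.append((next_expected, tuple(held[:4])))  # at most four blocks
    return acks


# The segment at 1100 is lost at first and arrives last.
arrivals = [(1000, 100), (1200, 100), (1300, 100), (1400, 100), (1100, 100)]
for ack, blocks in receiver_acks(arrivals, start_seq=1000):
    print(ack, blocks)
```

In the middle three ACKs the acknowledgement number stays at 1100 while the SACK block's right edge grows, which is why an analyzer that compares only ACK numbers calls them duplicates.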

Conclusion

Hopefully I’ve whetted your appetite for probing further.  TCP Selective ACKs work well in reducing TCP retransmissions but watch out for how efficiently servers handle it.  This is but one example of how we can analyze better, smarter, deeper.

October 04, 2007

TCP Selective ACK (SACK) Packet Recovery Analysis: Part 1 - The Problem

One of the most common symptoms that we look for as evidence of dropped packets in a network is TCP retransmissions.  Virtually every protocol analyzer today will alert you when it detects a retransmission.

The recovery mechanism for dropped or delayed TCP packets has changed over the years.  The question is: Have analysis tools kept up?

A receiver’s TCP stack can address lost (or delayed) packets a number of different ways including:

  • Acknowledging up to and including the last TCP segment that has been contiguously received (i.e. there are no gaps in the received byte stream);
  • Sending “fast” duplicate ACKs immediately upon sensing a gap; or,
  • Using the selective acknowledgement (SACK) feature.

Acknowledging only the last “good” segment received (the first two aforementioned techniques) will cause the sender to back up and resend from that point forward.  Typically more than one packet has already been sent due to the TCP windowing mechanism, which allows multiple packets to be sent before an ACK is required.  This means that not only is the missing segment resent, but all subsequent segments as well, even if they were already received by their destination. The number of packets outstanding worsens as networks get faster and window sizes get larger (i.e. window scaling beyond 65K bytes).
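A quick calculation shows how the cost of backing up grows with window size. This is only a rough sketch: it assumes a full window of equal-sized segments is in flight and a 1460-byte MSS (typical for Ethernet, but an assumption here), and the function names are made up for the example.

```python
import math


def segments_in_flight(window_bytes, mss=1460):
    """Number of full-size segments a sender can have outstanding."""
    return math.ceil(window_bytes / mss)


def resend_cost(window_bytes, lost_index, mss=1460):
    """Segments resent after one loss: (backing up, selective resend).

    lost_index is the zero-based position of the dropped segment in the window.
    """
    in_flight = segments_in_flight(window_bytes, mss)
    go_back = in_flight - lost_index   # the lost segment plus everything after it
    selective = 1                      # only the missing segment
    return go_back, selective


print(resend_cost(65535, lost_index=0))        # classic 64K window
print(resend_cost(256 * 1024, lost_index=10))  # a scaled 256K window
```

With the classic 64K window, losing the first segment means resending all 45 segments in flight. With a scaled 256K window the same single drop can cost well over a hundred.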

There is a better way to recover lost TCP segments.  A client can use SACKs to inform the sender of all segments that have arrived via a sequence number range or block.  Up to four blocks can be acknowledged in one SACK packet.  Note that a receiver can only use SACKs if the sender indicates that it is a supported option.  You can check for this in the TCP SYN and TCP SYN-ACK packets with your analyzer, where one side will indicate to the other in the TCP options field that SACK is permitted.
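If you want to check the options by hand, the sketch below walks the raw TCP options bytes looking for the two option kinds RFC 2018 defines: kind 4 (SACK-permitted, seen in SYN and SYN-ACK) and kind 5 (SACK, carrying left and right edge pairs). It is a minimal illustration, not a full options parser, and the function name `parse_sack_options` is made up for the example.

```python
import struct


def parse_sack_options(options: bytes):
    """Return (sack_permitted, blocks) from the TCP options bytes.

    blocks is a list of (left_edge, right_edge) sequence number pairs.
    """
    sack_permitted = False
    blocks = []
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                  # End of Option List
            break
        if kind == 1:                  # No-Operation (padding), no length byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                      # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                      # malformed option, stop parsing
        body = options[i + 2:i + length]
        if kind == 4 and length == 2:  # SACK-permitted
            sack_permitted = True
        elif kind == 5 and len(body) % 8 == 0:   # SACK blocks, 8 bytes each
            for off in range(0, len(body), 8):
                blocks.append(struct.unpack("!II", body[off:off + 8]))
        i += length
    return sack_permitted, blocks


# Options from a SYN: MSS 1460, NOP, NOP, SACK-permitted
print(parse_sack_options(bytes.fromhex("020405b401010402")))
# Options from an ACK: NOP, NOP, one SACK block covering 1000-2000
print(parse_sack_options(bytes.fromhex("0101050a000003e8000007d0")))
```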

For more operational details and some complex scenarios, please refer to RFC 2018, “TCP Selective Acknowledgement Options”.

How well does SACK work?

Analysis of SACK in action proves that it definitely does the job in cutting down on the number of packets that are retransmitted.  However, I’ve noticed some caveats in SACK behavior as well as how certain network analysis tools (i.e. expert systems) report this behavior.

In theory, the SACK mechanism should also cut down on delay due to dropped packets.  The retransmitted packets should be streamed right into the regular flow from the sender without hesitation.

In practice, I’ve noticed that this is not always the case. For example, I examined a remote file transfer between a client and server over a WAN experiencing some packet loss. I noticed that whenever the server resent a packet due to a SACK from the client, the sender would often pause for up to several hundred milliseconds between the last good packet sent and the recovered packet (far longer than the round trip delay in the WAN). This was followed by a similar delay before the stream got going again.

How can you analyze these SACK recovery delays?  A good place to start is to employ a display filter to find TCP packets with SACK information in the header.  A quick and dirty way is to check for TCP headers longer than 20 bytes. As mentioned previously, SYN packets will advertise the sender’s capabilities and therefore will have longer headers. Thus, SYN packets can be excluded by the filter.
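The quick and dirty test can be written against the raw TCP header as follows. This is a sketch of the logic only, not the filter syntax of any particular analyzer, and the function name is made up for the example. It reads the data offset from the upper four bits of byte 12 and the SYN flag from byte 13. Keep in mind that other options, such as timestamps, also lengthen the header, so a match is a candidate rather than proof of SACK.

```python
def has_options_not_syn(tcp_header: bytes) -> bool:
    """True if the TCP header is longer than 20 bytes and SYN is not set."""
    header_len = (tcp_header[12] >> 4) * 4   # data offset is in 32-bit words
    syn = bool(tcp_header[13] & 0x02)        # SYN is bit 1 of the flags byte
    return header_len > 20 and not syn


def make_header(data_offset_words, flags):
    """Build a dummy header with only the fields this check reads (test helper)."""
    hdr = bytearray(data_offset_words * 4)
    hdr[12] = data_offset_words << 4
    hdr[13] = flags
    return bytes(hdr)


print(has_options_not_syn(make_header(5, 0x10)))  # plain ACK, no options
print(has_options_not_syn(make_header(8, 0x10)))  # ACK carrying 12 bytes of options
print(has_options_not_syn(make_header(8, 0x02)))  # SYN with options, excluded
```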

The reason I suggest using a display filter and not capture filter is because this is a situation where you’ll want to capture all TCP packets between a client and server and then apply some post-capture analysis.  If you only capture packets with SACK information, you’ll see which packets needed to be retransmitted but you won’t be able to deduce how long the sender took to recover and actually resend the packets. You may also wish to trigger a capture on a SACK packet-- chances are if you see one, you'll see more.

If the analyzer is capturing close to the source, you’ll probably also notice TCP retransmissions from your expert system or duplicate TCP sequence numbers if you like doing things by hand (be sure to check that the IP ID is not the same in the repeated packet, which can happen when analyzing VLANs off a SPAN port). One pitfall is that you will not see TCP retransmissions flagged as such if the packets were dropped prior to the segment from which you are capturing.

What then?  Focus on packets that contain SACK information.  You will typically see a few SACKs with a widening range of bytes received until the missing segment or segments are received.  When the SACKs stop, you know that the segment(s) in question have been sent.  Checking the sequence number of a packet following a “SACK burst” will confirm this.  The sequence number will be lower than the previous transmission.
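The same procedure can be sketched in code for a trace that has already been decoded into simple records. Everything here is hypothetical scaffolding: the `Pkt` record layout and the function name are made up for the example, times are in milliseconds, and sequence number wraparound is ignored. The idea is to remember when the client first SACKed around a hole (the hole starts at the ACK number), then time how long the server took to send a segment starting at that sequence number.

```python
from collections import namedtuple

# Hypothetical decoded-packet record: time in ms, direction, seq, ack, SACK blocks
Pkt = namedtuple("Pkt", "time from_server seq ack sack")


def sack_recoveries(packets):
    """Return one dict per recovered segment with the observed delays."""
    pending = {}            # hole start seq -> time of first SACK reporting it
    highest_seq = -1        # highest server sequence number seen so far
    last_data_time = None   # time of the previous server data packet
    results = []
    for p in packets:
        if not p.from_server:
            if p.sack:      # client is reporting a gap starting at its ACK number
                pending.setdefault(p.ack, p.time)
            continue
        if p.seq in pending and p.seq < highest_seq:
            results.append({
                "seq": p.seq,
                "since_first_sack": p.time - pending.pop(p.seq),
                "since_last_data": p.time - last_data_time,
            })
        highest_seq = max(highest_seq, p.seq)
        last_data_time = p.time
    return results


trace = [
    Pkt(0, True, 1000, 0, ()),
    Pkt(10, True, 3920, 0, ()),                  # segment at 2460 was dropped
    Pkt(20, False, 0, 2460, ((3920, 5380),)),    # first SACK around the hole
    Pkt(21, True, 5380, 0, ()),
    Pkt(30, False, 0, 2460, ((3920, 6840),)),    # SACK range widens
    Pkt(330, True, 2460, 0, ()),                 # recovered segment, lower seq
]
print(sack_recoveries(trace))
```

In this made-up trace the recovered segment shows up 310 ms after the first SACK and 309 ms after the server's last data packet, the kind of stall described above.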

I will cover how a number of protocol analyzers fared in detecting this problem in my next blog.