April 08, 2008

Pilot Sneak Preview: A New Direction in Network Analysis?

Build a better analysis front-end and they will come. That’s what CACE Technologies hopes to achieve with its Pilot visualization and reporting tool (expected to be announced on or prior to 4/18). Pilot (named after the fish that "congregate around sharks, rays, and sea turtles, where it eats parasites on and leftovers around the host species" according to Wikipedia) was previewed at the Wireshark developer’s conference last week. I was fortunate enough to get my hands on a beta.

In my opinion, established vendors had nothing to fear from Wireshark. Build a superior expert system, high performance capture and aggregation hardware, easy to use distributed tools and data mining and you have a winner. That is, until now.

CACE is the first commercial vendor to truly embrace Wireshark as a platform while other vendors stood back in fear. Why are they afraid? Why not embrace open source rather than try to hide it as others have done? An example of such hiding is incorporating so-called third-party decodes from Wireshark’s predecessor, Ethereal.

Pilot is different the moment you fire it up.  Notice in the screen shot below, the modern GUI and the ability to learn several aspects of the product via a series of short videos.  I'd love to see other vendors follow this refreshing approach.

Pilot_overview_3

Pilot is more than just a pretty face. It also serves as a data mining tool to cull data from a large number of Wireshark files. In a recent situation, I had an analyzer-less customer deploy a number of Wiresharks at several suspected problem areas in their network for long-term capture to disk. We were then able to go back and manually mine data from several capture points when a particular event occurred and zero in on the problem. With Pilot, we can now bring those long-term capture files together to assist in the mining and analysis process.

At the heart of the product is a Google Finance-like chart that slides across statistics collected from one or more packet traces, shown in the screen shot below. The highlighted part is a section I selected by hand to "send to Wireshark" for deep packet inspection. Pilot leaves not only the packet decodes but all packet display functions to Wireshark, a departure from other vendors that merely grabbed the Wireshark decoders. Pilot can also take advantage of WinPcap and AirPcap to grab real-time wired and wireless packet-derived data.

Pilot_graph_2

There are other goodies in the interface, like dragging and dropping a view (such as top MAC or IP sources, conversations, bandwidth by bytes or packets, and so on) on top of your selected file(s) or a section of a graph.  For instance, perhaps you only want the IP Conversations view for the highlighted portion in the bytes per second graph in the above screenshot. Merely drag the view from the selection tree on the left-hand side over the highlighted part in the graph and instantly see the conversations only for that time span. Way cool.

Linux users are out of luck though – this is a Windows only product built using Microsoft Visual Studio tools, as clearly evidenced by the Office 2007 ribbon interface. Frankly, when I first used Office 2007, I didn’t like the new interface as I was used to using previous versions. Once I forced myself to learn it however, I felt that it was superior (who says you can’t teach an old dog new tricks?). As such, I felt right at home with Pilot.

There are a couple of improvements I'd like to see, however. For instance, you can "send" a statistic or part of a graph, such as one or more parts of a histogram (using multi-select) for top talkers (sources) to Wireshark for deep packet inspection. Unfortunately, you only see one-way packet streams from those source addresses. I’d love to see a feature pioneered by WildPackets with its Select Related feature, and imitated by others as a "quick filter", that lets me select either the source alone or the source and its peers, so I can follow the flows. Analyzing one-way top talkers at the packet level makes sense for broadcasts, but less so for unicast traffic.
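To make the one-way versus two-way distinction concrete, here is a minimal sketch of how the two kinds of Wireshark display filter could be built from a list of top talkers. The helper name `talker_filter` and the addresses are made up for illustration, and nothing here describes how Pilot itself builds the filter it hands to Wireshark. The sketch relies only on the standard display filter fields `ip.src` (source address only) and `ip.addr` (source or destination).

```python
def talker_filter(sources, both_directions=True):
    """Build a Wireshark display filter for a list of IP addresses.

    Hypothetical helper: ip.src matches only packets sent by the address,
    while ip.addr matches packets sent by it or addressed to it.
    """
    field = "ip.addr" if both_directions else "ip.src"
    return " || ".join(f"{field} == {ip}" for ip in sources)


# Example addresses are invented for illustration.
top_talkers = ["10.0.0.5", "10.0.0.9"]

one_way = talker_filter(top_talkers, both_directions=False)
two_way = talker_filter(top_talkers)

print(one_way)
print(two_way)
```

The second filter also keeps the replies sent to each talker, which is what you need in order to follow a unicast flow in both directions.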

There's more to the product, including a number of output options for reporting in a variety of formats from PDF to Excel.  Watch for the announcement and check out a demo.

I was thinking it would be interesting if CACE supported more than just the Wireshark analyzer. Despite claiming to be integrated with Wireshark, it really boils down to passing a portion of one or more trace files as a capture source along with a filter to Wireshark. Why not support the same for other analyzers? On second thought, that could cause some serious heartburn for competing vendors.

With over 300,000 Wireshark downloads per month, users will finally have a real tool to go hand-in-hand with it and help ease some of their analysis pains. One question that comes to mind is how many users of a free open source tool will be willing to pay real money for Pilot at $1,295 a pop including maintenance (the projected introductory pricing)? Only time will tell.

Meanwhile, by feasting on those morsels surrounding the Wireshark community, Pilot could prove to be an industry disruptor, even more so when the distributed version becomes available.

March 11, 2008

A Tale of Five Analyzers

How do packet analyzers stack up in detecting and reporting the simplest and most fundamental indication of an anomaly, the venerable TCP retransmission?  I recently looked at five tools ranging from free to $80,000.  I did this because I saw something suspicious as reported by the big gun.

In my book I recommend that anytime you see suspect reporting or diagnostics, that you verify them at least once the hard way – by hand – so that from then on you know whether they are accurate under those circumstances.

In this particular case, the "AppDoctor" of the high-end tool was informing me that “There are many packet retransmissions.  The network may be heavily congested, or there may be an error-prone link.”  The value it gave me was over 5% which is certainly of concern.  That’s 50+ packets out of every 1,000.

I didn't suspect at the time that the network path was congested and didn't want to chase down duplex mismatches and the like, so I wanted a second opinion and ran the trace through the free tool.   It came up with all of 3 retransmissions per 1,000 packets, or roughly 1/3 of a percent.

Why the large difference of opinion?  Shouldn’t packets be cut and dried, i.e. factual?  Before answering the question, I’d like to point out some of the various idiosyncrasies in how a number of analyzers report TCP retransmissions.

As you may have guessed, the free tool is Wireshark.  The $80,000 tool, Opnet’s ACE.  I also ran the trace through the latest Network Instruments Observer, WildPackets OmniPeek, and Network General Sniffer analyzers and focused on one section where the Opnet ACE diagnosed 72 retransmissions out of some 1,100 packets.

Observer reported the same high retransmission count.  It tried to be extra helpful in noting that they were also of the “too fast retransmission” variety (at or below 180 ms by default) and that they were “excessive”  (2% or more of the total packets by default for the critical level).  That would have been a great diagnosis if it had been correct.  More on this later.

The Sniffer values were a little strange, depending on whether you are looking at the number of Sniffer symptom objects or the packet summary decodes.   The summary decodes contain three “Expert: Retransmission” notifications, yet the tally in the expert summary lists only two, possibly due to a grouping by TCP flow/conversation (i.e. two of the three retransmissions were in the same TCP flow).

So the individual packet retransmission counts for each analyzer were:

  • Opnet: 72
  • Wireshark: 3
  • Observer: 72
  • OmniPeek: 3
  • Sniffer: 3

As I used to say in my classes, shall we just average the results and call it a day?  Not!

The right answer, verified manually, is three, meaning Wireshark, OmniPeek and Sniffer were correct in this particular scenario.  That’s not to say that these tools are always correct in every situation - they aren't.  Again, the purpose of this exercise is to verify your data.  I'm not picking on any particular tool.
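To put those counts in proportion, the short sketch below turns each one into a percentage of the section examined. It assumes exactly 1,100 packets, although the text only says "some 1,100", so the percentages are approximate. The 2% figure is the Observer default critical level quoted earlier.

```python
TOTAL_PACKETS = 1100       # assumption: "some 1,100 packets" taken as exact
CRITICAL_THRESHOLD = 2.0   # percent; Observer's default critical level

counts = {
    "Opnet": 72,
    "Wireshark": 3,
    "Observer": 72,
    "OmniPeek": 3,
    "Sniffer": 3,
}


def retransmission_rate(count, total=TOTAL_PACKETS):
    """Return retransmissions as a percentage of all packets."""
    return 100.0 * count / total


rates = {tool: retransmission_rate(n) for tool, n in counts.items()}

for tool, pct in rates.items():
    verdict = "excessive" if pct >= CRITICAL_THRESHOLD else "ok"
    print(f"{tool}: {pct:.2f}% ({verdict})")
```

A count of 72 lands well above the 2% critical level, while the manually verified count of three is far below it. That gap is the difference between chasing a congestion problem and not chasing one.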

The reason for the large number of false positives in Opnet and Observer was their misinterpretation of the TCP close connection sequence.

A graceful TCP close (i.e. not a RST or reset) is a four-packet TCP FIN sequence consisting of a FIN followed by an ACK to close one half of the connection, and then another FIN-ACK pair closes the second half of the connection to bring it to a full close (or in short FIN-ACK, FIN-ACK) as shown in the following figure (from Network General Sniffer).

Normal

Textbook TCP Close

In the trace in question, I noticed a different close sequence:  FIN one direction, FIN the other direction, ACK the FIN, ACK the FIN (or in short, FIN-FIN-ACK-ACK). Also unusual was the fact that the FIN bit was set in the ACK from the server.  The following shows the alternate TCP close.

Abnormal_2

Non-Textbook Close

Observer and Opnet apparently are fooled into thinking that the final TCP ACK packet is a retransmission since the FIN bit is set again (which is irrelevant as the connection is already closed) and the TCP sequence number matches the previous FIN packet.  Sniffer et al. do not report them as retransmissions because they are simply the last of the four packets in a TCP FIN close sequence.
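The difference between the two interpretations can be sketched in a few lines. This is only an illustration of the reasoning above, not how any of the analyzers named here is actually implemented. The `Segment` type, the `count_retransmissions` function, and the sequence numbers are all hypothetical, and the matching is deliberately simplified to exact sequence numbers rather than overlapping byte ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    sender: str
    seq: int
    payload_len: int
    flags: frozenset


def seg(sender, seq, payload_len, *flags):
    """Convenience constructor for the hypothetical Segment type."""
    return Segment(sender, seq, payload_len, frozenset(flags))


def count_retransmissions(segments, count_bare_fin):
    """Count segments that repeat an already-seen sequence number.

    count_bare_fin=True  mimics a tool that flags a repeated FIN even
                         when the segment carries no data.
    count_bare_fin=False only flags repeats that carry payload.
    """
    seen = set()  # (sender, seq) pairs that already used sequence space
    hits = 0
    for s in segments:
        uses_seq_space = s.payload_len > 0 or "FIN" in s.flags or "SYN" in s.flags
        if not uses_seq_space:
            continue  # a pure ACK is never counted
        key = (s.sender, s.seq)
        if key in seen:
            if s.payload_len > 0 or count_bare_fin:
                hits += 1
        else:
            seen.add(key)
    return hits


# FIN-FIN-ACK-ACK close; the server's final ACK has the FIN bit set again.
# Sequence numbers are invented for illustration.
close = [
    seg("client", 1000, 0, "FIN", "ACK"),
    seg("server", 5000, 0, "FIN", "ACK"),
    seg("client", 1001, 0, "ACK"),
    seg("server", 5000, 0, "FIN", "ACK"),  # same seq as the server's first FIN
]

naive = count_retransmissions(close, count_bare_fin=True)
strict = count_retransmissions(close, count_bare_fin=False)
print(naive, strict)
```

With hundreds of short sessions each ending this way, a counter that flags the repeated FIN would report one false positive per close. That is consistent with the inflated totals described above.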

This particular application used hundreds of such TCP sessions and subsequent closes to transfer a relatively small amount of data, another problem in itself.

The lesson is that when in doubt, seek a second opinion from another tool or roll up your sleeves and perform manual analysis on a small test section of packet data to confirm your suspicions.

February 01, 2008

Live Webinar and Survey Reveals Wireless Secrets

I recently attended a live Cisco Mobility TV webinar co-sponsored by AirMagnet entitled "Designing and Deploying 802.11n Next-Generation Wireless." Apparently it was a big hit; according to the moderator, a record "thousands" of viewers logged in to watch it. Here’s what I thought were a couple of interesting takeaways.

Drop in Replacement or New Site Survey Required?

A Cisco representative started out by recommending a 1-for-1 access point replacement of legacy APs giving priority to performance over coverage. In other words, swap in a Cisco dual band 1250 AP to handle both legacy 802.11bg devices with the same coverage pattern as before while providing 802.11n access (in the 5 GHz band) for new 802.11n clients.

I thought this was a bit strange since 802.11bg does not like reflections whereas 802.11n using MIMO thrives on them. APs must be relocated accordingly. Later in the Webinar, the AirMagnet guy noted that an active site survey using a laptop with a live 802.11n client adapter is required to figure out the optimal 802.11n AP placement to take advantage of multipath. This seemed to contradict the Cisco 1-for-1 forklift strategy.

AirMagnet also recommended surveying with more than one 802.11n client adapter type if you are supporting more than one brand. I think this is a good idea since, unlike 802.11bg, Cisco does not provide a client side adapter.

Upping the Power Requirements

Perhaps the most controversial aspect of an 802.11n wireless LAN upgrade is the additional power requirements for dual band APs. Cisco claims that their enhanced PoE is the only viable single port solution for dual radio operation. They stated that dual PoE is twice the cable cost, 4x the cost to pull the cable, and uses more switch ports. They also noted that competitors supporting dual band operation over standard 802.3af PoE do so at reduced transmission power.

Luckily, CPUs are not the only chips going green. Witness the recent announcement from Siemens that generated a flurry of online articles discussing the 802.11n power controversy. Siemens managed to cut 3W off a full 802.11n MIMO (roughly 600 Mbps using transmission over 3 antennas with 3 streams each) AP running at maximum radio output in both the 2.4 GHz and 5 GHz bands simultaneously and yet operate over standard PoE.

Audience Survey

Having a captive audience of thousands, Cisco conducted three polls during the Webinar. Here are the questions and results.

"How do you expect to benefit from the use of 5 GHz with 11n?"

Surveywhy_5ghz

No surprises here. Only 5% claim that they will not use 5 GHz for 11n, vindicating the use of 5 GHz in the enterprise.

"What do you see as the biggest inhibitor to 11n adoption?"

Surveyinhibitors

Does the undetermined business need come from those not deploying wireless at all, or are their needs currently met by 802.11bg? Also note the lack of a warm fuzzy for the current 802.11 draft 2.0 standard.

"How do you plan to power your 802.11n access points?"

Surveypowering_11n

Not much to add here. Looks like the largely Cisco audience prefers enhanced PoE.

So there you go. Some inside info and survey results from a wildly popular wireless webinar.

January 16, 2008

Optimize Your Web Site with YSlow

Some of the best free tools in life are the niche ones such as this nifty little utility available from the Yahoo! Developer Network called YSlow (a nice play on Why Slow). You'll need to install Firefox along with the Firebug web development tool to run it, but it's definitely worth it. I'll first describe a controversial aspect of the tool, then get on to the interesting stuff.

The Controversial Part

YSlow will analyze a web page and assign a grade of A through F based on 14 criteria. For a complete list of the criteria, check out "Best Practices for Speeding up Your Web Site." Some of the grading criteria are straightforward, like the number of HTTP requests or turns (the fewer the better). Others are more complex and designed for larger enterprises, such as having a Content Delivery Network (CDN) like the one Akamai provides for the likes of Apple, Dell, and IBM. Thus if you have a small to medium sized business, you'll get dinged for not having a CDN.

You'll also get nicked for not using something known as "far future Expires header", GZIP components (even for small objects), and ETags. ETags allow web servers and browsers to figure out if an object in the user's cache matches the one on the server. This is somewhat controversial as even the YSlow developers admit that they are "unique to a specific server hosting a site." Please check out the aforementioned "Best Practices" for more details.
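For readers unfamiliar with these three techniques, the sketch below shows roughly what each one amounts to on the wire. The helper names (`caching_headers`, `not_modified`) and the sample objects are hypothetical. Deriving the ETag from a hash of the content is an assumption of this sketch, chosen as one way to get a tag that does not depend on which server produced it. It is not something YSlow or any particular web server prescribes.

```python
import gzip
import hashlib
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def caching_headers(body: bytes, now: datetime, lifetime_days: int = 365) -> dict:
    """Response headers for a static object (hypothetical helper)."""
    expires = now + timedelta(days=lifetime_days)
    # Assumption: a content hash is used so that every server hosting
    # the same object produces the same ETag.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    return {
        "Expires": format_datetime(expires, usegmt=True),  # far-future date
        "ETag": etag,
    }


def not_modified(request_headers: dict, etag: str) -> bool:
    """True when the browser's cached copy matches (server may answer 304)."""
    return request_headers.get("If-None-Match") == etag


now = datetime(2008, 1, 16, tzinfo=timezone.utc)
logo = b"GIF89a"  # stand-in for a tiny image
headers = caching_headers(logo, now)
revalidation = {"If-None-Match": headers["ETag"]}

print(headers["Expires"])
print(not_modified(revalidation, headers["ETag"]))

# GZIP pays off on large, repetitive text but can enlarge tiny objects,
# because the gzip container itself adds a fixed header and trailer.
small = b"GIF89a"
large = b"body { margin: 0; } " * 500
print(len(small), len(gzip.compress(small)))
print(len(large), len(gzip.compress(large)))
```

The last two lines show why being graded down for not compressing small objects is debatable: for a few bytes of data the compressed form comes out larger than the original.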

Naturally, the Yahoo! Web site gets an "A" grade, which is interesting considering there's far more junk on the page than on a lightweight page like Google. Yahoo! used to be lightweight, but thanks to Expires headers and cache priming (i.e. the biggest hit is when you visit the site for the first time), they manage a good score and load very quickly on subsequent visits. One area that could be improved, as YSlow points out, is to reduce the number of HTTP requests.

Just for fun, I decided to see what the YSlow score was from the very vendors that sell stuff to help us to troubleshoot our network and application performance. It surprised me that the web home pages from companies like Fluke Networks, OPNET, Network General, Network Instruments, NetQoS, NetScout, Niksun, and WildPackets all received a grade of D or F.

Embarrassing? Perhaps. Even without a CDN, one can get much better than a failing grade by employing some of the other techniques. Keeping your pages fairly lightweight and minimizing the number of HTTP requests can go a long way towards improving the score. Read on.

The Good Stuff

YSlow will show the total bytes and total time to download a page. For instance, the OPNET website weighed in at a whopping 501k bytes and, most surprising, took 4.0 seconds to download even over a fast 8 Mbps connection. These timings were with a primed cache (i.e. I merely did a page refresh after already visiting the site). Why so long? YSlow to the rescue.

YSlow will show you the timing for each object downloaded, for each TCP session that the browser supports (both IE and Firefox support two TCP sessions to a Web server by default). The following screenshot of the YSlow "Net" tab gives you an idea of what this looks like. (In the interest of size and readability, I'm only showing you a portion of the web page and timing analysis.)


Opnet1_2

YSlow Component Load Time Analysis

You can expand the information for each component, including the source code from where it came and even see what the object (such as a gif image) looks like in isolation in another browser window. Each component (HTML, CSS, images, Flash, etc.) can be inspected in detail.

A closer look at the timing analysis revealed that the web page has a large number of HTTP requests including nine external JavaScript files. In all, there were 40 HTTP requests generating 508k bytes worth of traffic that make up this web page, yet roughly 501k bytes are downloaded every time even with a primed cache. Why?

YSlow revealed that only 7k bytes worth of objects are loaded from the user's browser cache. The biggest culprit was the flash animation, weighing in at 477k bytes and downloaded each and every time. This, and the large number of HTTP turns (37 according to a packet capture) coupled with round trip network delay, leads to the longish four second load time for the page even for just a refresh.
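A back-of-the-envelope estimate shows how turns and round trip delay add up. The sketch below is a rough lower bound, not a model of what the browser actually did. It uses the 37 turns, 501k bytes and 8 Mbps figures from this example along with the 56 ms round trip delay measured for it. It assumes "k" means 1,000 bytes, treats each turn as costing one full round trip, and ignores server processing time, TCP slow start and rendering. The function name `estimate_load_seconds` is made up.

```python
def estimate_load_seconds(turns, rtt_ms, payload_bytes, link_bps, parallel=1):
    """Rough lower bound on page load time (hypothetical helper).

    Each turn is charged one round trip; turns are divided evenly across
    `parallel` connections. Transfer time is payload size over link speed.
    """
    turn_delay = (turns / parallel) * (rtt_ms / 1000.0)
    transfer = payload_bytes * 8 / link_bps
    return turn_delay + transfer


# Figures from the example; 501k bytes assumed to mean 501,000 bytes.
serial = estimate_load_seconds(37, 56, 501_000, 8_000_000)
two_sessions = estimate_load_seconds(37, 56, 501_000, 8_000_000, parallel=2)

print(f"one connection:  {serial:.2f} s")
print(f"two connections: {two_sessions:.2f} s")
```

Even in the serial case the estimate comes to about 2.6 seconds, so round trips and raw transfer account for well over half of the observed four seconds. The remainder presumably comes from the factors this sketch leaves out.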

How about after clearing the browser cache? Does it get worse? Surprise! There was little real difference. I saw one more turn count and a bit more data. The HTTP turn count is there regardless to check the browser cache against the server and/or download the objects, most of which were tiny gif images or small JavaScript files so it didn't make a whole lot of difference in this case.

Furthermore, the round trip delay (56 ms in this example) starts to become a factor. Fewer turns would help. It also wouldn't hurt to optimize the flash. I'll leave the rest of the web page optimization as an exercise for the student. As a hint, not all web pages that use animation are penalized. The WildPackets home page, for instance, went from 471k to 47k when cached (the next figure shows how YSlow gives you these stats). The animation stays cached. The number of objects and subsequent HTTP requests, however, is quite high at 62. That's a lot of turns and should be optimized.


Wildpackets

YSlow Cache Analysis: Note the drop in bandwidth, but high turn count.

Conclusion

Let's not get complacent just because we can throw more bandwidth at it; i.e. the Internet pipes are getting fatter all the time. Web sites can and should be optimized. If everyone did so we could save, oh, perhaps a few billion bytes worldwide every minute or so. Even corporate intranets need to consider web tuning to best serve their users.