Introduction
Humans have a natural ability to correlate patterns from multiple sources in order to extract information. Most current human-computer interfaces, however, are limited to visual and, more specifically, text-based interaction. While it is true that current computer systems use a large number of symbolic elements and metaphors in the Graphical User Interface (icons, windows, drag and drop, and so on), the primary means of information exchange is still the written (or, in the case of computers, typed) word.
This approach, while having the advantage of simplicity, is very limiting from an information-bandwidth point of view (the rate at which information can be exchanged between the user and the computer). Because typing and reading have an upper limit (estimates vary from 40 to 250 words per minute for typing and from 250 to 700 words per minute for reading, but it is clear that such limits do exist and that most people cannot cross certain barriers, no matter the amount of training), it is logical to conclude that we must look for alternative ways to consume more information faster. From a technical point of view, the two methods which are easiest to implement with current hardware are:
- alternative (non-textual) representation of (aggregated) data, making use of our ability to observe trends and patterns
- auditory feedback for certain events
Both of these methods can easily be implemented with almost all consumer-grade computing equipment. The fact that both are “output” methods (from the computer's point of view) is justified by the observation that the rate at which output is produced (and needs to be consumed) is much larger than the rate of input we need to provide (two of the reasons being that a great deal of output is produced by automated systems, and that in today's interconnected world we consume the sum of many outputs). In the following pages I will focus on the first method and, more specifically, on the way it applies to networking data.
Each section (after the introductory one about data visualization) will present one way to visualize data and discuss its applicability in the context of computer networking.
Data Visualization
Data visualization is defined as “any technique for creating images, diagrams, or animations to communicate a message” [i]. In general, data visualization is used as a way to aggregate large quantities of data and present them in a way which [ii]:
- allows us to quickly communicate rich messages (communication)
- allows for discovering new, previously unknown facts and relationships (discovery)
- allows us to get better insight into things we already know (insight)
This list is usually augmented with a fourth element: providing an aesthetically pleasing way to present information. While this is an important goal under certain circumstances (for example, in newspapers trying to popularize a certain fact), it is not a common requirement in the field of networking / security, thus only one example will be presented:
Illustration 1: A visual representation of a sample from the PWS-Lineage family by Alex Dragulescu
The image above is one of the many [iii] pictures by Alex Dragulescu based on malware samples. From the author's site:
“Malwarez is a series of visualization of worms, viruses, trojans and spyware code. For each piece of disassembled code, API calls, memory addresses and subroutines are tracked and analyzed. Their frequency, density and grouping are mapped to the inputs of an algorithm that grows a virtual 3D entity. Therefore the patterns and rhythms found in the data drive the configuration of the artificial organism.”
While they provide no additional information about the code or its workings, they are aesthetically very pleasing. Aesthetics and usefulness do not necessarily exclude each other, as can be seen from the picture below. However, this article focuses on automatic visualization of automatically gathered data, and while some aesthetic qualities (foreground-background colors, for example) can be provided either manually or programmatically, creative visualizations such as the one below cannot be automatically generated (nor would we want them to be, because one of the purposes of visualization is to help answer queries by displaying different parts of the data, a process in which uniformity is key for the observer to be able to detect patterns and infer information).
Illustration 2: Charles Minard's information graphic of Napoleon's march
Finally, before getting to the different types of visualization methods, I would like to mention two criteria [iv] which are considered essential in creating visual representations of data:
- Maximize the data to ink ratio
- First, tell the truth and second, do it efficiently, with clarity, precision...
While these are essential requirements for any visualization, computer-generated charts are less likely to exhibit either of the two pitfalls. This is because computer graphs are by definition minimalistic (mostly because it is very hard to define rules which could produce complex yet aesthetically pleasing graphics from arbitrary sets of data) and because (with regard to the second point) visualization programs only concern themselves with presenting the data, not with how “good” data should look. (This is why dubious tactics like not presenting the whole range, using non-linear scales, filtering the data and so on get used – by humans who have a clear idea of how “good” data should look, what it should represent, or what point they are trying to make.) Another reason which makes such problems unlikely is that visualization in this context is (or will be) used for decision making, and any software which produces misleading visualizations, which in turn translate into wrong decisions, won't see much use.
The only problem with computer visualizations is that they can be too coarse or too detailed, which can lead to wrong conclusions (see the discussion about bar charts / histograms later on for an example). This arises because, generally speaking, the computer cannot guess the level of detail at which the user needs the data. The problem can be resolved by giving the user the possibility to quickly change the level of detail for a given chart she is looking at or, even better, if the program has specialized knowledge about the intended use of the data (for example, that it will be used to generate blocklists of /24 subnets for firewalls), by setting the level of detail appropriate for this scope (groups of /24 subnets in our example). Or, to maximize both flexibility and usability, make it possible for the user to change the level of detail, but set the default accordingly.
Methods for visualizing data
Line chart
A line chart is a two-dimensional scatterplot of ordered observations where the observations are connected following their order [v]. It can be used to represent both two-dimensional and one-dimensional data (in the latter case the second dimension defaults to the measurement unit of the data – time, length, etc.). The data points can be connected or unconnected. Unconnected usually means that the measured data is discrete, while connected by convention means continuous, although sometimes the connecting line is drawn for aesthetic reasons or to facilitate drawing conclusions (observing trends, for example). Drawing the points on the continuous line by convention means that the represented quantity is continuous, but was only measured at the marked points.
An important issue is also the formula used to connect the points (where applicable). This must be based on the knowledge we have about the nature of the measured quantity (does it change linearly? logarithmically? exponentially? if so, what is the base of the exponential function? and so on). Two common functions used for interpolation are the linear function (because of its simplicity) and splines [vi].
Another common practice is to color the area between the line and the X axis to make the chart easier to read, especially when multiple data series are represented at the same time.
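To make the interpolation point concrete, here is a minimal Python sketch (my own illustration, not taken from any of the cited tools; the traffic samples are invented) that draws the same measurements once with straight linear segments and once with a cubic spline, and shades the area under the line:

# A minimal sketch: sampled measurements as a line chart, comparing
# linear interpolation with a cubic spline (numpy, scipy, matplotlib).
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline

# Hypothetical measurements: traffic samples taken every 5 minutes
t = np.arange(0, 60, 5)                    # minutes
mbps = np.array([10, 12, 9, 14, 30, 28, 26, 15, 11, 10, 9, 12])

dense_t = np.linspace(t.min(), t.max(), 300)
spline = CubicSpline(t, mbps)

plt.plot(t, mbps, "o-", label="linear interpolation")      # straight segments
plt.plot(dense_t, spline(dense_t), "--", label="cubic spline")
plt.fill_between(t, mbps, alpha=0.2)       # shade the area under the curve
plt.xlabel("time (minutes)")
plt.ylabel("traffic (Mbit/s)")
plt.legend()
plt.show()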
Illustration 3: Network appliance IOPs per protocol generated with rrdtool (http://oss.oetiker.ch/rrdtool/gallery/index.en.html)
Miniature versions of line charts embedded in the text to better illustrate a trend or pattern are called sparklines [vii]. They don't include legends, axes or numeric values; their sole purpose is to illustrate a trend or pattern without giving exact numeric details.
Illustration 4: Sparkline example, from Wikipedia (http://en.wikipedia.org/wiki/Sparkline at 02-05-2008)
Scatterplot
The scatterplot is the graphical representation of two quantitative variables. When one of the variables is controlled by the experimenter (for example, linearly increasing time), it is customarily drawn on the horizontal axis and the result is a line chart. However, when both variables are a result of the observation, the graph shows correlation (not to be confused with causality) between the variables. In this case it becomes possible for the observer to discover correlation patterns between the variables (like positive or negative linear correlation).
Illustration 5: Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. From Wikipedia - http://en.wikipedia.org/wiki/Scatterplot - at 02-05-2008
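As a small illustrative sketch (my own, with invented flow data, not taken from any of the cited tools), the following Python fragment draws a scatterplot of two observed quantities and prints their Pearson correlation coefficient:

# A minimal scatterplot sketch of two observed variables plus their
# linear correlation coefficient (numpy, matplotlib).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
packets = rng.uniform(100, 1000, 200)                 # hypothetical flow sizes
bytes_ = packets * 900 + rng.normal(0, 30000, 200)    # roughly correlated values

r = np.corrcoef(packets, bytes_)[0, 1]                # Pearson correlation
plt.scatter(packets, bytes_, s=8)
plt.xlabel("packets per flow")
plt.ylabel("bytes per flow")
plt.title(f"r = {r:.2f}")
plt.show()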
A variation of this technique was used by [Teoh et al., 2002] [viii] to visualize BGP routing table changes. It differs from the classical scatterplot in two ways: first, all four edges of the bounding rectangle have explicit meaning assigned to them (each represents a distinct group of AS numbers). Second, the rectangle itself is partitioned in a special way (using quadtrees) to encode the different IP address prefixes. This means that the relation between a point and the axes cannot be determined by a simple orthogonal projection of the point (because we are not using a Cartesian space), thus it must be drawn explicitly with a color-coded line. The paper found that this type of representation, while not very intuitive in the static case, is very helpful in an animated setting (where data from one day is mapped to one frame of the animation) because it creates patterns which can easily be categorized by a human observer as normal or anomalous. (Remark: there are alternatives to quadtrees for representing the complete IPv4 space; see the heat maps section for a discussion. Also, another service using quadtrees to represent networking data is IPMaps [ix], which can be used to browse BGP advertisements and darknet data with a Google-Maps-style interface.)
Illustration 6: Quadtree coding of IP prefixes. Left: top levels of the tree and the most significant bits of the IP prefixes represented by each sub-tree (sub-square). Four lines representing AS numbers surround the square representing the IP prefix space. Right: actual data. A line is drawn for every IP-AS pair in an OASC.
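The quadtree partitioning itself is easy to sketch: every two bits of the prefix select one of four sub-squares. The following Python function is my own rough reading of the idea (not the authors' code) and maps a prefix, given as a bit string, to a square inside the unit square:

# A rough sketch of quadtree coding: an IPv4 prefix maps to a square whose
# size reflects the prefix length. The bit-to-quadrant convention is mine.
def prefix_to_square(prefix_bits: str):
    """Map a bit string such as '1100101' to (x, y, side) in the unit square."""
    x, y, side = 0.0, 0.0, 1.0
    # Pad to an even number of bits so each level consumes exactly two bits.
    bits = prefix_bits + "0" * (len(prefix_bits) % 2)
    for i in range(0, len(bits), 2):
        side /= 2
        if bits[i] == "1":        # first bit selects left/right half
            x += side
        if bits[i + 1] == "1":    # second bit selects bottom/top half
            y += side
    return x, y, side

# Example: the square covering 10.0.0.0/8 (first octet 00001010)
print(prefix_to_square("00001010"))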
Another example of using scatterplots to visualize networking data (this time using a Cartesian coordinate system) can be seen in the user interface of the NVisionIP project [x]:
Illustration 7: "Galaxy View" of the NVisionIP project, showing an overview of the activity (color coded per port) on all the subnets and machines
Yet another version of scatterplots was used during the SC03 conference held in Phoenix, Arizona between November 15th and November 21st, 2003. Dubbed “The Spinning Cube of Potential Doom”, it was a three-dimensional cube representing source IP, destination IP and destination port. A point is drawn for each connection, defined by its (source address, destination address, destination port) tuple, and the cube is animated over time. While the primary goal of the software was education [xi], it was observed that several types of port scanning produce very distinct and remarkable patterns. Even so, the practical use of the software is limited, because these patterns can be observed only from specific angles, which vary depending on the (source, destination, port) tuple. This means that an analyst would have to constantly rotate the cube and observe it from different angles to make sure that she gets all the information. An alternative solution would be to project three views of the cube (one for each plane) in parallel for the analyst to observe. Even with these limitations, it is an interesting concept.
Illustration 8: The Spinning Cube of (Potential) Doom
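A rough Python sketch of the underlying idea (my approximation, not the original software; the connection tuples are randomly generated) would simply scatter the (source IP, destination IP, destination port) tuples inside a cube:

# A minimal 3D scatter of (source IP, destination IP, destination port)
# tuples, in the spirit of the cube described above (numpy, matplotlib).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 500
src = rng.integers(0, 2**32, n)        # source addresses as 32-bit integers
dst = rng.integers(0, 2**32, n)        # destination addresses
port = rng.integers(0, 65536, n)       # destination ports

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(src, dst, port, s=2)
ax.set_xlabel("source IP")
ax.set_ylabel("destination IP")
ax.set_zlabel("destination port")
plt.show()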
Bar chart
Bar charts are used to represent the relative frequency (or size) of several types of data. One axis holds the unit of measurement while the other holds the list of groups. It is also possible to add a third dimension to the data by presenting different types of measurements for each group (from different time-frames, for example) to represent both the relative frequency between groups and the change over time inside one group. One must be careful, however, because the method doesn't scale to a large number of categories in the “third dimension” and quickly becomes cluttered. In this case an alternative would be to redraw the data as a line chart containing multiple lines (one for each group of data) and placing the categories on the horizontal axis.
Illustration 9: Part of the NVisionIP interface showing the traffic distribution by ports with bar charts
Histograms are a special type of bar chart where the area (not the height!) of the bars denotes the frequency of a given group. This means that bars can have variable widths, although in practice this doesn't happen often (and thus is not widely known) because the horizontal scale usually contains equal-width parts (so the frequency ends up being represented by the height alone, as in classical bar charts). An important aspect of histogram drawing is choosing the number of bins (the horizontal categories the values get grouped into). There have been attempts to define the number of bins with a formula [xii], however these make strong assumptions with regard to the type of data being represented and are not suited for every type of data. This is the Achilles heel of histograms, because a wrong choice of the bin count can lead the observer to draw the wrong conclusion from the data. As an example, below are four renderings of the same dataset with varying bin counts (histograms taken from CMSC 120 – Visualizing Information):
Suppose that we are interested in determining the interval which contains the maximum number of elements. While the first three histograms progressively enhance the accuracy, the last one contains too much noise for us to be able to make this determination.
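The effect is easy to reproduce; the sketch below (my own, with invented packet-size data and arbitrarily chosen bin counts) draws the same bimodal dataset four times, showing how the apparent shape of the distribution changes with the bin count:

# A small sketch of the bin-count problem: the same data drawn with
# different numbers of bins suggests different shapes (numpy, matplotlib).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Hypothetical packet sizes: a mixture of small ACKs and full-size frames
data = np.concatenate([rng.normal(80, 15, 700), rng.normal(1400, 60, 300)])

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, bins in zip(axes, (4, 10, 30, 200)):
    ax.hist(data, bins=bins)
    ax.set_title(f"{bins} bins")
plt.tight_layout()
plt.show()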
Box plot
A box plot (also known as box-and-whisker plot) is a method for representing data together with its statistically significant properties. The properties included in the drawing are:
- Minimum – the lowest value in the dataset
- Lower quartile – cuts off at ¼ of the data
- Median – the middle value
- Upper quartile – cuts off at ¾ of the data
- Maximum – the highest value in the dataset
Illustration 10: Example for box plot taken from CMSC 120: Visualizing Information
In addition to these rigorously defined values, two other values are considered: the largest and smallest non-outliers. An outlier is a value which is numerically distant from the rest of the data. There is no unique method to determine whether a value is an outlier or not, so it can vary from case to case. Outliers are depicted as dots in the plot.
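The quantities involved are straightforward to compute; the sketch below (my own, using the common 1.5 × IQR convention for outliers, which is only one of several possible choices, and invented latency measurements) derives them and draws the corresponding box plot:

# A minimal sketch of the quantities behind a box plot (numpy, matplotlib),
# using the 1.5 * IQR rule as one possible outlier convention.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
latencies = np.concatenate([rng.normal(20, 4, 200), [65, 70, 90]])  # ms, with a few outliers

q1, median, q3 = np.percentile(latencies, [25, 50, 75])
iqr = q3 - q1
outliers = latencies[(latencies < q1 - 1.5 * iqr) | (latencies > q3 + 1.5 * iqr)]
print(f"Q1={q1:.1f}  median={median:.1f}  Q3={q3:.1f}  outliers={len(outliers)}")

plt.boxplot(latencies, vert=False)   # matplotlib uses the same 1.5 * IQR whiskers by default
plt.xlabel("latency (ms)")
plt.show()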
While the box plot presents a great deal of information in a compact manner, its interpretation (as well as its construction) requires at least some knowledge in the field of statistics. This limits its usability and, as of now, it isn't present in any network-oriented data visualization tool that I know of.
Pie chart
A pie chart (or circle graph) is a circular chart divided into sectors, illustrating relative magnitudes, frequencies or percentages. In a pie chart, the arc length of each sector (and consequently its central angle and area) is proportional to the quantity it represents [xiii]. A pie chart is used when the sum of the individual elements is a meaningful value (for example 100%). There are three variants of the pie chart: the exploded pie chart, the perspective pie chart and the polar area diagram, where the central angle is constant and the radius of each element varies with the data.
Illustration 11: Example for a three dimensional exploded pie chart - Wikipedia
Pie charts are not a good choice for representing networking data, for several reasons. First, research has shown that comparison by angle is far less accurate than comparison by length; the only exception is when the comparison is made between one slice and the whole. Second, pie charts by their nature present a bounded view of the data (for example: the traffic divided by protocol for one day) and thus are not useful for situations when a timeline of events has to be studied (as is often the case in networking).
Bubble chart
Bubble charts are a way to represent multi-dimensional (more than two dimensions) data. The basic idea is that two of the dimensions are placed along the axes. The center of each circle is placed at the intersection of the horizontal and vertical coordinates of the data point, and the radius of the circle represents the third dimension (this assumes that the third dimension is quantitative). Additional dimensions can be added via the color, shading, border width and other visual attributes of the circles; however, this method doesn't scale to more than 4-5 dimensions. Conversely, the number of dimensions can be reduced by eliminating one or both of the axes.
Illustration 12: A bubble chart, taken from "Visualizing the Signatures of Social Roles in Online Discussion Groups" by Howard T. Welser
These types of charts are not yet used in network monitoring systems, although they have the potential to be used successfully, because they can map time-series data into visual charts. Possible drawbacks are the lack of software support and people's unfamiliarity with this type of visual representation.
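As an illustration of how such a mapping could look for networking data (a sketch with invented per-host numbers, not taken from any existing tool), the fragment below encodes four dimensions: flows and distinct peers on the axes, traffic volume as bubble area and IDS alerts as color:

# A minimal bubble chart sketch: x and y carry two dimensions, the bubble
# area a third, and color a fourth (numpy, matplotlib; all values invented).
import numpy as np
import matplotlib.pyplot as plt

hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
flows = np.array([120, 40, 300, 75])        # x: number of flows
peers = np.array([15, 4, 60, 9])            # y: distinct peers
mbytes = np.array([500, 80, 2500, 150])     # bubble area: traffic volume
alerts = np.array([0, 1, 5, 0])             # color: IDS alerts

plt.scatter(flows, peers, s=mbytes / 5, c=alerts, cmap="Reds", edgecolors="k")
for h, x, y in zip(hosts, flows, peers):
    plt.annotate(h, (x, y))
plt.xlabel("flows")
plt.ylabel("distinct peers")
plt.colorbar(label="IDS alerts")
plt.show()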
Waterfall chart
A waterfall chart is a special type of floating-column chart. A typical waterfall chart shows how an initial value is increased and decreased by a series of intermediate values, leading to a final value. An invisible column keeps the increases and decreases linked to the heights of the previous columns [xiv].
Illustration 13: Example waterfall chart taken from Wikipedia
While these types of charts could be useful, for example, for representing the increase / decrease in network traffic, current software uses bar charts of delta values for this purpose.
Heat maps
A heat map is a graphical representation of data where the values taken by a variable in a two-dimensional map are represented as colors [xv]. An essential requirement for heat maps to be useful is that the parameters along the horizontal and vertical axes must be placed in such a way that “closeness” on the map equals “closeness” in the conceptual space of the problem. This is needed to make it possible to translate patterns from the visual space to the problem space. In the case of IPv4 there are two methods which have been used to achieve this: one is the quadtree encoding mentioned in the scatterplot section. The other is based on the Hilbert curve [xvi] depicted below:
Illustration 14: Hilbert curve, first to third order, from Wikipedia
This curve has the property that it tends to fill the entire two-dimensional space in an even manner, and if we place an ordered set of elements (like IPv4 addresses) along it, numerically close elements will be placed close together (from a Euclidean-distance point of view) in the representation. Below are two examples using this technique to provide a heat map for the entire IPv4 range. The first one is based on the “malicious activity” observed by Team Cymru [xvii] and the second one represents the size of networks advertised via BGP [xviii].
Illustration 15: Team Cymru Internet Malicious Activity Map at 02-05-2008
Illustration 16: The Measurement Factory - IPv4 address range assignment based on the BGP routes advertised at 02-05-2008
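For completeness, here is a sketch of how such a mapping can be computed. The d2xy routine below follows the widely published Hilbert-curve conversion algorithm; the choice of a 256 x 256 grid of /16 prefixes is mine and purely illustrative:

# A sketch of the Hilbert-curve mapping: each /16 prefix gets one cell on a
# 256x256 grid, and numerically close prefixes land close together.
def d2xy(order, d):
    """Convert position d along a Hilbert curve into (x, y) on a 2^order grid."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                      # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

def prefix16_to_cell(ip: str):
    """Place a /16 prefix (e.g. '130.149.0.0') on a 256x256 Hilbert map."""
    a, b, *_ = (int(p) for p in ip.split("."))
    return d2xy(8, (a << 8) | b)

print(prefix16_to_cell("130.149.0.0"))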
Heat maps need not be rectangular. Any arrangement which maps well to the contextual problem space is useful. For example, the Akamai Content Delivery Network provides a heat map [xix] representing the traffic delivered to each individual country, using a world map as its base:
Illustration 17: Akamai - overview of the delivered traffic at 02-05-2008
An interesting aspect of the above interface is that the way it shows details is inspired by the “Toolglass and Magic Lenses” paper [xx] (although it does not include the advanced interaction modes or the ambidextrous method of control described therein).
Graphs
Graphs are a natural way to represent networks or networking-related information because they map nicely to the paradigm of “a group of interconnected things” (network elements in the case of networks and nodes in the case of graphs). One well-known package using this paradigm is the “Network Weathermap” [xxi], which uses a background (usually a geographical map), points (corresponding to routers, switches, media converters and other networking equipment) and connections between these points (representing the physical connections between the elements). The connections are drawn as two opposing arrows (one for upstream and one for downstream traffic) and are color-coded (from white to red) to show the link usage percentage:
Illustration 18: Example Network Weathermap for the French ISP "Free" at 02-05-2008
Another frequent way of using graphs is to take the unique addresses of networking elements (for example MAC addresses at layer 2 or IP addresses at layer 3) as nodes, while the connections between them represent “flows” (which can be defined, depending on the protocol, as simply as “all the packets having a given source and destination” or as complex as a TCP flow with handshake, sequence numbers and so on). A tool using this approach is EtherApe [xxii], a screenshot of which can be seen below:
Illustration 19: EtherApe
The nodes are active network elements (routers, NICs, ...) represented by their addresses. The links between them represent the traffic, color-coded by protocol. The size of a node is proportional to the amount of traffic it generates.
While the previous two examples arranged the nodes based on external criteria (like geography, or an arbitrarily chosen shape – an ellipse in the case of EtherApe), a common practice is to let them “auto-arrange” by defining an attracting-repelling force between them (similar to a spring) and letting the layout settle in a way which minimizes these tensions. If the forces are well defined, the result is usually both aesthetically pleasing and useful. As an example, below is a color-coded “map of the Internet” from 1998, where distances are based on the hop count reported by traceroute from Bell Labs and the color coding is used to mark equipment belonging to the same organization (based on the first two parts of the DNS name retrieved by a reverse lookup):
Illustration 20: Map of the Internet from "Mapping and Visualizing the Internet" by Bill Cheswick, Hal Burch and Steve Branigan
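A minimal sketch of such a force-directed layout, assuming the third-party networkx and matplotlib libraries (the node names and traffic figures are invented), could look like this:

# A minimal force-directed ("spring") layout sketch using networkx.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edge("router1", "switch-a", mbps=120)
G.add_edge("router1", "switch-b", mbps=30)
G.add_edge("switch-a", "10.0.0.5", mbps=80)
G.add_edge("switch-a", "10.0.0.6", mbps=15)
G.add_edge("switch-b", "10.0.1.9", mbps=25)

pos = nx.spring_layout(G, seed=42)          # spring/force-directed placement
widths = [G[u][v]["mbps"] / 20 for u, v in G.edges()]
nx.draw(G, pos, with_labels=True, width=widths, node_size=600)
plt.show()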
Parallel coordinates
Parallel coordinates are a way to visualize high-dimensional data. Their most important property is that they can easily be extended to an arbitrary number of dimensions. To represent a point in an n-dimensional space, n equally spaced parallel axes are drawn, corresponding to the dimensions. The point is represented as a polyline connecting axes 1 to n; the point of contact between the polyline and the i-th axis is determined by the numeric value of the i-th coordinate of the point. This representation is a natural fit for networking data, which has several dimensions. Below you can see a screenshot from the rumint tool [xxiii] depicting a traffic capture along four coordinates (source IP, source port, destination IP and destination port):
Illustration 21: rumint tool showing a network trace - taken from the DEFCON 12 presentation "Network Attack Visualization" by Greg Conti
An additional positive aspect of this representation is that many interesting network-level activities (like an external host scanning the network, or vice versa) map to very distinct, easy to recognize and logical patterns, even if the observer doesn't know about parallel coordinates.
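The technique itself can be hand-rolled in a few lines; the sketch below (my own illustration, not rumint's code; the flows are invented and each axis is rescaled to [0, 1] so the very different value ranges can share one vertical scale) draws each flow as a polyline over four axes:

# A minimal, hand-rolled parallel-coordinates sketch (numpy, matplotlib).
import numpy as np
import matplotlib.pyplot as plt

axes_names = ["src IP", "src port", "dst IP", "dst port"]
# Hypothetical flows: (src IP, src port, dst IP, dst port) as integers
flows = np.array([
    [0xC0A80001, 40000, 0x0A000005, 80],
    [0xC0A80001, 40001, 0x0A000005, 443],
    [0xC0A80002, 55123, 0x0A000009, 22],
    [0xDEAD0001, 31337, 0x0A000001, 23],
])

lo, hi = flows.min(axis=0), flows.max(axis=0)
scaled = (flows - lo) / np.where(hi > lo, hi - lo, 1)   # per-column scaling to [0, 1]

for row in scaled:
    plt.plot(range(len(axes_names)), row, alpha=0.7)
plt.xticks(range(len(axes_names)), axes_names)
plt.ylabel("scaled value")
plt.show()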
A similar layout has been proposed by the book Visualizing Data [xxiv] as an alternative to scatterplots when inquiring about the correlation between two data sets. In the book's example the two data sets are positioned on two vertical axes and the connecting line is colored blue or red depending on its angle (a downward or upward angle corresponding to a positive or a negative correlation, respectively). The width of the line is also proportional to the angle.
Finally, a simple version of this method, with just two vertical axes representing the source and destination IP space, is used in other monitoring tools, for example tnv [xxv].
Illustration 22: The two-axis version of parallel coordinates used in tnv to represent the source and destination IP addresses as well as the source and destination ports
Grid layouts
We have already seen two methods of mapping IP addresses to grid layouts. If the number of IP addresses is small (for example an internal network, as opposed to all possible IP addresses, or the subset of IPs that internal hosts communicate with), it is possible to arrange them in a grid where individual cells have an acceptable size (for example 8 by 8 pixels). This means that individual IP addresses can be interacted with (clicked on, hovered over with the cursor) to filter the view or to display additional information.
The first example is a tool called VISUAL [xxvi]. The central idea of this tool is that the system administrator doing the monitoring is inherently both more familiar with the internal network and more interested in it. So the internal network is depicted as a large(ish) grid. External hosts involved in the communication are presented to the right. Their size is proportional to the amount of data flowing to/from them, and the small vertical lines depict the TCP/UDP ports used at their end during the communication. The external hosts are arranged based on their IP address: the first two bytes are used to calculate the vertical coordinate, while the last two bytes are used for the horizontal coordinate (these coordinates are then of course scaled to the available screen real estate). This means that external hosts are naturally grouped into class C networks. The lines, of course, represent communication between internal and external hosts.
Illustration 23: VISUAL, Home-Centric Visualization of Network Traffic for Security Administration
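The placement rule for external hosts is simple enough to sketch; the function below is my reading of the description above (not the VISUAL source code), with an arbitrarily chosen canvas size:

# A rough sketch of the placement rule: first two octets -> vertical,
# last two octets -> horizontal, scaled to the canvas.
def external_host_position(ip: str, width: int = 800, height: int = 600):
    a, b, c, d = (int(part) for part in ip.split("."))
    y = ((a << 8) | b) / 65535 * height    # first two bytes -> vertical
    x = ((c << 8) | d) / 65535 * width     # last two bytes -> horizontal
    return x, y

# Hosts sharing the first three octets end up close to each other:
print(external_host_position("203.0.113.10"))
print(external_host_position("203.0.113.200"))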
Another tool based on similar ideas is PGVis and its three-dimensional brother PGVis3D. These tools don't seem to be public yet, so all the information here was collected from a presentation entitled “Visualization for Network Traffic Monitoring & Security” by Erwan Le Malécot. The idea is that both the external and internal IP space can be mapped to a usable grid size. This is possible because the internal IP space is small and because from the external IP space we consider only the addresses we have seen communication with. There are four distinct grids: internal source, internal destination, external source and external destination. The lines are color-coded based on the traffic type.
Illustration 24: Example PGVis screen
The author of PGVis also experimented with a three-dimensional representation, but this interface has the same problem as other 3D systems: it requires constant adjustment of the viewpoint.
Treemaps
Treemaps are a way to represent nested structures (trees) while giving information about their size and parent-child relationships at the same time. They were originally developed to help discover large files and directories on a drive [xxvii], but later got applied in many other fields. The original concept (and still most of the current implementations) used rectangular shapes, but alternative solutions have appeared, like circular treemaps [xxviii].
Illustration 25: Example treemap showing the space distribution on a hard disk - from Wikipedia
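As a sketch of the underlying layout idea (the simple “slice and dice” variant; most real implementations use the squarified algorithm instead, and the sizes and labels below are invented):

# A minimal "slice and dice" treemap sketch (matplotlib).
import matplotlib.pyplot as plt

def slice_and_dice(items, x, y, w, h, ax, depth=0):
    """Recursively split the rectangle (x, y, w, h) proportionally to sizes."""
    total = sum(size for _, size, _ in items)
    for name, size, children in items:
        frac = size / total
        if depth % 2 == 0:                      # alternate the split direction per level
            rect = (x, y, w * frac, h); x += w * frac
        else:
            rect = (x, y, w, h * frac); y += h * frac
        ax.add_patch(plt.Rectangle(rect[:2], rect[2], rect[3],
                                   fill=False, linewidth=2 - depth * 0.5))
        ax.text(rect[0] + rect[2] / 2, rect[1] + rect[3] / 2, name,
                ha="center", va="center", fontsize=10 - depth * 2)
        if children:
            slice_and_dice(children, *rect, ax, depth + 1)

# Hypothetical traffic per AS, nested by peer network
data = [("AS 1", 60, [("peer A", 40, []), ("peer B", 20, [])]),
        ("AS 2", 30, []),
        ("AS 3", 10, [])]

fig, ax = plt.subplots()
slice_and_dice(data, 0, 0, 1, 1, ax)
ax.set_xlim(0, 1); ax.set_ylim(0, 1); ax.set_axis_off()
plt.show()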
In the context of network data visualization, treemaps were used by the DBVIS research project [xxix] as a way to map out the IP space either by “owner” (based on AS numbers and BGP routing data, as in the other “Internet mapping” projects presented earlier), by country (the size of IP blocks assigned to a given continent / country) or by the traffic to/from each AS.
Illustration 26: Example treemap from the DBVIS project mapping out the connections to the University of Konstanz on Nov. 29, 2006 aggregated on autonomous systems.
Conclusion
Data visualization for networking data is an emerging field which should see a rapid uptake in the coming years as the volume of data grows and the pressure to make correct, fast and informed decisions increases.
While “classical” representations (like bar charts, pie charts, etc.) don't provide enough detail to be useful on their own in the multi-dimensional world of networking data, they are still useful for particular sub-parts of it. For the “big picture” overview several methods have been devised, some of which scale to the level of the entire Internet. Combining the two is probably the future of user interfaces in network-data analysis systems.
[iv] Edward Tufte – Envisioning Information (1990)
[viii] Case Study: Interactive Visualization for Internet Security – Soon Tee Teoh, Kwan-Liu Ma, S. Felix Wu (2002)
[x] NVisionIP: An Animated State Analysis Tool for Visualizing NetFlows – Ratna Bearavol, Kiran Lakkaraju, William Yurcik
[xi] The Spinning Cube of Potential Doom – Stephen Lau – http://www.nersc.gov/nusers/security/TheSpinningCube.php at 02-05-2008
“The primary goal was education. There has always been misinformation and hearsay regarding computer security and it is sometimes difficult for those unfamiliar with computer security to conceptualize the overall extent of malicious traffic on the Internet today.”
[xx] Toolglass and Magic Lenses: The See-Through Interface – Eric A. Bier, Maureen C. Stone, Ken Pier, William Buxton, Tony D. DeRose (1993)
[xxiv] Visualizing Data – Exploring and Explaining Data with the Processing Environment by Ben Fry, p. 123
[xxvi] Home-Centric Visualization of Network Traffic for Security Administration – Robert Ball, Glenn A. Fink, Anand Rathi, Sumit Shah and Chris North