
Friday, May 30, 2008

Multi-threaded Visual C rand

0 comments

I was helping out a friend who was trying to generate random numbers in several threads and he was expecting those numbers to be (at least somewhat) different. After getting the obvious problem resolved (you must call srand to initialize the seed - using, for example, the current time), we were still getting threads which were outputting the same sequence of (pseudo-)random numbers.

At this point the question came up whether srand and rand were thread safe / thread aware (meaning that the static variables they use were per process or per thread). The documentation of srand didn't say anything nor did the documentation of rand. A few quick searches for keywords (like Visual C, stdlib, thread safety, MSDN) turned up only articles related to C++ STL. At this point we decided to give it a last go and tried setting the seed in each thread to time + thread number (because the time would have been identical, given that any modern OS can start up a few threads within the same second). The results were the same: different threads producing the same sequence. At this moment we had two options:

  • Creating a central wrapper around rand protected by a mutex
  • Implementing our own random number generator.

We went with the second option, searching for a public domain random number generator, and found this page. We simply inlined it into our thread function and used a local (stack) variable to store the seed, which was unique to the thread (since stacks are unique to threads).

PS. These random numbers are for toying only (doing some simple experiments). If you need to use them for production grade software, take a look at methods to generate strong random numbers.

Update: after loading a test program in IDA, it seems that the VC runtime does take threading into account (i.e. it stores the random state in TLS on a per-thread basis), which means that my initial diagnosis was wrong...

Update: ok, I think I got it (man, multithreading is hard). The code was something like this:

InterlockedIncrement(&aGlobalVariable);
threadID = aGlobalVariable;

This code is not thread-safe, because only the increment itself is guaranteed to be atomic, and it is possible to have the following scenario (which is probably what was happening):

  1. Thread A increments the value
  2. Thread B increments the value
  3. Thread A reads the value
  4. Thread B reads the value

To avoid this, here are a couple of possible solutions:

  • Use the value returned by InterlockedIncrement directly, rather than doing a separate read of the value
  • Pass the identifier to each thread through the parameter it receives when it is created.
  • Use the GetCurrentThreadId function together with the current time to initialize the random number generator (this is again not 100% foolproof because thread id's - just as process id's - get reused), but it should be good enough.

Wednesday, May 28, 2008

Stop thinking in stereotypes!

0 comments

Stereotypes may help you form a quick opinion about matters, however you will almost certainly be wrong. Romania has a few such associated stereotypes (like orphans), but the one related to IT security is that of the East-European criminal.

In line with this perception comes the latest F-Secure blog post, Romanian Whack-A-Mole and Linux Bots (disclaimer: I work for a security company, but these views are my own).

People, please wake up and smell the roses: security is a global problem! The fact that a given IP is located in a given country tells you almost nothing about the real perpetrators behind the scenes! I could quote studies saying the USA is the biggest source of spam or that China has the most bots, but these are meaningless. The Internet is truly a cross-nation phenomenon (I invite you to check out the C&C map and bot map from shadowserver - as you can see, they approximately follow the distribution of computers rather than arbitrary country/region borders). Shutting it down in one place will just move it to another country, unless we can act united.

PS. I have no connection to Shadowserver (just a fan of their effort). Check out this presentation for AusCert08 given by one of the members. Good stuff.

The human aspect of security

0 comments

The weakest link in security is the human. This is both good (if the weakest link in your system isn't the technology, you have succeeded from a technological standpoint) and bad (because you must learn new skills to try to mitigate the new threat).

A couple of days ago I downloaded the Ubuntu 8.04 ISO and checked the MD5 (mainly to see if it was corrupted) when I remembered something I've read somewhere (sorry for not remembering the exact source; if someone can provide the link I'll update the post - or maybe it is just one of those ideas floating out there):

While MD5 isn't the strongest hash out there, the way people tend to check the hashes makes it even less secure. We tend to check them by looking at the first few digits and (maybe) at the last ones. We are trained to recognize words by length and letter composition, not necessarily by the exact content. This means that creating a hash which would fool a human with high probability is much simpler than creating an exact match.

The method I usually use is pasting the two hashes I wish to compare into a text editor (vi, emacs, nano, gedit, whatever makes you happy) one below the other, and then scanning horizontally for differences. I found that doing this is a quick (and much more reliable) method for spotting differences.
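If you want to take yourself out of the loop entirely, you can let the machine do the exact comparison. A minimal sketch in Perl (the script name and usage are made up for illustration):

#!/usr/bin/perl
# cmphash.pl - compare two hashes exactly instead of eyeballing them
# usage: perl cmphash.pl EXPECTED_HASH ACTUAL_HASH
use strict;
use warnings;

my ($expected, $actual) = @ARGV;
die "usage: $0 EXPECTED_HASH ACTUAL_HASH\n" unless defined $actual;

if (lc $expected eq lc $actual) {
    print "MATCH\n";
} else {
    print "MISMATCH!\n";
    exit 1;
}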

BTW, the same idea applies to other hashes, and to any situation where people are asked to compare complex (and, to them, meaningless) strings. Such comparisons should always be performed by the computer in a security system. This is something SSH gets right, for example: it compares the fingerprint of the server and warns you if there isn't an exact match. Of course this is only a partial solution, because you have to confirm a new fingerprint through some other channel (by phone, for example) and we're back to people comparing long and complex strings.

The new rm -rf /

0 comments

There are many urban legends out there talking about n00bs asking a *nix related question and getting the answer just do rm -rf / from the terminal (by the way, you don't want to do that - it tries to recursively erase all the files from your hard-drives - in general when you get advice from a website, you really should corroborate the advice with the documentation).

Anyway, back to the story: I was sitting beside a sysadmin friend and we needed to kick out some processes connected to the machine, so that we could change port bindings and all that stuff. I don't know why, but instead of doing netstat -anpt and then killing off the connected processes, I suggested doing rmmod ipv6. This can't hurt, since we're connected through IPv4, right? And when it refused, saying the module was in use, I said surely there must be a force switch.

And indeed there was. And the network connectivity went away. And we had to task someone from that building (this was all going on through SSH, over several hundred kilometers) to go and reset the machine. Surprisingly (or maybe not), the same night the box developed a filesystem corruption and caused a couple of hours of outage.

So there you go: rmmod -f ipv6 is the new rm -rf /.

Monday, May 26, 2008

Kernel 2.6.24 + PostgreSQL != love?

2 comments

It seems so. The sad thing is that at the moment Ubuntu Server 8.04 (LTS) comes with 2.6.24... Ouch. Hopefully the patches will trickle down quickly. Also, note to myself: newer is not always better.

Update: back to the future - 2.6.24. Thank you Marcin for pointing it out. Also corrected the tags.

Installing Perl 5.10 on Ubuntu

3 comments

So I upgraded to Ubuntu 8.04 and I'm not very impressed unfortunately. I got Compiz working, thanks to Compiz-Check (it works only at lower resolutions, so I switched it off, however it's nice to have the option) and also Monodevelop 1.0, however the installed Firefox is slightly outdated (Beta 5 rather than RC1) and sound doesn't seem to work with Flash. Also, there is no Perl 5.10 package in the repositories. After searching around a bit I decided that doing an install from source was not a good idea, unless I wanted to screw up my whole system and/or was ready to pull all the dependencies and choose an alternative install path.

ActivePerl to the rescue. Get your .deb from the site (which is a little tricky, but if you read carefully you'll realize that you don't need to enter your contact information to get to the download) and follow the instructions to get Perl 5.10 installed in the /opt directory.

Sunday, May 25, 2008

Test for available modules in Perl

0 comments

As I mentioned earlier the difference between use and require is that the second is evaluated only at execution time, making it possible to test if a given module was imported successfully. One possible use for this is to make your script deployable on multiple machines where you might or might not have the option to install modules from CPAN/PPM.

Below you can see an example of finding the YAML library in multiple possible places and then aliasing it to a variable, so that it can be referred to in a unique way ($yaml_dump->($stuff)), no matter which library it got imported from:

#try to determine the installed YAML library
my $yaml_dump;
eval {
 require Module::Build::YAML;
 $yaml_dump = \&Module::Build::YAML::Dump;
};
eval {
 require PPM::YAML;
 $yaml_dump = \&PPM::YAML::serialize;
} unless ($yaml_dump);
eval {
 require YAML;
 $yaml_dump = \&YAML::Dump;
} unless ($yaml_dump);
die("Failed to locate the YAML library!\n") unless ($yaml_dump); 

Mixed links and commentary

0 comments

MS Office took a page out of IdeaJ's book and uses every available method to annoy users (check for valid licenses) - on the bright side, hopefully I will have time to update my machine to Ubuntu 8.04 today :-)

The Backup Song. Very, very funny!

Writing a small web crawler in Python. While it demonstrates how you can write almost anything in a few lines of code, it most probably will annoy the hell out of people like IncrediBill (for example, there is no checking for robots.txt there!).

A slightly older post I discovered recently: Four C Programming Anti-Idioms. A followup from David LeBlanc involving C++: Checking Allocations & Potential for Int Mayhem. This reminds me that C++ is hard. I recently tried to read a book about it written by the man himself and was terrified by the incredible amount of complexities and possibilities to get it wrong. I think it is a safe bet to say that 80% of C++ programmers don't know even 10% of these pitfalls (then again, 60% of all statistics are made up on the spot) - my university professors certainly didn't seem to know them. Here is a post from Thomas Ptacek making the same points.

Perl split gotcha

0 comments

This is one of those things which are spelled out in the documentation, but most people (myself included) don't really read the fine manual until really, really forced to - and from the way it's described, it's not immediately clear how it can byte you. From perldoc:

Empty trailing fields, on the other hand, are produced when there is a match at the end of the string (and when LIMIT is given and is not 0), regardless of the length of the match.

Now consider the following example:

print join(',', split(/,/, 'a,b,c,d,'));

Which prints out "a,b,c,d", because split ignored the last (empty) element(s). To fix this, specify a negative limit. As per the documentation:

If LIMIT is negative, it is treated as if an arbitrarily large LIMIT had been specified.

Combined with the previous snippet, this gives the resolution to the problem. You might want this behavior or you might not - the idea is to be aware of it so that you can apply it when needed. I needed it when stripping whitespace from the lines of source code files. Since the change was intrusive enough already, it was very important to preserve the number of newlines at the end of the file, hence the need for the technique described above when splitting on newlines.
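A minimal illustration of the fix, using the same input as above:

print join(',', split(/,/, 'a,b,c,d,', -1));
# prints "a,b,c,d," - the trailing empty field is now preserved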

The difference between use and require in Perl

2 comments

Contrary to PHP (if you've ever used it), require is the more dynamic of the two. Both statements are used to import modules, however use is executed at compile time (i.e. when the parser runs through the script), while require is executed when the actual line is evaluated.

Generally speaking you should use use (:-)), because you get the maximum and earliest warning if you reference an unavailable module. However there are some uses for require, like detecting available modules (as I will discuss shortly in another post) or reducing memory consumption.

Concrete case: I manage a server which processes files. The processing conceptually has a pipe architecture: a file comes in, a script processes it and puts it in a directory for a second script to process, which in turn puts it in a third directory for a third script to process, and so on. Until recently the scripts were activated at regular intervals, cron style. However, in an effort to reduce latency, I redesigned it to work the following way:

  • The scheduler's sole role is to restart scripts which have ended
  • At the start the script waits for a given period for a synchronization object
  • When the period passes or the synchronization object becomes signaled, it starts execution

This means that there are two possible reasons a script can start execution:

  1. The timeout has passed (just like in the original design)
  2. Another script signaled that it has provided work for the given script

However one side-effect of the new architecture which I did not anticipate (although it shouldn't be rocket-science) is that processes sit idling around most of the time consuming 20-30MB of memory, because they load all their modules as they start up. The way I resolved this problem was:

Made a minimal module containing the code to wait for the event

Made sure that the scripts reference it as soon as possible (i.e. made each script start like this):

use strict;
use warnings;
use waiter;

wait_for(600);

Converted the imports of the other modules to require. What you need to remember is that:

use Foo;
# is equivalent to
BEGIN {
  require Foo;
  Foo->import();
}
use Foo qw(foo bar);
# is equivalent to
BEGIN {
  require Foo;
  Foo->import(qw(foo bar));
}
use Foo (); #not importing anything, not even the default things
# is equivalent to
BEGIN { require Foo; }

This cut down the memory usage from 20-30MB to around 5MB. Woot!

Update: I forgot to mention the use if module from CPAN which is somewhat similar in scope.

Update: As Anonymous correctly points out, for perfect equivalence between use and require I should have put the require calls in BEGIN blocks. You can see the difference if you try using the imported things before the actual import (without the BEGIN block it won't work, with the BEGIN block it will) - but that is bad practice in general.
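To illustrate the difference, here is a minimal sketch (Digest::MD5 is used purely as an example of a module which exports a function on request):

use strict;
use warnings;

# This call sits textually *before* the import, yet it works, because the
# BEGIN block below runs at compile time, before any runtime statement.
print md5_hex('some data'), "\n";

BEGIN {
    require Digest::MD5;
    Digest::MD5->import('md5_hex');
}

# Replace the BEGIN block with a plain
#   require Digest::MD5; Digest::MD5->import('md5_hex');
# and the print above dies at runtime with "Undefined subroutine &main::md5_hex".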

Saturday, May 24, 2008

The problem with amateur crimefighters

0 comments

I wish to preface this with the fact that I am a deep believer in cooperation and data sharing. Also, I really appreciate the work that volunteers put into maintaining different resources (like the excellent CastleCops forums).

But you have to remember that these people are not professionals and sometimes don't have a complete understanding of all aspects of an issue. Still, people cite them as references and base decisions on their opinions. The Internet was regarded as the ultimate place for meritocracy, however sometimes it turns into a who-can-yell-louder and/or popularity contest.

Concrete example:

The DNS Black Hole project puts out a list of domains to block (or black-hole - hence the name I suppose). Until recently they did not have an official policy on removing domains. Recently they put up a post in which they try to clarify their take on the issue of false positives, and seem to take a (from their point of view) quite reasonable stance that they are just an aggregator and if you wish a domain to be removed, you should contact the original source.

However this raises the question of the quality of their data. I understand that they don't have the capacity to validate every single submission, but if they can't even check out false positives, is this really a blocklist you wish to use? You might as well start blocking entire countries...

Sometimes they realize that they are blocking an unrelated third-party service (like recently, when they announced that they are adding some dynamic DNS providers to the blacklist because they are used extensively by malware), and sometimes they don't. The current list includes at least two free services from Romania which offer free webhosting and probably from time to time host malware, just like Geocities. But you won't find all Geocities sites blocked by it, even though both of these Romanian / lesser known sites are blocked completely. I tried to contact them a few weeks back to let them know about the problem and I have yet to receive any feedback.

The maintainer of the site also offers his (or her?) expert opinion about how to fix the problem: remove iframes and detect obfuscated javascript. This basically demonstrates that s/he has no substantial understanding of either HTML or Javascript.

I'll talk a little about the Javascript idea, because this seems to be the wider misconception. First of all, Javascript packers (for better or for worse) are used on commercial sites (think CNN, CNET, etc.). If you've seen code beginning with function(p,a,c,k,e,d), that's actually a packer used on a lot of commercial sites! Second of all, the idea of detecting obfuscation is too vague, and possibly (similar to the problem of writing an algorithm to detect any computer virus) impossible to solve. Third of all, browsers are not (and should not be) in the business of producing blacklists/whitelists (becoming some sort of AV company, basically). They should try to create additional measures of security, however creating such lists is probably too big of an overhead for most of them (many of them being open source) and just replicates the problems of AV engines on yet another level. If you want blacklist-based protection against malicious Javascript code, get an AV which offers this.

PS. Sorry for ranting / sounding jaded. I want to emphasize again that I do appreciate all the work put into these (free) services; it's only that I wish people would investigate claims before putting their faith in some of these sources (and also the fact that I can't seem to get to sleep :-)).

An alternative for Perl heredoc's

0 comments

Perl has (true to its motto - there is more than one way to do it) many methods for declaring strings. Here are a few:

  • The single quote (') - does not interpolate variables, does not understand escape sequences (like \n for newline)
  • The double quote (") - interpolates variables (replaces $foo with the value of the scalar foo), understands escape sequences
  • The generic single quote operator (can be written as q{}, q//, q() and so on - for details see the perlop page) - behaves just like the single quote, but uses an alternative separator character
  • The generic double quote operator (qq{}, qq//, qq(), etc) - the same for the double quote operator (interpolates variables)
  • The custom, or heredoc (short for here document), syntax

In general the best practice is to choose a quotation operator based on two things:

  1. Do you need variable interpolation? If not, don't choose an operator which interpolates them. This will make the script faster and communicate your intent
  2. Do you need to use characters in the string which are identical to the quotation operator itself and thus need to be escaped (like " in "this is a \"test\"")? If so, you should choose another quotation operator whose separator is not present in the string, so that no escaping (line noise) is needed.

Finally, if you need to inject larger pieces of text (like HTML), you should use heredocs. Their general syntax is as follows (where separator can be an arbitrary word):

my $foo = $suff . <<SEPARATOR . $more_stuff;
some lines
and more lines
...
SEPARATOR

Some important things to remember about heredocs:

  • You need to use two less-than signs (not three as in PHP)
  • The separator is a placeholder in the statement. So you need to finish your statement properly (with a ;) before you begin the actual content of the string. This also means that the heredoc marker need not be at the very end of a line (as it is shown in most examples).
  • The final separator needs to be at the beginning of the line with no ; following it!

Finally, heredocs have also two variants: interpolating and non-interpolating. You can choose between the two versions by putting the initial separator between double quotes (for interpolating) or single quotes (for non-interpolating). By default, if you omit quotes, interpolation is used.
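A quick example of the two variants:

my $name = 'world';

my $interpolated = <<"END";    # double quotes (or no quotes at all): interpolates
Hello, $name!
END

my $literal = <<'END';         # single quotes: no interpolation
Hello, $name!
END

print $interpolated;   # Hello, world!
print $literal;        # Hello, $name!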

The biggest problem with heredocs is that they can't be indented nicely, thus making a mess of your nicely formatted source code. String::TT comes to the rescue. The idea is really simple: take a multi-line string, find the shortest run of spaces preceding a line and remove that many spaces from each line. It creates a much nicer look. If you have restrictions on installing modules from CPAN, you can take a peek at the source code and create your own version of it.
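A minimal (and simplified) sketch of what such a hand-rolled version might look like - this is only an illustration of the idea, not what String::TT actually does internally:

# Remove the shortest run of leading spaces common to all non-empty lines,
# so that heredocs can be indented along with the surrounding code.
sub dedent {
    my ($text) = @_;
    my @lines = split /\n/, $text, -1;   # -1: keep trailing empty fields
    my $indent;
    for my $line (@lines) {
        next unless $line =~ /\S/;       # ignore blank lines
        my ($spaces) = $line =~ /^( *)/;
        $indent = length $spaces
            if !defined $indent || length $spaces < length $indent;
    }
    return $text unless $indent;
    s/^ {$indent}// for @lines;          # strip the common indent in place
    return join "\n", @lines;
}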

Friday, May 23, 2008

Web Application Firewalls - are they useful?

1 comments

I was looking through a presentation by .mario about PHPIDS (embedded below for your convenience), which got me thinking about Web Application Firewalls (or WAFs for short).

Currently I don't see very much value in WAFs. My way of thinking goes something like this - there are two types of web applications you might run on your server:

  • Those you have the source code for (either because it's open source, developed in-house or delivered in a source-code format)
  • Those you don't have the source code for

From what I've seen (disclaimer: I did not have the opportunity to use PHPIDS or mod_security in production yet), to really tailor the WAF for your site, you must have a detailed knowledge about the type of data expected. If you take the time to gain this knowledge, wouldn't it be better/easier to fix the source code directly?

My conclusion is that WAFs can be a last-step preventative measure with their generic rules, however source code review is much more effective in finding (and fixing) the vulnerabilities. Probably most of you will say we knew this already, but there are tendencies out there to equate the two (from what I understand the PCI guidelines present them as alternatives to one another).

PS. You should also check out the article over at nullbyte about how input validation is not the be all and end all of security.

PS no. 2: It may seem that I've been very dismissive of WAFs and would never use them. Just to clarify: I think that they should be used because they provide another layer of defense (and also have a very good ROI from a business standpoint), however if you need some serious security (if you are handling sensitive information), you shouldn't stop there.

Thursday, May 22, 2008

Converting rows (records) to and from arrays in Postgres

5 comments

Arrays are one of the more special features in PostgreSQL. Like any of the more esoteric features, they have people both in favor of and opposed to them. On the pro side you have the fact that you can store an arbitrary number of elements without wasting space and/or having a cumbersome table structure. On the con side you have the fact that this steps outside of the relational realm. Also, you can't easily (i.e. without writing triggers) enforce foreign key constraints on them.

Like them or dislike them, if you need to use them, here is how to convert rows to arrays and vice-versa.

First to convert from an array into a rowset, you can use the solution given by merlin:

create or replace function explode_array(in_array anyarray) returns setof anyelement as
$$
    select ($1)[s] from generate_series(1,array_upper($1, 1)) as s;
$$
language sql immutable;

select * from explode_array(array[1,2,3,4]);

The beauty of this solution is that the sql language is guaranteed to be available in every database (unlike the other procedural languages, which may need to be installed separately). Also, remember that most of the time you don't need to transform arrays to rows, because PostgreSQL supports arrays directly in the ANY/ALL operators.

To get the reverse (for which credit goes to the guys and gals on the #postgresql IRC channel), do the following:

select array(select id from table)

Watch out: if you get a long array (more than 1000 elements, for example), pgAdmin won't show it (it will seem as if the returned result were empty). This is just an interface glitch, but it can be misleading.

PS. A little rant/request about the documentation: when you search for postgres related information in search engines, many times outdated versions of the documentation come up at the top, which can be slightly annoying (and also it can give the wrong advice). I suggested on #postgresql to either automatically forward the search-engine crawlers to the latest version of the documentation or have a link at the top saying view this page for the latest release. My idea didn't seem to get any traction, so here are some alternatives:

As was pointed out to me in the discussion, current is an alias for the latest release, so the following two URLs are equivalent at the moment (given that 8.3 is the latest release):

http://www.postgresql.org/docs/current/static/functions-comparisons.html

http://www.postgresql.org/docs/8.3/static/functions-comparisons.html

You can use this information in two ways:

First, when you search for the documentation, include inurl:current (if you are using Google) in the search to get the most up-to-date documentation (probably there are similar operators for other search engines as well).

Second, when you are linking to the documentation, use the version with current rather than with a specific version number, so that over time we can direct search engines more towards those URLs (which are updated after every release).

Update: in 8.4 there will be built-in functions to do this. Now if we could get an unsigned int datatype...


Update: If you have a function which takes an array as a parameter and you would like to pass in the result of a query, just wrap it into ARRAY(...) like this: SELECT * FROM count_array(ARRAY(SELECT * FROM test))

Sunday, May 18, 2008

Random links and commentary

0 comments

From the Mechanix blog comes the tale of the blocking CREATE INDEX call under PostgreSQL - I consider myself lucky that the databases I run are of internal use and I can permit myself to take them offline for a couple of minutes.

Via use Perl;: comments in the Perl debugger. Reminds me of the Linux kernel swear count.

Also use Perl;: 10 modules you might not know (embedded below for your convenience).

Via the Financial Cryptography blog: What makes a Security Project?. Interesting conclusions. In my opinion Microsoft has made great strides in recent years to focus on security and their current code quality is lightyears ahead of many of its competitors (Oracle anyone? :-))

Via the 1 Raindrop blog: security isn't a big topic at developer conferences. Really sad.

F-Secure says in their system requirements: Pentium 4 1.6 GHz equals AMD Sempron 3400+. Those Pentiums have to have some secret sauce in them. It's all about the Pentiums baby :-).

Via the Google Security blog: they will start to offer more information about malicious sites. This is great since IMHO information sharing is the weakest point of the current security industry.

Via the /dev/random blog: security@work (embedded below for your viewing pleasure)

Two humorous takes on the Debian random number generator problem:

Dilbert:

XKCD:

An argument (against) PHP

1 comments

Via Perlbuzz I landed at the blog posting An Argument for PHP, which I disagree with.

First a little about my background: I've been programming in PHP almost twice as long (6+ years) as in Perl, so (hopefully) it isn't the case that I don't know what I'm talking about.

PHP seemed nice and shiny when I started using it, however after trying Perl, I realized that it's nothing more than a Perl wannabe. The two languages have similar roots (both started out as simple projects because someone needed to get something done and evolved from there), however Perl is ahead of PHP by almost 10 years (Perl appeared in 1987, and PHP in 1995). I know that this is not a scientific measure at all, but still I feel that Perl 5.10 is something like PHP 8.

PHP touts ease of use as its main argument, however this is very misleading. As I said earlier, you must know an awful lot before you consider doing a web application. In what I consider to be poetic justice, this propaganda came back to bite PHP and is basically responsible for the widespread opinion that PHP equals insecurity. No, PHP is not more or less secure than other server-side programming languages, but it attracted beginner programmers who wrote insecure programs with PHP in the name (phpBB anyone?) and thus ruined its reputation.

Perl has more features and more code available (just compare the number of modules on CPAN with PEAR, which, from what I've heard on the PHP Architect podcast, is kind of dead). It can run in-process (via mod_perl) just like PHP. And while debugging and profiling tools for PHP are slowly evolving or almost dead (the WinCacheGrind project hasn't had a release in more than three years, and neither has its Linux counterpart KCachegrind), you can write a debugger, profiler, etc. for Perl in under 10 lines.
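To illustrate that last claim, here is roughly what a minimal statement tracer looks like (essentially a stripped-down version of what Devel::Trace does; the Devel::MyTrace name is just a placeholder - save it as Devel/MyTrace.pm somewhere in @INC and run your script with perl -d:MyTrace script.pl):

package Devel::MyTrace;

# The perl debugging hooks call DB::DB before every statement when the
# interpreter is started with -d, so printing here gives a statement trace.
sub DB::DB {
    my ($package, $file, $line) = caller;
    print STDERR ">> $file line $line ($package)\n";
}

1;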

Perl has had support for bound variables in SQL statements since before the PHP project even started (!), while PHP went on for years producing many vulnerable applications.

And where are my language-integrated regular expressions, safety verifications (use strict), tainted variables, anonymous subroutines, closures, meta-programming support and so on?

In conclusion: certainly Perl is not without its flaws, but PHP doesn't even come near it. It does need a more substantial investment to learn, but the payoff and productivity gained from it is certainly worth it. And don't be fooled: while PHP might be easier to pick up, there are still many things you must learn which are outside the scope of PHP to make good (and safe) web applications and PHP doesn't do the Internet any favors by advertising itself as an all you need to know language.

Saturday, May 17, 2008

Dynamic languages, the universe and everything

0 comments

From Planet Perl I somehow ended up at a transcript of a talk about dynamic languages. It just so happens that during the same time I was reading the paper Eudaemon: Involuntary and On-Demand Emulation Against Zero-Day Exploits.

The paper is an extension of the Argos project, which tries to identify zero-days by correlating network traffic with executed code. Their basic idea is that if something gets executed which resulted from the network traffic, it is the sign of a zero-day. While it is an interesting idea and remarkable that they got it to work at a reasonable speed, in my opinion it has very limited use since:

  • People having undisclosed vulnerabilities will use them in targeted attacks (hence the probability of it hitting honeypots is fairly small)
  • Most attacks these days are of the pull type rather than push (firewalls did have some effect on the ecosystem). This means that you have to visit the right URLs at the right time with the right browser (although this last criterion is relatively easy to satisfy, just use IE).
  • The method can have FP issues with technologies which use JIT to speed up execution (Java and if I recall correctly the next version of the Flash player and even some Javascript interpreters), since they execute code resulting from network data

This analysis is done by using Dynamic Code Translation, that is, disassembling the code to be executed and creating an equivalent code which includes instrumentation. The difference between the Eudaemon and Argos projects is that the first applies the method on-demand on individual processes running inside of an OS, while Argos virtualizes the whole machine.

This (and the point from the talk that dynamic languages can be made fast) got me thinking: what would be the advantages of an OS built entirely on a VM? One would be verifiability (given the small(ish) size of the VM, one could verify it, or even use mathematical-proof style methods to show that certain things - such as one process writing to the address space of another - cannot happen). Another would be the possibility of rapid change. OSs wouldn't have to wait any more for CPU vendors to include hardware support for virtualization, for example; everything could be changed by simply changing the VM.

Then again, there is nothing new under the sun: a mathematically sound VM would be very similar to a CPU, in the design of which there is a lot of mathematics involved. Binary translators have existed for a long time (just take the Qemu project as an example). And there is the Singularity project, which aims to implement an entire OS in MSIL on top of a thin VM. Also, tools like Valgrind and Dynamo are already doing user-mode-only DCT (not to mention the user-mode version of Qemu).

Using DCT to implement an alternative security model is an interesting idea (and certainly viable with today's processor power). The only problem would be that the original code and the translated code have to co-exist in the same memory space, which under 32 bit processors can lead to memory exhaustion. This can be solved by creating a separate memory space for the compiled code or by moving to 64 bit (or even by saying that this technology is not for memory-hungry applications like SQL Server).

Then again, maybe I'm overcomplicating things, since in current OSs processes by definition must turn to the kernel to do anything, and monitoring the user-kernel communication channel should be enough to implement any kind of security system. Although MS is against this currently (see Patchguard), it is not impossible that they will introduce the possibility to filter all such calls in the future.

Luminous CD envelopes

0 comments

While reading the Luminous band-aids post over at the Universe of Disclosure blog, I was reminded of a similar event with a CD envelope a couple of years ago (the kind CDs come attached to in magazines, with a round plastic window in the middle). I was opening it in the semi-dark (the kind of don't-wake-your-parents, only-the-CRT-monitor-is-glowing dark). It gave off a white-ish light as far as I can remember, very cool. In the next issue the magazine said that it wasn't intentional and they don't know how/why it happened, but we can consider it an added bonus :-)

Advanced MySQL features

0 comments

I usually think of MySQL as a simpler alternative to more feature-rich RDBMSs like Postgres. However, recently I listened to an interview with Brian Moon, the author of Phorum, which is the oldest PHP and MySQL based forum software. The interview was very cool and demonstrated that you can do a lot if you know your software. I would recommend that everyone who uses MySQL listen to the interview. The slides are available from Brian Moon's blog, or below, embedded via SlideShare for your convenience.

As a side-note: in his talk he uses yet another name for the dogpile effect: cache stampede. From the slides it seems that they are using the solution detailed in the first point of my post (separating out the update into a separate process and running it on a schedule).

Update: In the first presentation, instead of SQL_COUNT_FOUND_ROWS, the correct syntax is SQL_CALC_FOUND_ROWS.

An inspirational song

1 comments

I don't like baseball all that much (it isn't played very often on this side of the ocean), but I find the song by Kenny Rogers very inspiring. The story behind the song seems to be (according to Wikipedia) that a baseball player named Kenny Rogers (not to be confused with the country singer - the author of the song) had moderate success as a shortstop (I apologize for any potential misuse of technical terms) but after starting to play as a pitcher had a wildly successful career.

The takeaway message of the song (in my opinion, and at the same time part of my life philosophy) is that good things come to those who believe in themselves and persevere. If at first you don't succeed, use a bigger hammer :-)

You can download the MP3 version of the song from the site of the Bolton Baseball club (click on Song About Baz).

Lyrics:

Little boy, in a baseball hat,
Stands in a field, with his ball and bat,
says "I am the greatest, player of them all"
puts his bat on his shoulder, and tosses up his ball.

And the ball goes up, and the ball comes down,
he swings his bat all the way around,
and the worlds so still you can hear the sound
as the baseball falls, to the ground.

Now the little boy, doesn't say a word,
picks up his ball, he is undeterred,
Says "I am the greatest, there has ever been,"
and he grits his teeth, and tries it again.

And the ball goes up, and the ball comes down,
he swings his bat all the way around,
and the worlds so still you can hear the sound
as the baseball falls, to the ground.

He makes no excuses, He shows no fear,
He just closes his eyes, and listens to the cheers.

Little boy, he adjusts his hat
picks up his ball, stares at his bat,
says "I am the greatest, the game is on the line,"
and he gives his all, one last time.

And the ball goes up, like the moon so bright,
Swings his bat, with all his might,
and the worlds as still, as still as can be,
and the baseball falls, and that's strike three.

Now its supper time, and his Mama calls,
little boy starts home, with his bat and ball,
says "I am the greatest, that is a fact,
but even I didn't know, I could pitch like that."

Says, "I am the greatest, that is understood,
but even I didn't know, I could pitch that good."

A (not so new) technique for breaking databases

0 comments

There is a joke which goes something like: those who know how to do it, do it. Those who don't, teach it. Those who don't even know how to teach it, supervise it. Sadly this is true for many tech journalists, who make up sensationalized titles both because of lack of comprehension and because they have to sell their writing. Of course, people pitching topics to the journalists aren't all that innocent themselves.

One such example would be the New attack technique threatens databases piece from The Register. What this boils down to is a plain SQL injection attack, at a different level.

The summary of the paper (warning, pdf!) is: suppose someone, who should know better, writes the following stored procedure (because I don't know Oracle, it will be written in pseudo SQL, but you will get the point):

CREATE PROCEDURE test1(stuff DATE) RETURNS varchar AS
BEGIN
 query = "SELECT * FROM products WHERE arrival > '" || stuff || "'";
 EXECUTE query;
END;

The thought process (if there was any) behind it probably was along these lines: I know that constructing dynamic SQL queries is bad (both because I expose myself to SQL injection attacks and because syntax errors aren't verified during the creation of the procedure - given that query is just a string from the point of view of the parser), but I've put the value between quotes and I know that Oracle will validate the parameter before passing it to the procedure. As dates can't have quotes in them, I'm ok.

The problem is (as the paper describes) that by altering a session variable, you can define the format of a date for Oracle, making these types of procedures exploitable. Solution: don't create SQL queries using string concatenation, because it will bite you in the rear sooner or later.

As I mentioned earlier, I'm no Oracle guru (in fact I haven't used Oracle in my life), but being curious and all, I looked at how Postgres and MySQL would behave in a similar situation. Postgres behaves flawlessly: you can write queries which include the input variables directly, without the need to construct them as strings, which has the dual benefit of proper quoting and syntactical verification at procedure creation time. With MySQL you have to use at least 5.0.13 (not a big deal at all, given that you have to use at least version 5 if you want stored procedures anyway), from which version onwards you can take advantage of prepared statements inside stored procedures.

You must be ye high to play

0 comments

I would beg for my readers' indulgence, but here is another philosophical post. Recently I had the chance to try to teach somebody to make dynamic websites and I realized that you must know an awful lot to do this. Just to enumerate a couple of things:

  • (X)HTML
  • CSS - which is by no means an easy task by itself, but you must also know the different browser quirks
  • Javascript and DOM - again, this means knowing about (a) programming (b) dynamic languages (c) OOP (and not just any old plain OOP, but one of the more unusual kinds, prototype-based inheritance) and (d) (some) functional programming
  • Some server side programming language (PHP for example)
  • Databases and SQL
  • Webservers (like Apache) and at least some knowledge on how to configure them
  • HTTP (including headers, encoding, cookies and how they can be used to create the illusion of a session)

So there you have eight different technologies, including two different programming languages. And I didn't even include things like a graphics editor, Flash, Silverlight or knowing how licenses work (what you can and can't take, where you must give credit and so on).

To give all this a security angle: it is not hard to see that you must be somewhat of a fanatic to know all this, let alone know it at a level where you've wrapped your head around all the security implications (RFI, XSS, SQL injection, CSRF, ...). Another aspect of the problem is that clients don't necessarily know how to judge the quality of the work (other than it looks good and how much it costs), which makes it possible for people who don't know the security best practices to do this work in the first place. (You can hear a good discussion about a similar problem in the Scope Creep and Other Villains talk from the 2008 SXSW conference, where the presenter talks towards the end about how you must educate your clients about the difference between you and your competitors, to avoid turning it into a price war.)

With all these circumstances, it is very plausible that Mass Hacks Likely to Hang Around for a While. I wish I could say something encouraging, but this just has to be a depressing post I guess.

Thursday, May 15, 2008

All the perl documentation

0 comments

A quick note:

When I talked earlier about turning off warnings in Perl, I referenced the perldiag page. If you wish to see a list of all the perl... documentation available, you can look at the language reference at perldoc.perl.org (there is also a 5.8.8 version if you haven't upgraded yet, although the differences shouldn't be too big between the two). Most of them are a very informative read, although you can leave some of them out if you are not interested in doing special things (for example, if you don't wish to use C code from Perl, you can leave out the perlxs page).

Tuesday, May 13, 2008

Visualization techniques for networking data

0 comments
This is the HTML version of a paper I've written for school. Sorry for the poor formatting, but it was generated (semi-)automatically with Google Docs from an ODT document. You can download a nicer, PDF version of it here.

Introduction

Humans have a natural ability to correlate patterns from multiple sources to become aware of information. Most current human-computer interfaces, however, are limited to visual and, more exactly, text-based interaction. While it is true that current computer systems use a large number of symbolic elements and metaphors in the Graphical User Interface (like icons, windows, drag and drop, and so on), the primary means of information exchange is still the written (or, in the case of computers, typed) word.

This approach (while having the advantage of simplicity) is very limiting from an information-bandwidth (the rate at which information can be exchanged between the user and the computer) point of view. Because typing / reading has an upper limit (although estimates vary from 40 to 250 words per minute for typing and between 250 and 700 words per minute for reading, it is clear that such limits do exist and that most people are not able to cross certain barriers, no matter the amount of training), it is logical to conclude that we must look at alternative ways to consume more information quicker. From a technical point of view, the two methods which are the easiest to implement with current hardware are:

  • alternative (non-textual) representation of (aggregated) data, making use of our ability to observe trends and patterns

  • auditory feedback for certain events

Both of these methods can easily be implemented with almost all consumer-grade computing equipment. The fact that both of these are “output” (from the computer's point of view) methods can be justified by the fact that the rate at which output is produced (and needs to be consumed) is much larger than the rate of input we need to provide (two of the reasons being that a great deal of output is produced by automated systems, and also that in today's interconnected world we consume the sum of many outputs). In the following pages I will focus on the first method, and more specifically on the way in which it applies to networking data.

Each part (after the introductory part about data visualization) will present one way to visualize data and discuss the applicability of the given method in the context of computer networking.

Data Visualization

Data visualization is defined as: “any technique for creating images, diagrams, or animations to communicate a message” [i]. In general data visualization is used as a way to aggregate large quantities of data and present them in a way which [ii]:

  • allows us to quickly communicate rich messages (communication)

  • allows for discovering new, previously unknown facts and relationships (discovery)

  • allows us to get better insight into things we already know (insight)


This list is usually augmented with a fourth element: it provides an aesthetically pleasing way to present information. While this is an important topic under certain circumstances (for example in newspapers trying to popularize a certain fact), it is not a common requirement in the field of networking / security, thus only one example will be presented:


Illustration 1: A visual representation of a sample from the PWS-Lineage family by Alex Dragulescu





The image above is one of the many [iii] pictures from Alex Dragulescu based on malware samples. From the author's site:

“Malwarez is a series of visualization of worms, viruses, trojans and spyware code. For each piece of disassembled code, API calls, memory addresses and subroutines are tracked and analyzed. Their frequency, density and grouping are mapped to the inputs of an algorithm that grows a virtual 3D entity. Therefore the patterns and rhythms found in the data drive the configuration of the artificial organism. “

While they provide no additional information about the code or its workings, they are aesthetically very pleasing. Aesthetics and usefulness do not necessarily exclude each other, as can be seen from the picture below. However, this article focuses on the automatic visualization of automatically gathered data, and while some aesthetic qualities can be provided (foreground-background colors, for example) either manually or programmatically, creative visualizations such as the one below cannot be automatically generated (nor would we want to, because one of the purposes of visualization is to help answer queries by displaying different parts of the data, a process in which uniformity is key for the observer to be able to detect patterns and infer information).


Illustration 2: Charles Minard's information graphic of Napoleon's march




Finally, before getting to the different types of visualization methods, I would like to mention two criteria [iv] which are considered essential in creating visual representations of data:

  1. Maximize the data to ink ratio

  2. First, tell the truth and second, do it efficiently, with clarity, precision...

While these are essential requirements for any visualization, computer-generated charts are less likely to have either of the two pitfalls. This is because computer graphs are by definition minimalistic (mostly because it is very hard to define rules which could produce complex, yet aesthetically pleasing graphics from arbitrary sets of data) and because (with regards to the second point) visualization programs only concern themselves with presenting the data, not with how “good data” should look (which is why dubious tactics like not presenting the whole range, using non-linear scales, filtering the data and so on get used - by humans, who have a clear idea of how “good” data should look, what it should represent or what the point is that they are trying to make). Another reason which makes such problems unlikely is that visualization in this context is (or will be) used for decision making, and any software which produces misleading visualizations, which in turn translate into wrong decisions, won't get much usage.

The only problem with computer visualizations is that they can be too coarse or too detailed, which could lead to the wrong conclusions (see the discussion about bar charts / histograms later on for an example). This happens because, generally speaking, the computer cannot guess the level of detail at which the user needs the data. The problem can be resolved by giving the user the possibility to quickly change the level of detail for a given chart she is looking at, or, even better, if the program has specialized knowledge about the intended use of the data (for example, that it will be used to generate blocklists of /24 subnets for firewalls), by setting the level of detail appropriate for this scope (groups of /24 subnets in our example). Or, to maximize both flexibility and usability, make it possible for the user to change the level of detail, but set the default accordingly.

Methods for visualizing data

Line chart

A line chart is a two-dimensional scatterplot of ordered observations where the observations are connected following their order [v]. It can be used to represent both two-dimensional and one-dimensional data (in the latter case the second dimension defaults to the measurement unit of the data - time, length, etc). The data points can be connected or unconnected. Unconnected usually means that the measured data is discrete, while connected by convention means continuous, although sometimes the connecting line is drawn for aesthetic reasons or to facilitate the drawing of conclusions (observing trends, for example). Drawing the points on the continuous line by convention means that the represented quantity is continuous, but it was only measured at the marked points.

An important issue is also the formula used to connect the points (where applicable). This must be based on the knowledge we have about the nature of the measured quantity (does it change linearly? logarithmically? exponentially? if so, what is the base of the exponential function? and so on). Two common functions used for interpolation are the linear function (because of its simplicity) and splines [vi].

Another common practice is to color the area between the line and the X axis to make the chart easier to read, especially when multiple data series are represented at the same time.


Illustration 3: Network appliance IOPs per protocol generated with rrdtool (http://oss.oetiker.ch/rrdtool/gallery/index.en.html)




Miniature versions of line charts embedded in the text to better illustrate a trend or pattern are called sparklines [vii]. They don't include legends, axes or numeric values; their sole purpose is to illustrate a trend or pattern, without giving exact numeric details.


Illustration 4: Sparkline example, from Wikipedia (http://en.wikipedia.org/wiki/Sparkline at 02-05-2008)




Scatterplot

The scatterplot is the graphical representation of two quantitative variables. When one of the variables is controlled by the experimenter (for example, linearly increasing time), it is customarily drawn on the horizontal axis and the result is a line chart. However, when both variables are a result of the observation, the graph shows correlation (not to be confused with causality) between the variables. In this case it is possible for the observer to discover correlation patterns between the variables (like positive or negative linear correlation).


Illustration 5: Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. From Wikipedia - http://en.wikipedia.org/wiki/Scatterplot - at 02-05-2008






A variation of this technique was used by [Teoh et al., 2002] [viii] to visualize BGP routing table changes. This differs from the classical scatterplot in two ways: first, all four edges of the bounding rectangle have an explicit meaning assigned to them (each represents a distinct group of AS numbers). Second, the rectangle itself is partitioned in a special way (using quadtrees) to encode the different IP address prefixes. This means that the relation between a point and the axes cannot be determined by a simple orthogonal projection of the point (because we are not using a Cartesian space), thus it must be drawn explicitly with a color-coded line. The paper found that this type of representation, while not very intuitive in the static case, is very helpful in an animated setting (where data from one day is mapped to one frame of the animation), because it creates patterns which can easily be categorized by a human observer as normal or anomalous. (Remark: there are alternatives to quadtrees for representing the complete IPv4 space. See the heat maps section for a discussion. Also, another service using quadtrees to represent networking data is IPMaps [ix], which can be used to browse BGP advertisements and darknet data with a Google-Maps style interface.)


Illustration 6: Quadtree coding of IP prefixes. Left: Top levels of the tree, and the most significant bits of the IP prefixes represented by each sub-tree (sub-square). 4 lines representing AS numbers surround the square representing the IP prefix space. Right: Actual data. A line is drawn for every IP-AS pair in an OASC.


Another example of using scatterplots to visualize networking data (this time using a Cartesian coordinate system) can be seen in the user interface of the NVisionIP project [x]:


Illustration 7: "Galaxy View" of the NVisionIP project, showing an overview of the activity (color coded per port) on all the subnets and machines





Yet another version of the scatterplot was used during the SC03 conference held in Phoenix, Arizona between November 15th, 2003 and November 21st, 2003. Dubbed “The Spinning Cube of Potential Doom”, it was a three-dimensional cube which represented the source IP, the destination IP, as well as the destination port. A point is drawn for each connection, defined by its (source address, destination address, destination port) tuple. The cube is also animated over time. While the primary goal of the software was education [xi], it has been observed that several types of port scanning produce very distinct and remarkable patterns. Even so, the practical use of the software is limited, because these patterns can be observed only from specific angles, which vary depending on the (source, destination, port) tuple. This means that an analyst would have to constantly rotate the cube and observe it from different angles to ensure that she gets all the information. An alternative solution would be to project three views of the cube (one for each plane) in parallel for the analyst to observe. Even with these limitations, it is an interesting concept.


Illustration 8: The Spinning Cube of (Potential) Doom




Bar chart

Bar charts are used to represent the relative frequency (or size) of several types of data. One axis holds the unit of measurement while the other the list of groups. It is also possible to add a third dimension to the data by presenting different types of measurements for each group (from different time-frames, for example), to represent both the relative frequency between groups and the change over time inside one group. One must be careful, however, because the method doesn't scale to a large number of categories in the “third dimension” and quickly becomes cluttered. In this case an alternative would be to redraw the data as a line chart containing multiple lines (one for each group of data) and placing the categories on the horizontal axis.


Illustration 9: Part of the NVisionIP interface showing the traffic distribution by ports with bar charts
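
As a small sketch of the “third dimension” mentioned above, the following Python snippet draws per-port traffic measured in two time windows as grouped bars (the port names and numbers are hypothetical, and matplotlib is just one possible choice of plotting library):

import matplotlib.pyplot as plt

ports = ['80', '443', '25', '53']
today = [120, 300, 15, 60]       # flows per port, current day (made-up numbers)
yesterday = [100, 280, 40, 55]   # flows per port, previous day

x = range(len(ports))
width = 0.4
plt.bar([i - width / 2 for i in x], yesterday, width, label='yesterday')
plt.bar([i + width / 2 for i in x], today, width, label='today')
plt.xticks(list(x), ports)
plt.xlabel('destination port')
plt.ylabel('flows')
plt.legend()
plt.show()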






Histograms are a special type of bar chart where the area (not the height!) of the bars denotes the frequency of a given group. This means that bars can have variable widths, although in practice this doesn't happen often (and thus it is not widely known), because the horizontal scale usually consists of equal-width parts (so the frequency is represented by the height alone, as in classical bar charts). An important aspect of drawing histograms is choosing the number of bins (the horizontal categories the values get grouped into). There have been attempts to define the number of bins with a formulaxii, however these make strong assumptions about the type of data being represented and are not suited to every dataset. This is the Achilles' heel of histograms, because a wrong choice of bin count can lead the observer to draw the wrong conclusion from the data. As an example, below are four renderings of the same dataset with varying bin counts (histograms taken from CMSC 120 – Visualizing Information):









Suppose that we are interested in determining the interval which contains the maximum number of elements. While the first three histograms progressively improve the accuracy of this estimate, the last one contains too much noise for us to be able to make the determination.
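
A quick way to reproduce this effect is to draw the same (synthetic) dataset with several bin counts; the snippet below also prints Sturges' rule, one of the formulas alluded to above, which assumes roughly normally distributed data. The dataset here is invented purely for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# hypothetical data: a mixture of two traffic "populations"
data = np.concatenate([rng.normal(20, 3, 500), rng.normal(40, 5, 300)])

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, bins in zip(axes, [4, 10, 25, 200]):
    ax.hist(data, bins=bins)
    ax.set_title('%d bins' % bins)
plt.show()

# Sturges' rule: k = ceil(log2(n)) + 1
print(int(np.ceil(np.log2(len(data)))) + 1)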

Box plot

A box plot (also known as a box-and-whisker plot) is a method for representing data together with its statistically significant properties. The properties included in the drawing are:

  1. Minimum – the lowest value in the dataset

  2. Lower quartile – cuts off at ¼ of the data

  3. Median – the middle value

  4. Upper quartile – cuts off at ¾ of the data

  5. Maximum – the highest value in the dataset



Illustration 10: Example for box plot taken from CMSC 120: Visualizing Information







In addition to these rigorously defined values, two other values are considered: the largest and smallest non-outliers. An outlier is a value which is numerically distant from the rest of the data. There is no unique method to determine whether a value is an outlier, so the criterion can vary from case to case. Outliers are depicted as individual dots in the plot.
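
The values behind a box plot are easy to compute. The sketch below uses the common (but, as noted above, not universal) 1.5 × IQR rule to flag outliers; the input numbers are made up for illustration.

import numpy as np

values = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 25], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < low_fence) | (values > high_fence)]
non_outliers = values[(values >= low_fence) & (values <= high_fence)]

print('min / Q1 / median / Q3 / max:',
      values.min(), q1, median, q3, values.max())
print('whiskers (smallest/largest non-outlier):',
      non_outliers.min(), non_outliers.max())
print('outliers:', outliers)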

While the box plot presents a great deal of information in a compact manner, its interpretation (as well as its construction) requires at least some knowledge of statistics. This limits its usability, and as of now it isn't present in any network-oriented data visualization tool that I know of.

Pie chart

A pie chart (or circle graph) is a circular chart divided into sectors, illustrating relative magnitudes, frequencies or percentages. In a pie chart, the arc length of each sector (and consequently its central angle and area) is proportional to the quantity it representsxiii. A pie chart is used when the sum of the individual elements is a meaningful value (for example 100%). There are three variants of the pie chart: the exploded pie chart, the perspective pie chart and the polar area diagram, in which the central angle is constant and the radius of each element varies with the data.


Illustration 11: Example for a three dimensional exploded pie chart - Wikipedia




Pie charts are not a good choice for representing networking data for several reasons. First, research has shown that comparison by angle is far less accurate than comparison by length; the only exception is when the comparison is made between one slice and the whole. Second, pie charts by their nature present a bounded view of the data (for example: the traffic divided by protocol for one day) and thus are not useful when a timeline of events has to be studied (as is often the case in networking).

Bubble chart

Bubble charts are a way to represent multi-dimensional (more than two dimensions) data. The basic idea is that two of the dimensions are placed along the axes: the center of each circle is placed at the horizontal and vertical coordinates of the data point, while the radius of the circle represents the third dimension (this assumes that the third dimension is quantitative). Additional dimensions can be added via the color, shading, border width and other visual attributes of the circles, however this method doesn't scale to more than 4-5 dimensions. Conversely, the number of dimensions can be reduced by eliminating one or both of the axes.


Illustration 12: A bubble chart, taken from "Visualizing the Signatures of Social Roles in Online Discussion Groups" by Howard T. Welser
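
A minimal bubble-chart sketch in Python could look like the following: two dimensions on the axes, a third encoded in the circle area, a fourth in the color. The per-host traffic figures are entirely hypothetical.

import matplotlib.pyplot as plt

hosts = {
    # host: (packets sent, packets received, bytes transferred, distinct peers)
    '10.0.0.1': (120, 300, 40000, 5),
    '10.0.0.2': (900, 850, 600000, 42),
    '10.0.0.3': (30, 25, 2000, 2),
}

x = [v[0] for v in hosts.values()]
y = [v[1] for v in hosts.values()]
size = [v[2] / 1000 for v in hosts.values()]   # scale bytes to a usable marker area
color = [v[3] for v in hosts.values()]

plt.scatter(x, y, s=size, c=color, alpha=0.6)
plt.xlabel('packets sent')
plt.ylabel('packets received')
plt.colorbar(label='distinct peers')
plt.show()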






These types of charts are not yet used in network monitoring systems, although they have the potential to be used successfully, because they can map multi-dimensional time-series data into a compact visual form. Possible drawbacks are the lack of software support and people's unfamiliarity with this type of visual representation.

Waterfall chart

A waterfall chart is a special type of floating-column chart. A typical waterfall chart shows how an initial value is increased and decreased by a series of intermediate values, leading to a final value. An invisible column keeps the increases and decreases linked to the heights of the previous columnsxiv.

Illustration 13: Example waterfall chart taken from Wikipedia
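
The “invisible column” trick is simple to reproduce: each visible bar is drawn on top of a transparent base equal to the running total so far. The sketch below uses invented traffic deltas purely to illustrate the mechanics.

import matplotlib.pyplot as plt

labels = ['start', 'HTTP +', 'SMTP +', 'P2P -', 'end']
deltas = [100, 30, 20, -40]            # hypothetical traffic changes

bases, running = [], 0
for d in deltas:
    # for a decrease, the visible bar hangs down from the previous total
    bases.append(running if d >= 0 else running + d)
    running += d

heights = [abs(d) for d in deltas]
plt.bar(range(len(deltas)), heights, bottom=bases)   # the visible parts
plt.bar(len(deltas), running)                        # the final value
plt.xticks(range(len(labels)), labels)
plt.show()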











While these types of charts could be useful for representing, for example, the increase/decrease in network traffic, current software uses bar charts of delta values for this purpose.

Heat maps

A heat map is a graphical representation of data where the values taken by a variable over a two-dimensional map are represented as colorsxv. An essential requirement for heat maps to be useful is that the parameters along the horizontal and vertical axes must be placed such that “closeness” on the map corresponds to “closeness” in the conceptual space of the problem. This is needed to make it possible to translate patterns from the visual space to the problem space. In the case of IPv4 there are two methods which have been used to achieve this: one is the quadtree encoding mentioned in the scatter plot section. The other is based on the Hilbert curvexvi depicted below:


Illustration 14: Hilbert curve, first to third order, from Wikipedia






This curve has the property that it tends to fill the entire two-dimensional space evenly, and if we place an ordered set of elements (like IPv4 addresses) along it, numerically close elements will also be placed close together (in the Euclidean-distance sense) in the representation. Below are two examples using this technique to provide a heat map of the entire IPv4 range. The first one is based on the “malicious activity” observed by Team Cymruxvii and the second one represents the size of the networks advertised via BGPxviii.


Illustration 15: Team Cymru Internet Malicious Activity Map at 02-05-2008





Illustration 16: The Measurement Factory - IPv4 address range assignment based on the BGP routes advertised at 02-05-2008
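
For reference, here is a small sketch of how an IPv4 address can be mapped onto a Hilbert curve, adapted from the well-known distance-to-coordinates pseudocode; with order 4 every /8 network becomes one cell of a 16x16 grid. This is my own illustration of the idea, not the code used by either of the maps above.

def hilbert_d2xy(order, d):
    """Convert distance d along a Hilbert curve of the given order into
    (x, y) coordinates on a 2**order by 2**order grid."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            # rotate the quadrant so the curve stays continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

def ip_to_cell(ip, order=4):
    """Map an IPv4 address to a Hilbert-curve cell, keyed on its /8 prefix."""
    first_octet = int(ip.split('.')[0])
    return hilbert_d2xy(order, first_octet)

print(ip_to_cell('10.0.0.0'), ip_to_cell('11.0.0.0'))  # adjacent /8s land next to each other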





Heat maps need not be rectangular. Any arrangement which maps well onto the contextual problem space is useful. For example, the Akamai Content Delivery Network provides a heat mapxix representing the traffic delivered to each individual country, using a world map as its base:


Illustration 17: Akamai - overview of the delivered traffic at 02-05-2008




An interesting aspect of the above interface is that the way it shows details is inspired by the “Toolglass and Magic Lenses”xx paper (although it does not include the advanced interaction modes or the ambidextrous method of control described therein).

Graphs

Graphs are a natural way to represent networks or networking-related information, because they map nicely onto the paradigm of “a group of interconnected things” (network elements in the case of networks, nodes in the case of graphs). One of the best-known packages using this paradigm is the “Network Weathermap”xxi, which uses a background (usually a geographical map), points (corresponding to routers, switches, media converters and other networking equipment) and connections between these points (representing the physical links between the elements). The connections are drawn as two opposing arrows (one for upstream and one for downstream traffic) and are color-coded (from white to red) to show the link usage percentage:


Illustration 18: Example Network Weathermap for the French ISP "Free" at 02-05-2008





Another frequent way of using graphs is to take the unique addresses of networking elements (for example MAC addresses for layer 2 or IP addresses for layer 3) as nodes and use the connections between them to represent “flows” (which can be defined, depending on the protocol, as simply as “all the packets having a given source and destination” or as complex as a TCP flow with handshake, sequence numbers and so on). A tool using this approach is EtherApexxii, a screenshot of which can be seen below:


Illustration 19: EtherApe




The nodes are active network elements (routers, NICs, ...) represented by their addresses. The links between them represent the traffic, color-coded by protocol. The size of a node is proportional to the amount of traffic it generates.

While the previous two examples arranged the nodes using external information (geography, or an arbitrarily chosen shape such as the ellipse in the case of EtherApe), a common practice is to let them “auto-arrange” by defining an attracting-repelling force between them (similar to a spring) and letting the layout settle in a way which minimizes these tensions. If the forces are well defined, the result is usually both aesthetically pleasing and useful. As an example, below is a color-coded “map of the Internet” from 1998, where distances are based on the hop count reported by traceroute from Bell Labs and the color coding marks equipment belonging to the same organization (based on the first two parts of the DNS name retrieved by a reverse lookup):


Illustration 20: Map of the Internet from "Mapping and Visualizing the Internet" by Bill Cheswick, Hal Burch and Steve Branigan
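
A minimal force-directed (“spring”) layout can be obtained with off-the-shelf libraries; the sketch below uses networkx (my choice for the example, not something used by the projects above), where nodes repel each other, edges act as springs and the layout iterates until the tensions settle. The hosts and links are made up.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([
    ('gw', 'host1'), ('gw', 'host2'), ('gw', 'host3'),
    ('gw', 'isp'), ('isp', 'example.com'), ('isp', 'example.org'),
])

pos = nx.spring_layout(G, k=0.5, iterations=100)  # the spring model itself
nx.draw(G, pos, with_labels=True, node_size=600)
plt.show()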




Parallel coordinates

Parallel coordinates are a way to visualize high-dimensional data. Their most important property is that they can be easily extended to an arbitrary number of dimensions. To represent a point in an n-dimensional space, n equally spaced parallel axes are drawn, corresponding to the dimensions. The point is represented as a polyline connecting axes 1 through n; the point of contact between the polyline and the i-th axis is determined by the numeric value of the i-th coordinate of the point. This representation is a natural fit for networking data, which has several dimensions. Below you can see a screenshot from the rumint toolxxiii depicting a traffic capture along four coordinates (source IP, source port, destination IP and destination port):


Illustration 21: the rumint tool showing a network trace - taken from the DEFCON 12 presentation "Network Attack Visualization" by Greg Conti






An additional positive aspect of this representation is that many interesting network-level activities (like an external host scanning the network, or vice versa) map to very distinct, easy to recognize and logical patterns, even for an observer who doesn't know about parallel coordinates.
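
The construction described above is easy to hand-roll: each (source IP, source port, destination IP, destination port) tuple becomes one polyline across four axes, with every axis normalized to the same range. In the sketch below the traffic is made up; a port scan (one source hitting many ports on one host) shows up as the expected fan pattern.

import matplotlib.pyplot as plt

def ip_to_int(ip):
    a, b, c, d = (int(p) for p in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

# hypothetical capture: one scanner plus some normal-looking traffic
flows = [('192.168.1.5', 44321, '10.0.0.9', port) for port in range(20, 120)]
flows += [('192.168.1.7', 51000, '10.0.0.3', 443),
          ('192.168.1.8', 52000, '10.0.0.4', 80)]

dims = list(zip(*[(ip_to_int(s), sp, ip_to_int(d), dp) for s, sp, d, dp in flows]))

def normalize(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo or 1) for v in column]

norm = [normalize(col) for col in dims]
for i in range(len(flows)):                       # one polyline per flow
    plt.plot(range(4), [norm[axis][i] for axis in range(4)], alpha=0.4)
plt.xticks(range(4), ['src IP', 'src port', 'dst IP', 'dst port'])
plt.show()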

A similar layout has been proposed by the book Visualizing Dataxxiv as an alternative to scatterplots when investigating the correlation between two datasets. In the book's example the two datasets are positioned on two vertical axes and the connecting line is colored blue or red depending on its angle (a downward or upward angle corresponding to a positive or a negative correlation, respectively). The width of the line is also proportional to the angle.

Finally, a simple version of this method, with just two vertical axes representing the source and destination IP space, is used in other monitoring tools, for example tnvxxv.


Illustration 22: The two-axis version of parallel coordinates used in tnv to represent the source and destination IP addresses as well as the source and destination ports




Grid layouts

We've seen two methods of mapping IP addresses to grid layouts. If the number of IP addresses is small (for example an internal network, as opposed to all possible IP addresses, or the subset of IPs that internal hosts communicate with), it is possible to arrange them in a grid where individual cells have an acceptable size (for example 8 by 8 pixels). This means that individual IP addresses can be interacted with (clicked on, hovered over) to filter the view or to display additional information.

The first example is from a tool called VISUALxxvi. The central idea of this tool is that the system administrator doing the monitoring is inherently both more familiar with the internal network and more interested in it, so the internal network is depicted as a large(ish) grid. External hosts involved in the communication are presented to the right. Their size is proportional to the amount of data flowing to/from them, and the small vertical lines depict the TCP/UDP ports used at their end of the communication. The external hosts are arranged based on their IP address: the first two bytes are used to calculate the vertical coordinate, while the last two bytes are used for the horizontal coordinate (these coordinates are then scaled to the available screen real estate). This means that external hosts are naturally grouped by class C network. The lines, of course, represent communication between internal and external hosts.


Illustration 23: VISUAL, Home-Centric Visualization of Network Traffic for Security Administration
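
The coordinate mapping described above is straightforward; here is a small sketch of it (the screen dimensions and addresses are assumptions made for the example, not values from the paper):

def external_host_position(ip, width=800, height=600):
    a, b, c, d = (int(part) for part in ip.split('.'))
    vertical = ((a << 8) | b) / 65535.0      # first two bytes of the address
    horizontal = ((c << 8) | d) / 65535.0    # last two bytes of the address
    return horizontal * width, vertical * height

# hosts in the same class C network end up at nearly the same position
print(external_host_position('203.0.113.10'))
print(external_host_position('203.0.113.20'))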





Another tool with similar ideas is PGVis and its three-dimensional brother PGVis3D. These tools don't seem to be public yet, so all the information was collected from a presentation entitled “Visualization for Network Traffic Monitoring & Security” by Erwan Le Malécot. The idea is that both the external and the internal IP space can be mapped to a usable grid size. This is possible because the internal IP space is small and because from the external IP space we consider only the addresses we have seen communication with. There are four distinct grids: internal source, internal destination, external source and external destination. The lines are color-coded based on the traffic type.


Illustration 24: Example PGVis screen




The author of PGVis also experimented with a three-dimensional representation, but this interface has the same problem as other 3D systems: it requires constant adjustment of the viewpoint.

Treemaps

Treemaps are a way to represent nested structures (trees) while giving information about their size and parent-child relationships at the same time. They were originally developed to help discover large files and directories on a drivexxvii, but were later applied in many other fields. The original concept (and still most of the current implementations) used rectangular shapes, but alternative solutions have appeared, like circular treemapsxxviii.


Illustration 25: Example treemap showing the space distribution from a harddisk - from Wikipedia
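
To show the mechanics, here is a minimal “slice and dice” treemap sketch: rectangles are split alternately along the horizontal and vertical axis in proportion to their size. This is one of the simplest layout algorithms, not the squarified layout most modern tools use, and the directory tree is invented for the example.

def size_of(node):
    payload = node[1]
    if isinstance(payload, (int, float)):
        return payload
    return sum(size_of(child) for child in payload)

def treemap(node, x, y, w, h, horizontal=True, out=None):
    """node: (name, size) for a leaf, or (name, [children]) for a subtree."""
    if out is None:
        out = []
    name, payload = node
    if isinstance(payload, (int, float)):
        out.append((name, x, y, w, h))
        return out
    total = sum(size_of(child) for child in payload)
    offset = 0.0
    for child in payload:
        share = size_of(child) / total
        if horizontal:
            treemap(child, x + offset * w, y, w * share, h, False, out)
        else:
            treemap(child, x, y + offset * h, w, h * share, True, out)
        offset += share
    return out

# hypothetical disk layout: directories with file sizes in MB
tree = ('/', [('home', [('alice', 120), ('bob', 40)]),
              ('var', [('log', 30), ('cache', 10)])])
for name, x, y, w, h in treemap(tree, 0, 0, 1.0, 1.0):
    print('%-6s x=%.2f y=%.2f w=%.2f h=%.2f' % (name, x, y, w, h))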





In the context of network data visualization, the DBVIS research projectxxix used treemaps to map out the IP space either by “owner” (based on AS numbers and BGP routing data, like the other “Internet mapping” projects presented earlier), by country (the size of the IP blocks assigned to a given continent/country) or by the traffic to/from each AS.


Illustration 26: Example treemap from the DBVIS project mapping out the connections to the University of Konstanz on Nov. 29, 2006 aggregated on autonomous systems.













Conclusion

Data visualization for networking data is an emerging field which should see rapid uptake in the coming years, as the volume of data grows and the pressure to make correct, fast and informed decisions increases.

While “classical” representations (like bar charts, pie charts, etc.) don't provide enough detail on their own to be useful in the multi-dimensional world of networking data, they remain useful for particular sub-parts of it. For the “big picture” overview, several methods have been devised, some of which scale to the level of the entire Internet. Combining the two is probably the future of user interfaces in network-data analysis systems.





iWikipedia – Visualization (computer graphics) – http://en.wikipedia.org/wiki/Visualization_(graphic) at 01-05-2008

iiCMSC 120: Introduction to Computing: Visualizing information – http://mainline.brynmawr.edu/Courses/cs120/spring2008/

iiiAlex Dragulescu – Malwarez – http://www.sq.ro/malwarez.php

ivEdward Tufte – Envisioning Information (1990)

vWikipedia – Chart – http://en.wikipedia.org/wiki/Chart at 02-05-2008

viWolfram Mathworld – http://mathworld.wolfram.com/Spline.html at 02-05-2008

viiEdward Tufte – Sparklines: theory and practice – http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR&topic_id=1 at 02-05-2008

viiiCase Study: Interactive Visualization for Internet Security – Soon Tee Teoh, Kwan-Liu Ma, S. Felix Wu 2002

ixIPMaps – http://monkey.org/~phy/ipmaps/ at 02-05-2008

xNVisionIP: An Animated State Analysis Tool for Visualizing NetFlows – Ratna Bearavol, Kiran Lakkaraju, William Yurcik

xiThe Spinning Cube of Potential Doom – Stephen Lau – http://www.nersc.gov/nusers/security/TheSpinningCube.php at 02-05-2008
“The primary goal was education. There has always been misinformation and hearsay regarding computer security and it is sometimes difficult for those unfamiliar with computer security to conceptualize the overall extent of malicious traffic on the Internet today.”

xiiWikipedia – Number of bins and width – http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width at 02-05-2008

xiiiWikipedia – Pie chart - http://en.wikipedia.org/wiki/Pie_chart at 02-05-2008

xivWikipedia – Waterfall chart – http://en.wikipedia.org/wiki/Waterfall_chart at 02-05-2008

xvWikipedia – Heat map – http://en.wikipedia.org/wiki/Heatmap at 02-05-2008

xviWolfram Mathworld – http://mathworld.wolfram.com/HilbertCurve.html at 02-05-2008

xviiTeam Cymru – Internet Malicious Activity Map – http://www.team-cymru.org/?cont=11 at 02-05-2008

xviiiThe Measurement Factory – IPv4 Heatmaps – http://maps.measurement-factory.com/ at 02-05-2008

xixAkamai – Visualizing the Internet – http://www.akamai.com/visualize at 02-05-2008

xxToolglass and Magic Lenses: The See-Through Interface – Eric A. Bier, Maureen C. Stone, Ken Pier, William Buxton, Tony D. DeRose – 1993

xxiNetwork Weathermap – http://netmon.grnet.gr/weathermap/ at 02-05-2008

xxiiEtherApe – a graphical network monitor – http://etherape.sourceforge.net/ at 02-05-2008

xxiiirumint (room-int) – “an open source network and security visualization tool” – http://www.rumint.org/ at 02-05-2008

xxivVisualizing Data – Exploring and Explaining Data with the Processing Environment by Ben Fry, p 123

xxvtnv (The Network Visualizer or Time-based Network Visualizer) – http://tnv.sourceforge.net/ at 02-05-2008

xxviHome-Centric Visualization of Network Traffic for Security Administration – Robert Ball, Glenn A. Fink, Anand Rathi, Sumit Shah and Chris North

xxviiTreemaps for space-constrained visualization of hierarchies – Ben Shneiderman http://www.cs.umd.edu/hcil/treemap-history/ at 02-05-2008

xxviiipebbles – using Circular Treemaps to visualize disk usage – http://lip.sourceforge.net/ctreemap.html at 02-05-2008

xxixInteractive Exploration of Data Traffic with Hierarchical Network Maps and Radial Traffic Analyzer – http://infovis.uni-konstanz.de/index.php?region=research&reg2=hnmap at 02-05-2008