
Monday, December 31, 2012

What every programmer should know about X


Piggybacking on some memes floating around on the internet, I would like to publish my own list of "what every programmer should know".

A couple of introductory words: in my opinion, the two most important things for new programmers to learn are terminology (knowing what things / ideas / algorithms / concepts are called, so that they can search for them on the internet and discuss their ideas) and humility (if something doesn't exist or doesn't work the way we expected, the first thing we should ask ourselves is "what am I missing?" instead of proclaiming our predecessors to be idiots). Moving along to the list:

Happy holiday reading/watching to all!

Friday, December 21, 2012

Ensuring the order of execution for tasks


This post was originally published as part of the Java Advent series. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on! Want to write for the Java Advent blog? We are looking for contributors to fill all 24 slots and would love to have your contribution! Contact Attila Balazs to contribute!

Sometimes it is necessary to impose a certain order on the tasks in a threadpool. Issue 206 of the JavaSpecialists newsletter presents one such case: we have multiple connections from which we read using NIO. We need to ensure that events from a given connection are executed in order, but events from different connections can be freely mixed.

I would like to present a similar but slightly different situation: we have N clients. We would like to execute events from a given client in the order they were submitted, but events from different clients can be mixed freely. Also, from time to time, there are "rollup" tasks which involve more than one client. Such tasks should block the tasks for all involved clients (but not more!). Let's see a diagram of the situation:

As you can see, tasks from client A and client B are happily processed in parallel until a "rollup" task comes along. At that point no more tasks of type A or B can be processed, but an unrelated task C can be executed (provided that there are enough threads). The skeleton of such an executor is available in my repository. The centerpiece is the following interface:

public interface OrderedTask extends Runnable {
    boolean isCompatible(OrderedTask that);
}

Using this interface the threadpool decides whether two tasks may be run in parallel or not (A and B can be run in parallel if A.isCompatible(B) && B.isCompatible(A)). These methods should be implemented in a fast, non-locking and time-invariant manner.
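To make this concrete, here is one possible implementation of OrderedTask (a hypothetical sketch of my own, not part of the repository): tasks are tagged with the client IDs they touch, and two tasks are compatible exactly when their client sets are disjoint.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// The interface from the post, repeated to keep the sketch self-contained
interface OrderedTask extends Runnable {
    boolean isCompatible(OrderedTask that);
}

// Sketch: a task bound to one or more client IDs; two tasks may run in
// parallel iff their client sets are disjoint. A "rollup" task simply
// lists several clients and thus conflicts with all their pending tasks.
class ClientTask implements OrderedTask {
    private final Set<String> clients;
    private final Runnable work;

    ClientTask(Runnable work, String... clientIds) {
        this.work = work;
        Set<String> s = new HashSet<>();
        Collections.addAll(s, clientIds);
        this.clients = Collections.unmodifiableSet(s);
    }

    @Override
    public boolean isCompatible(OrderedTask that) {
        if (!(that instanceof ClientTask)) {
            return false; // unknown task type: be conservative
        }
        // fast, non-locking, time-invariant: the sets are immutable
        return Collections.disjoint(clients, ((ClientTask) that).clients);
    }

    @Override
    public void run() {
        work.run();
    }
}
```

A "rollup" task is then simply a ClientTask constructed with several client IDs.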

The algorithm behind this threadpool is as follows:

  • If the task to be added doesn't conflict with any existing tasks, add it to the thread with the fewest elements.
  • If it conflicts with elements from exactly one thread, schedule it to be executed on that thread (implicitly after the conflicting elements, which ensures that the order of submission is maintained)
  • If it conflicts with elements from multiple threads, add barrier tasks (shown in red below) to all but the first of those threads; a task on the first thread waits for all of these barriers and only then executes the original task.
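The conflict-detection step of the algorithm above can be sketched as follows (a simplification of my own, not the actual code from the repository; the OrderedTask interface is repeated to keep the sketch self-contained):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of the placement rule only (not the full executor): given the
// pending queue of each worker thread, find which threads the new task
// conflicts with.
class Placement {
    interface OrderedTask extends Runnable {
        boolean isCompatible(OrderedTask that);
    }

    // Returns the indices of the worker queues containing a conflicting task.
    static List<Integer> conflictingThreads(List<Queue<OrderedTask>> queues,
                                            OrderedTask task) {
        List<Integer> conflicts = new ArrayList<>();
        for (int i = 0; i < queues.size(); i++) {
            for (OrderedTask pending : queues.get(i)) {
                if (!pending.isCompatible(task) || !task.isCompatible(pending)) {
                    conflicts.add(i);
                    break;
                }
            }
        }
        return conflicts;
    }
    // Callers then apply the rules above: 0 conflicts -> shortest queue,
    // 1 conflict -> that queue, >1 -> barrier tasks on all but the first.

    // Tiny tag-based task, for demonstration only: two tasks conflict
    // exactly when they carry the same tag.
    static OrderedTask tagged(String tag) {
        return new OrderedTask() {
            public boolean isCompatible(OrderedTask that) {
                return !that.toString().equals(tag);
            }
            public void run() {}
            public String toString() { return tag; }
        };
    }
}
```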

More information about the implementation:

  • The code is only a proof-of-concept; more work would be needed to make it production quality (exception handling in tasks, proper shutdown, etc.)
  • For maximum performance it uses lock-free* structures where available: each worker thread has an associated ConcurrentLinkedQueue. To achieve the sleep-until-work-is-available semantics, an additional Semaphore is used**
  • To be able to compare a new OrderedTask with the currently executing ones, a copy of their references is kept. This list of copies is updated whenever new elements are enqueued (this has the potential for memory leaks, and if tasks are infrequent enough, alternatives - like an additional timer for weak references - should be investigated)
  • Compared to the solution in the JavaSpecialists newsletter, this is more similar to a fixed thread pool executor, while the solution from the newsletter is similar to a cached thread pool executor.
  • This implementation is ideal if (a) the tasks are (mostly) short and (mostly) uniform and (b) there are few (one or two) threads submitting new tasks, since multiple submissions are mutually exclusive (but submission and execution aren't)
  • If, immediately after a "rollup" is submitted (and before it can be executed), tasks of the same kind are submitted, they will unnecessarily be forced onto one thread. We could add code to rearrange tasks after the rollup task finishes if this becomes an issue.

Have fun with the source code! (maybe some day I'll find the time to remove all the rough edges).

* somewhat of a misnomer, since there are still locks, only at a lower - CPU not OS - level, but this is the accepted terminology

** benchmarking indicated this to be the most performant solution. This was inspired by the implementation of ThreadPoolExecutor.

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on! Want to write for the blog? We are looking for contributors to fill all 24 slots and would love to have your contribution! Contact Attila Balazs to contribute!

Java Runtime options

0 comments


The Java runtime is a complex beast - and it has to be, since it runs officially on seven platforms and unofficially on many more. Given this, it is normal that there are many knobs and dials to control how things function. The better-known ones are:

  • -Xmx for the maximum heap size
  • -client and -server for selecting the default set of parameters from classes of defaults
  • -XX:MaxPermGen for controlling the permanent generation size

Other than these, it is (very) rarely the case that you need to change the defaults. However, thanks to Java being open source you can see the list of options, their default values and a short explanation directly from the source code. Currently there are almost 800 options in there!

Another way to see the options (but one which unfortunately doesn't display the explanations) is the following command:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version

These options are well worth studying. Not for tweaking them (since there is a wealth of testing behind the defaults the extent of which would be very hard to replicate), but rather to understand the different functionalities offered by the JVM (for example why you might not see stacktraces in exceptions).
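As an aside, a running JVM can also report the flags it was explicitly started with (though not the ~800 defaults) through the standard management API; a small sketch:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmArgs {
    // Returns the -X / -XX / -D arguments this JVM was started with.
    // Note: this reflects only explicitly passed flags, not the defaults
    // that -XX:+PrintFlagsFinal shows.
    public static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        for (String a : inputArguments()) {
            System.out.println(a);
        }
    }
}
```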


Wednesday, December 12, 2012

Changes to String.substring in Java 7



It is common knowledge that Java optimizes the substring operation for the case where you generate a lot of substrings of the same source string. It does this by using the (value, offset, count) way of storing the information. See an example below:

In the above diagram you see the strings "Hello" and "World!" derived from "Hello World!" and the way they are represented in the heap: there is one character array containing "Hello World!" and two references to it. This method of storage is advantageous in some cases, for example for a compiler which tokenizes source files. In other instances it may lead to an OutOfMemoryError (if you routinely read long strings and keep only a small part of them - the above mechanism prevents the GC from collecting the original String buffer). Some even call it a bug. I wouldn't go so far, but it's certainly a leaky abstraction, because you were forced to do the following to ensure that a copy was made: new String(str.substring(5, 6)).
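For reference, the pre-7u6 defensive-copy idiom mentioned above looks like this (keepSmallPart is a made-up name for illustration):

```java
// Pre-Java-7u6 idiom: force a copy of the backing char[] so the GC can
// reclaim the large source string.
public class SubstringCopy {
    public static String keepSmallPart(String large) {
        // On old JVMs large.substring(...) shared large's char[];
        // wrapping it in new String(...) detached the small result.
        return new String(large.substring(0, 5));
    }
}
```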

This all changed in May of 2012, with Java 7u6. The pendulum has swung back and now full copies are made by default. What does this mean for you?

  • For most, it is probably just a nice piece of Java trivia
  • If you are writing parsers and the like, you can no longer rely on the implicit caching provided by String. You will need to implement a similar mechanism based on buffering and a custom implementation of CharSequence
  • If you were doing new String(str.substring(...)) to force a copy of the character buffer, you can stop as soon as you update to the latest Java 7 (and you need to do that quite soon, since Java 6 is being EOL'd as we speak).

Thankfully the development of Java is an open process and such information is at the fingertips of everyone!

A couple more references (since we don't say pointers in Java :-)) related to Strings:

  • If you are storing the same string over and over again (maybe you're parsing messages from a socket, for example), you should read up on alternatives to String.intern() (and also consider reading Item 50 from the second edition of Effective Java: Avoid strings where other types are more appropriate)
  • Look into (and do benchmarks before using them!) options like UseCompressedStrings (which seems to have been removed), UseStringCache and StringCache
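One frequently suggested alternative to String.intern() is a small application-level pool; a minimal sketch (the class name is mine). Unlike the intern table, it lives on the ordinary heap and can be discarded wholesale by dropping the map:

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal application-level string pool: canonical() returns a single
// shared instance for each distinct string value.
public class StringPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    public String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```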

Hope I didn't string you along too much and that you found this useful! Until next time
- Attila Balazs


Saturday, December 01, 2012

(Re)Start me up!


This post was originally published as part of the Java Advent series.

There are cases where you would like to start a Java process identical to the current one (or at least one using the same JVM with tweaked parameters). Some concrete cases where this would be useful:

  • Auto-tuning the maximum memory parameters (ie. you have an algorithm to determine the optimal value - for example: 80% of the system memory - and your JVM wasn't started with that particular value)
  • Creating a cluster of processes for high(er)-availability (true HA implies multiple physical nodes) or because processes have different roles (like the components in MongoDB).
  • Daemonizing the current process (that is, the background process should run even after the launching process has terminated) - this is a very frequent modus-operandi for programs on *nix systems where you have the foreground "control" process and the background "daemon" process (not to be confused with the "daemon" threads).

Doing this is relatively simple - and can be done in pure Java - after you find the correct API calls:

import java.io.File;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

List<String> arguments = new ArrayList<>();
// the java executable
arguments.add(String.format("%s%sbin%sjava",
    System.getProperty("java.home"), File.separator, File.separator));
// pre-executable arguments (like -D, -agent, etc.)
arguments.addAll(ManagementFactory.getRuntimeMXBean()
    .getInputArguments());

String classPath = System.getProperty("java.class.path");
String javaExecutable = System.getProperty("sun.java.command");
if (classPath.equals(javaExecutable)) {
    // was started with -jar
    arguments.add("-jar");
    arguments.add(javaExecutable);
} else {
    arguments.add("-classpath");
    arguments.add(classPath);
    arguments.add(javaExecutable);
}

// we might add additional arguments here which will be received by the
// launched program in its args[] parameter
arguments.add("runme");

// launch it! (start() throws IOException)
new ProcessBuilder().command(arguments).start();

Some explanations about the code:

  • It is largely inspired by this project
  • We assume that the java executable is named java and is located at bin/java relative to java.home. We use File.separator to keep the code portable.
  • getInputArguments is used to get the specific arguments passed to the JVM (like -Xmx). It does not include the classpath.
  • The classpath is taken from the java.class.path system property.
  • Finally, there is one heuristic step: we try to detect whether we were launched using the -jar myjar.jar syntax or the MyMainClass syntax and replicate it.

This is it! After that we use ProcessBuilder (which we should always favour over Runtime.exec(String), because it passes each argument separately and spares us from hand-escaping the command line).

A final thought: if you intend to use this method to "daemonize" a process (that is: to ensure that it stays running after its parent process has terminated) you should do two things:

  • Redirect the standard input and output. By default they are redirected into temporary buffers and the JVM will seemingly randomly terminate when those buffers (pipes) fill up.
  • Under Windows use javaw instead of java. This ensures that the process won't be tied to the console it was started from (however it will still be tied to the user login session and will terminate when the user logs out - for a more heavy-duty solution look into the Java Service Wrapper).
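The first point translates roughly to the following additions to the ProcessBuilder call (a sketch of my own, not the code from the repository; the file names are placeholders):

```java
import java.io.File;
import java.util.List;

public class Daemonize {
    // Sketch: a ProcessBuilder set up so the child can outlive the parent.
    public static ProcessBuilder configure(List<String> arguments) {
        ProcessBuilder pb = new ProcessBuilder(arguments);
        // Redirect stdout/stderr into files so the child never blocks on
        // a pipe that nobody drains once the parent has exited.
        pb.redirectOutput(new File("daemon.out"));
        pb.redirectError(new File("daemon.err"));
        return pb;
    }
}
```

Under Windows, you would additionally swap the executable for javaw as described above.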

This is it for today; hope you enjoyed it and found it useful. If you run the code and it doesn't work as advertised, let me know so that I can update it (I'm especially interested in whether it works with non-Sun/Oracle JVMs). Come back tomorrow for another article!

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license.

Sunday, November 25, 2012

Upgrading from MySQL to MariaDB on Ubuntu


So you decided that Oracle doesn't know its left foot from the back of his neck when it comes to open source (how's that for a mixed metaphor), but you are not ready just yet to migrate over to PostgreSQL? Consider MariaDB. Coming from Monty Widenius, the original author of MySQL, it aims to be 100% MySQL compatible while also being truly open-source.

Given that it's 100% MySQL compatible, you can update in-place (nevertheless it is recommended that you back up your data first). The steps are roughly adapted from here.

  1. Go to the MariaDB repository configuration tool and generate your .list file (wondering what's up with the 5.5 vs 10.0 version? See this short explanation). You don't know the exact Ubuntu version you're running? Just use lsb_release -a.
  2. Save the generated file under /etc/apt/sources.list.d/MariaDB.list as recommended and do a sudo aptitude update. You should see output complaining about some missing public keys.
  3. Do sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xCBCB082A1BB943DB to add those keys (replace the last number with the one you saw in the previous output).
  4. Issue sudo apt-cache policy mysql-common and you should see mariadb as an upgrade option.
  5. Finally do sudo aptitude upgrade mysql-common libmysqlclient18 and watch your MySQL database being transformed into a MariaDB one, with everything chugging along just as usual!

Cluj-Napoca (Romania) wins the title of European Youth Capital 2015


We interrupt our regular (lack of) posting to bring you this news:

Cluj-Napoca is European Youth Capital 2015. Congratulations to everyone involved!

Tuesday, November 06, 2012

Writing beautiful code - not just for the aesthetic value


This article was originally published in the 6th edition of TodaySoftMag in Romanian and on the Transylvania JUG blog in English. Reprinted here with the permission of the author / magazine.

Most mainstream programming languages contain a large set of features and diverse standard libraries. Because of this it becomes important to know not only “how” you can achieve something (to which there are usually several answers) but also “what is the recommended way”.

In this article I will argue that knowing and following the recommended ways of coding doesn’t only yield shorter (easier to write), easier to read, understand and maintain code but also prevents programmers from introducing a lot of bugs.

This particular article needs a drop of Java language knowledge to savour, but the fundamental idea can be generalized to any programming language: there is more to using a language efficiently than just knowing the syntax.

Example 1: Double Trouble

Let's start with a snippet of code: what does it print out?

Double d1 = (5.0d - 5.0d) *  1.0d;
Double d2 = (5.0d - 5.0d) * -1.0d;
System.out.println(d1.equals(d2));

What about the following one?

double d1 = (5.0d - 5.0d) *  1.0d;
double d2 = (5.0d - 5.0d) * -1.0d;
System.out.println(d1 == d2);

The answer seems to be clear: in both cases we multiply zero with different values (plus and minus one respectively), thus the result should be zero which should compare as equal regardless of the comparison method used (calling the equals method on objects or using the equality operator on the primitive values).

If we run the code, the result might surprise us: the first one displays false while the second one displays true. What's going on? On one level we can talk about the technical reasons behind this result: floating point values are represented in Java (and many other programming languages) using the sign-and-magnitude notation defined in the IEEE Standard 754. Because of this technical detail both "plus zero" and "minus zero" can be represented by variables of this type. And the "equals" method on Double (and Float) objects in Java considers these values to be distinct.

On another level however we could have avoided this problem entirely by using the primitive values as shown in the second code snippet and as suggested by Item 49 in the Effective Java book[1]: Prefer primitive types to boxed primitives. Using primitive types is also more memory efficient and saves us from having to create special cases for the null value.

Sidenote: we have a similar situation with the BigDecimal class[2] where values scaled differently don’t compare as equal. For example the following snippet also prints false:

BigDecimal d1 = new BigDecimal("1.2");
BigDecimal d2 = new BigDecimal("1.20");
System.out.println(d1.equals(d2));

The answer in this case (given that there is no primitive equivalent for this class) is to use the compareTo method and assert that it returns zero instead of using the equals method. (Note that compareTo does not rescue us in the Double/Float case: like equals, it considers minus zero to be less than plus zero, so preferring primitives remains the advice there.)
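Spelled out, the compareTo-based check looks like this (the helper name is mine):

```java
import java.math.BigDecimal;

public class ScaleSafeEquals {
    // BigDecimal.equals() is scale-sensitive ("1.2" != "1.20");
    // compareTo() compares the numerical value and ignores the scale.
    public static boolean numericallyEqual(BigDecimal a, BigDecimal b) {
        return a.compareTo(b) == 0;
    }
}
```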

Example 2: Where is my null at?

What does the following snippet of code print out?

Double v = null;
Double d = true ? v : 0.0d;
System.out.println(d);

At first glance we would say: null, since the condition is true and v is null (and null can be assigned to a reference of any type, so we are allowed to use it). The actual result is however a NullPointerException at the second line. This is because the right-hand type of the assignment is actually double (the primitive type) not Double (as we would expect) which is silently converted into Double (the boxed type). The generated code looks like this:

Double d = Double.valueOf(true ? v.doubleValue() : 0.0d);

This behavior is described in the Java Language Specification[3]:

“If one of the second and third operands is of primitive type T, and the type of the other is the result of applying boxing conversion (§5.1.7) to T, then the type of the conditional expression is T.”

I would venture to guess that not many of us have read the JLS in its entirety, and even if we had, we might not have realized the implications of each phrase. The recommendation from EJ2nd mentioned in the previous example saves us again: we should use primitive types. We can also draw a parallel with Item 43: Return empty arrays or collections, not nulls. Had we used a "neutral element" - which is analogous to using empty arrays/collections - the problem would not have appeared. (The neutral element would be 0.0d if we use the value later in summation or 1.0d if we use it in multiplication.)
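For completeness, keeping both operands of the conditional boxed also sidesteps the rule quoted above, since no unboxing takes place when both branches are of type Double (the helper is a sketch of mine):

```java
public class ConditionalBoxing {
    // Both branches are Double, so the conditional expression's type is
    // Double and no unboxing (hence no NullPointerException) occurs.
    public static Double pick(boolean cond, Double v) {
        return cond ? v : Double.valueOf(0.0d);
    }
}
```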

Example 3: We come up empty

What is the difference between the following two conditions?

Collection<V> items;
if (items.size() == 0) { ... }
if (items.isEmpty()) { ... }

One could argue that they do exactly the same thing as being empty is equivalent to having zero items. Still, the second condition is easier to understand (we can almost read it out loud: “if items is empty then …”). But there is more: in some cases it can be much, much faster. Two examples from the Java standard libraries where the time needed to execute “size” grows linearly with the number of elements in the collection while “isEmpty” returns in constant time: ConcurrentLinkedQueue[4] and the view sets returned by TreeSet’s[5] headSet/tailSet methods. And while the documentation for the first mentions this fact, it doesn’t for the second.

This is yet another example of how nicer code is also faster.

Example 4: Careful with that static, Eugene!

What will the following snippet of code print out?

public final class Test {
        private static final class Foo {
                static final Foo INSTANCE = new Foo(); // 2
                static final String NAME = Foo.class.getName(); // 3
                Foo() {
                        System.err.println("Hello, my name is " + NAME);
                }
        }
        public static void main(String[] args) {
                System.err.println("Your name is what?\nYour name is who?\n");
                new Foo(); // 1
        }
}

It will be

Your name is what?
Your name is who?

Hello, my name is null
Hello, my name is Test$Foo

The (probably) unexpected null value happens because we obtain a reference to a partially constructed object:

  • We start to create an instance of Foo at point 1
  • This being the first reference to Foo, the JVM loads it and starts to initialize it
  • Initializing Foo involves initializing all its static fields
  • The initialization of the first static field contains a call to the constructor at point 2 which is dutifully executed
  • At this point the NAME static field is not yet initialized, so the constructor will print out null

This code demonstrates that static fields can be confusing and we shouldn’t use them for things other than constants (but even then we should evaluate if the constant is not better declared as an Enum). By the same token we should also avoid singletons which make our code harder to test (thus avoiding them will make the code easier to test).

We should however favor static member classes over non-static ones (Item 22 in EJ2nd). Static classes in Java are entirely distinct conceptually from static fields and it is unfortunate that the same word was used to describe them both.

We should also run static analysis tools on our code and verify their output frequently (ideally at every commit). For example the bug presented is caught by Findbugs[6] and tools incorporating Findbugs.

Example 5: Remove old cruft

Name four things wrong with the following snippet:

// WRONG! DON’T DO THIS!
Vector v1;
...
if (!v1.contains(s)) { v1.add(s); }

They would be:

  • The wrong container type is used. We clearly want each string present at most once, which suggests using a Set<>, with the benefits of shorter and faster code (the above method gets linearly slower with the number of elements)
  • It doesn't use generics
  • It unnecessarily synchronizes access to the structure if it is only used from a single thread
  • If the structure is actually used from multiple threads, the code is not thread safe, only “exception safe” (as in: no exceptions will be raised, but the data structure can be silently corrupted possibly creating a lot of headache downstream)

All of these can be avoided by dropping Vector and its siblings (Hashtable, StringBuffer) and using the Java Collection Framework (available for 14 years[7]) with generics (available for 8 years[8]).
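For comparison, the same logic written with the collections framework and generics collapses to a single call, since Set.add already refuses duplicates (the wrapper class is for illustration only):

```java
import java.util.HashSet;
import java.util.Set;

public class DedupExample {
    // Set.add() returns false for duplicates, so the contains-then-add
    // dance disappears along with the linear scan.
    public static Set<String> dedup(String... items) {
        Set<String> seen = new HashSet<>();
        for (String s : items) {
            seen.add(s);
        }
        return seen;
    }
}
```

For genuinely concurrent use, a thread-safe set such as one backed by ConcurrentHashMap would be the analogous choice.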

Conclusion

There are many more examples one could give, but I think the point is well made that knowing a programming language means more than just knowing the syntax at a basic level. I'm urging you, if you are using Java: get yourself copies of “Effective Java, 2nd edition” and “Java™ Puzzlers: Traps, Pitfalls, and Corner Cases” and read through them if you haven't done so already. Also, use static analysis on your code (Sonar[9] is a good choice in this domain) and consider fixing the issues signaled by it, or at least read up on them.

Again, the conclusion is similar for other languages:

  • Try reading up on best practices/idiomatic ways to write code in the given language. For example for Perl the best book currently is “Modern Perl[10]” by chromatic
  • Look to see if there is a good quality static analysis / lint program for your language. For Perl there is Perl::Critic[11], for Python there is pep8[12] and pylint[13], all of which are free and open source

Being a good programmer (or architect, or business analyst, etc.) is a process of lifelong learning, and these are the tools which can help us truly learn a programming language.


[1]Joshua Bloch: Effective Java, Second Edition. ISBN: 0321356683

Thursday, November 01, 2012

A (mostly) cross-platform clickable map of Romanian counties


TL;DR - I've created a map of Romania's counties (judete) using Javascript which can be used as an alternative for drop-down boxes during form input. It works great with FF and Chrome :-). See it in action and browse the source code.

The map is based on the SVG drawings from Wikimedia Commons postprocessed in Inkscape to simplify the paths (and to remove the Black Sea from it). After that it was converted to Raphael.js using readysetraphael. I also use ScaleRaphael to be able to render it at an arbitrary size. According to BrowserStack (a great service well worth its money BTW if you do web development!) it works with Firefox (starting with version 3.6), Chrome and Safari. IE has a hard time with it, but hopefully as the underlying library improves, it will start to work :-).

Saturday, October 27, 2012

Every end is a new beginning

TL;DR: I'm shutting down the twitfeeder project (it was on life support for a long time), so I mirrored a technical article from its blog in the hope that it might be useful for somebody someday.

Proxying URL fetch requests from the Google App Engine


Hosting on the Google App Engine means giving up a lot of control: your application will run on some machines in one of Google's datacenters, use some other machines to store data and use yet another set of machines to talk to the outside world (send/receive emails, fetch URLs, etc). You don't know the exact identity of these machines (which for this article means their external IP address), and even though you can find out some of the details (for example by fetching the ipchicken page and parsing the result, or by sending an e-mail and looking at the headers), there is no guarantee that the results are repeatable (i.e. that if you do it again and again you will get the same result) in the short or long run. While one might not care about such details most of the time, there are some corner cases where it would be nice to have a predictable and unique source address:
  • You might worry that some of the exit points get blocked because other applications on the Google App Engine have an abusive behavior

  • You might want a single exit point so that you can "debug" your traffic at a low (network / tcpdump) level

  • And finally, the main reason for which the Twit Feeder uses it: some third-party services (like Twitter or Delicious) use the source IP to whitelist requests to their API
To be fair to the GAE architects: the above considerations don't affect the main use case and are more of a corner case if we look at the average application running on the GAE. Also, having so few commitments (i.e. they don't stipulate things like "all the URL fetches will come from the 10.0.0.1/24 netblock") means that they are free to move things around (even between datacenters) to optimize uptime and performance, which in turn benefits the application owners and users.

Back to our scenario: the solution is to introduce an intermediary which has a static and well-known IP and let it do all the fetching. Ideally I would have installed Squid and been done with it, but the URL fetch service doesn't currently support proxies. So the solution I came up with looks like this:

  • Get a VPS server. I would recommend one which gives you more bandwidth rather than more CPU / memory / disk. It is also a good idea to get one in the USA to minimize latency from/to the Google datacenters. I'm currently using VPSLink and I haven't had any problems (full disclosure: that is a referral link and you should get 10% off for lifetime if you use it).

  • Install Apache + PHP on it

  • Use a simple PHP script to fetch the page encoded in a query parameter using the php_curl extension.
To spice up this blogpost ;-), here is a diagram of the process:



A couple of points:

  • Taking care of a VPS can be challenging, especially if you aren't a Linux user. However, failing to do so can result in it being taken over by malicious individuals. Consider getting a managed VPS. Also, three quick security tips: use strong passwords, move the SSH server to a non-standard port and configure your firewall to be as restrictive as possible (deny everything, and open ports to only a limited number of IP addresses).

  • Yes, this introduces a single point of failure into your application, so plan accordingly (while your application is small, this shouldn't be a problem - as it grows you can get two VPS's at two different data centers for example for added reliability).

  • The traffic between Google and your VPS can be encrypted using TLS (HTTPS). The good news is that the URL fetcher service doesn't check the certificates, so you can use self-signed ones. The bad news is that the URL fetcher doesn't check the certificates, so a determined attacker can relatively easily man-in-the-middle you (but it protects the data from the casual sniffer).

  • Be aware that you need to budget for double of the amount of traffic you estimate using at the VPS (because it needs to first download it and then upload it back to Google). The URL fetcher service does know how to use gzip compression, so if you are downloading mainly text, you shouldn't have a bandwidth problem.

  • PHP might seem like an unusual choice of language, given how most GAE users have experience in either Python or Java, but there are a lot of tutorials out there on how to install it (and on modern Linux distribution it can be done in under 10 minutes with the help of the package manager) and it was the one I was most comfortable with as an Apache module.
Without further ado, here are the relevant source code snippets (which I hereby release into the public domain):

The PHP script:


<?php
error_reporting(0);
require 'privatedata.php';

if ($AUTH != @$_SERVER['HTTP_X_SHARED_SECRET']) {
   header("HTTP/1.0 404 Not Found");
   exit;
}

$url = @$_GET['q'];
if (!isset($url)) {
   header("HTTP/1.0 404 Not Found");
   exit;
}

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

if (isset($_SERVER['HTTP_USER_AGENT'])) {
   curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
}
if (isset($_SERVER['PHP_AUTH_USER']) && isset($_SERVER['PHP_AUTH_PW'])) {
   curl_setopt($ch, CURLOPT_USERPWD, "{$_SERVER['PHP_AUTH_USER']}:{$_SERVER['PHP_AUTH_PW']}");
}

if (count($_POST) > 0) {
   curl_setopt($ch, CURLOPT_POST, true);
   $postfields = array();
   foreach ($_POST as $key => $val) {
       $postfields[] = urlencode($key) . '=' . urlencode($val);
   }
   curl_setopt($ch, CURLOPT_POSTFIELDS, implode('&', $postfields));
}

$headers = array();
function header_callback($ch, $header) {
   global $headers;
   // we add our own content encoding
   // also, the content length might vary because of this
   if (false === stripos($header, 'Content-Encoding: ')
       && false === stripos($header, 'Content-Length: ')
       && false === stripos($header, 'Transfer-Encoding: ')) {
       $headers[] = $header;
   }
   return strlen($header);
}

curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'header_callback');

$output = curl_exec($ch);
if (FALSE === $output) {
   header("HTTP/1.0 500 Server Error");
   print curl_error($ch);
   exit;
}
curl_close($ch);

foreach ($headers as $header) {
   header($header);
}
print $output;


The "privatedata.php" file looks like this:



<?php

$AUTH = '23tqhwfgj2qbherhjerj';


Two separate files are used to avoid committing the password to the source repository, while still keeping all the source code open.


Now with the code in place, you can test it using curl:


curl --compressed -v -H 'X-Shared-Secret: 23tqhwfgj2qbherhjerj' 'http://127.0.0.1:1483/fetch_url.php?q=http%3A//identi.ca/api/help/test.json'


As you can see, a custom header is used for authentication. Another security measure is to use a non-standard port. Limiting the requests to the IPs of the Google datacenters at the firewall would be the ideal solution, but given that this is the problem we are trying to solve in the first place (the Google datacenters not having an officially announced set of IP addresses), it doesn't seem possible.


Finally, here is the Python code to use the script from within an application hosted on GAE:



from urllib import quote
from google.appengine.api import urlfetch

# ...

fetch_headers['X-Shared-Secret'] = '23tqhwfgj2qbherhjerj'
result = urlfetch.fetch(url='http://127.0.0.1:1483/fetch_url.php?q=%s' % quote(url),
                        payload=data,
                        method=urlfetch.POST if data else urlfetch.GET,
                        headers=fetch_headers, deadline=timeout,
                        follow_redirects=False)


A little more discussion about the code:

  • A custom header was chosen for authorization because the script needs to forward the real authentication data (i.e. the "Authorization" header); specifically, this is what the Twitter API uses for verifying identity

  • Speaking of things the script forwards: it forwards the user agent and any POST data, if present. Forwarding other headers could easily be added.

  • Passing variables in a GET request is also supported (they will be double-encoded, but that shouldn't be a concern in all but the most extreme cases)

  • If we are talking about sensitive data, cURL (and the cURL extension for PHP) has the ability to fetch HTTPS content and to verify the certificates.
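To make the double-encoding point above concrete, here is what happens to a parameter that is percent-encoded once by the client and once more when it is embedded into the proxy's q parameter (a small sketch using Python 3's urllib.parse; the values are illustrative). It is harmless because each decoding step peels off exactly one layer:

```python
from urllib.parse import quote, unquote

original = "http://identi.ca/api/help/test.json"
once = quote(original, safe="")   # what a normal client would send
twice = quote(once, safe="")      # the extra layer added by embedding it in ?q=

# nothing is lost: decoding twice recovers the original value
assert unquote(twice) == once
assert unquote(unquote(twice)) == original
```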
While this method might look cumbersome, in practice I found it to work quite well. Hopefully this information will help others facing the same problem. A final note: if you have questions about the code or about other aspects, post them in the comments and I will try to answer them as fast as possible. I'm also considering launching a "proxy service" for GAE apps to make this process much simpler (abstracting away the setup and administration of the VPS), so if you would be interested in paying a couple of bucks for such a service, please contact me either directly or through the comments.

Helper for testing multi-threaded programs in Java

0 comments

This post was originally published on the Transylvania JUG blog.

Testing multi-threaded code is hard. The main problem is that you invoke your assertions either too soon (and they fail for no good reason) or too late (in which case the test runs for a long time, frustrating you). A possible solution is to declare an interface like the following:

interface ActivityWatcher {
 void before();
 void after(); 
 void await(long time, TimeUnit timeUnit) throws InterruptedException, TimeoutException;
}


It is intended to be used as follows:
  • “before” is called before the asynchronous task is delegated to an execution mechanism (threadpool, fork-join framework, etc) and it increments an internal counter.
  • “after” is called after the asynchronous task has completed and it decrements the counter.
  • “await” waits for the counter to become zero
The net result is that when the counter is zero, all your asynchronous tasks have executed and you can run your assertions. See the example code. A couple more considerations:
  • There should be a single ActivityWatcher per test (injected through constructors or a dependency injection framework)
  • In production code you will use a dummy/noop implementation which removes any overhead.
  • This only works for situations where the asynchronous tasks are kicked off immediately, i.e. it doesn't work for situations where we have periodically executing tasks (say, every 5 seconds) and we want to wait for the 7th task to be executed, for example.
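The before/after/await protocol described above can be implemented with plain intrinsic locking. The sketch below is my own minimal version (the class name and details are assumptions, not taken from the linked example code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical implementation sketch of the ActivityWatcher protocol.
class CountingActivityWatcher {
    private int pending = 0;  // tasks submitted but not yet completed

    public synchronized void before() {
        pending++;
    }

    public synchronized void after() {
        pending--;
        if (pending == 0) {
            notifyAll();  // wake up any thread blocked in await()
        }
    }

    public synchronized void await(long time, TimeUnit timeUnit)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeUnit.toNanos(time);
        while (pending > 0) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException(pending + " task(s) still pending");
            }
            TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
        }
    }
}
```

Note the while loop around the wait: it guards against spurious wakeups and re-checks the counter after every notification.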
One thing the above code doesn't do is collect exceptions: if exceptions happen on threads other than the one executing the test runner, they will just die and the test runner will happily report that the tests pass. You can work around this in two ways:
  • use the default UncaughtExceptionHandler to capture all exceptions and rethrow them in the test runner if they arise (not so nice because it introduces global state – you can't have two such tests running in parallel, for example)
  • extend ActivityWatcher (and the code calling it) so that it has a “collect(Throwable)” method which gets called with the uncaught exceptions, and have “await” rethrow them.
Implementing this is left as an exercise to the reader :-)

GeekMeet talk about Google App Engine

0 comments

The GAE presentation I gave at the 12th edition of Cluj Geek Meet can be found here (created using reveal.js).

You can find the source code here.

Sunday, August 19, 2012

Lightning talk at Cluj.PM

0 comments

The slides from my Cluj.PM lightning talk:

It was a stressful (but fun!) experience. Thanks to the organizers!

Running pep8 and pylint programmatically

0 comments

Tools like pep8 and pylint are great, especially given the huge amount of dynamism involved in Python – which results in many opportunities to shoot yourself in the foot. Sometimes, however, you want to invoke these tools in more specialized ways, for example only on the files which changed since the last commit. Here is how you can do this from a Python script and capture their output for later post-processing (maybe you want to merge the output from both tools, or maybe you want to show only the lines which changed since the last commit, etc.):

import sys
from StringIO import StringIO  # on Python 3: from io import StringIO

import pep8

try:
  sys.stdout = StringIO()  # pep8 writes its report to stdout, so capture it
  pep8_checker = pep8.StyleGuide(config_file=config, format='pylint')
  pep8_checker.check_files(paths=[ ...path to files/dirs to check... ])
  output = sys.stdout.getvalue()
finally:
  sys.stdout = sys.__stdout__

from StringIO import StringIO  # on Python 3: from io import StringIO

from pylint.lint import Run
from pylint.reporters.text import ParseableTextReporter

reporter = ParseableTextReporter()
result = StringIO()
reporter.set_output(result)  # pylint writes through the reporter, not stdout
Run(['--rcfile=pylint.config'] + [ ...files... ], reporter=reporter, exit=False)
output = result.getvalue()
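To get the "files which changed since the last commit" mentioned in the introduction, one possible approach (assuming the project lives in a git repository; the helper names below are my own) is:

```python
import subprocess

def python_files(names):
    """Keep only the Python source files from a list of paths."""
    return [name for name in names if name.endswith('.py')]

def changed_python_files(base='HEAD'):
    """Ask git for the files modified since `base` (assumes a git checkout)."""
    out = subprocess.check_output(['git', 'diff', '--name-only', base])
    return python_files(out.decode('utf-8').splitlines())

# the result can then be passed straight to the checkers, e.g.:
# pep8_checker.check_files(paths=changed_python_files())
```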

It is recommended that you use pylint/pep8 installed through pip/easy_install rather than from the Linux distribution repositories, since the latter are known to contain outdated versions. You can check for this via code like the following:

import logging
import sys

import pkg_resources
from pkg_resources import parse_version

if pkg_resources.get_distribution('pep8').parsed_version < parse_version('1.3.3'):
    logging.error('pep8 too old. At least version 1.3.3 is required')
    sys.exit(1)
if pkg_resources.get_distribution('pylint').parsed_version < parse_version('0.25.1'):
    logging.error('pylint too old. At least version 0.25.1 is required')
    sys.exit(1)

Finally, if you have to use an old version of pep8, the code needs to be modified as follows (however, this older version probably won't be of much use and will most likely annoy you – you should really try to use an up-to-date version, for example by isolating it in a virtualenv):

result = []
import pep8
pep8.message = lambda msg: result.append(msg)
pep8.process_options(own_code)
for code_dir in [ ...files or dirs... ]:
    pep8.input_dir(code_dir)

Wednesday, August 15, 2012

Clearing your Google App Engine datastore

0 comments

Warning! This is a method to erase the data from your Google App Engine datastore. There is no way to recover your data after you go through with this! Only use this if you're absolutely certain!

If you have a GAE account used for experimentation, you might want to clean it up occasionally (erase the contents of the datastore and blobstore associated with the application). Doing this through the admin interface can become very tedious, so here is an alternative method:

  1. Start your Remote API shell
  2. Use the following code to delete all datastore entities:
    while True: keys=db.Query(keys_only=True).fetch(500); db.delete(keys); print "Deleted 500 entries, the last of which was %s" % keys[-1].to_path()
  3. Use the following code to delete all blobstore entities:
    from google.appengine.ext.blobstore import *
    while True: list=BlobInfo.all().fetch(500); delete([b.key() for b in list]);  print "Deleted elements, the last of which was %s" % list[-1].filename
    

The above method is inspired by this stackoverflow answer, but has the advantage that it does the deletion in smaller steps, meaning that the risk of the entire operation being aborted because of deadline-exceeded or over-quota errors is greatly reduced.

Final caveats:

  • This can be slow
  • This consumes your quota, so you might have to do it over several days or raise your quota
  • The code is written in a very non-pythonic way (multiple statements on one line) for the ease of copy-pasting
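For reference, the batching idea behind those one-liners can be written out in a more readable form. The sketch below is my own: the fetch and delete operations are passed in as callables (so the same loop serves both the datastore and the blobstore), and the loop exits cleanly when a fetch comes back empty instead of relying on an IndexError:

```python
def delete_in_batches(fetch_batch, delete_batch, batch_size=500):
    """Repeatedly fetch up to batch_size items and delete them.

    Returns the total number of items deleted. Stops when a fetch
    returns nothing, i.e. the store is empty.
    """
    total = 0
    while True:
        batch = fetch_batch(batch_size)
        if not batch:
            return total
        delete_batch(batch)
        total += len(batch)

# On App Engine this could be wired up as (untested sketch):
# delete_in_batches(lambda n: db.Query(keys_only=True).fetch(n), db.delete)
# delete_in_batches(lambda n: BlobInfo.all().fetch(n),
#                   lambda blobs: delete([b.key() for b in blobs]))
```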