Monday, February 6, 2012

What is behind System.nanoTime()?

In java world there is a very good perception about System.nanoTime(). There is always some guys who says that it is fast, reliable and, whenever possible, should be used for timings instead of System.currentTimemillis(). In overall he is absolutely lying, it is not bad at all, but there are some drawback which developer should be aware about. Also, although they have a lot in common, these drawbacks are usually platform-specific.

Windows

Functionality is implemented using QueryPerformanceCounter API, which is known to have some issues. There is possibility that it can leap forward, some people are reporting that is can be extremely slow on multiprocessor machines, etc. I spent a some time on net trying to find how exactly QueryPerformanceCounter works and what is does. There is no clear conclusion on that topic but there are some posts which can give some brief idea how it works. I would say that the most useful, probably are that and that ones. Sure, one can find more if search a little bit, but info will be more or less the same.

So, it looks like implementation is using HPET, if it is available. If not, then it uses TSC with some kind of synchronization of the value among CPUs. Interestingly that QueryPerformanceCounter promise to return value which increases with constant frequency. It means that in case of using TSC and several CPUs it may have some difficulties not just with the fact that CPUs may have just different value of TSC, but also may have different frequency. Keeping all that in mind Microsoft recommends to use SetThreadAffinityMask to stuck thread which calls to QueryPerformanceCounter to single processor, which, obviously, is not happening in JVM.

Linux

Linux is very similar to Windows, apart from the fact that it is much more transparent (I managed to download sources :) ). The value is read from clock_gettime with CLOCK_MONOTONIC flag (for real man, source is available in vclock_gettime.c from Linux source). Which uses either TSC or HPET. The only difference with Windows is that Linux not even trying to sync values of TSC read from different CPUs, it just returns it as it is. It means that value can leap back and jump forward with dependency of CPU where it is read. Also, in contrast to Windows, Linux doesn't keep change frequency constant. On the other hand, it definitely should improve performance.

Solaris

Solaris is simple. I believe that via gethrtime it goes to more or less the same implementation of clock_gettime as linux does. The difference is that Solaris guarantees that counter will not leap back, which is possible on Linux, but it is possible that the same value will be returned back. That guarantee, as can be observed from source code, is implemented using CAS, which requires sync with the main memory and can be relatively expensive on multi-processor machines. The same as on Linux, change rate can vary.

Conclusion

The conclusion is king of cloudy. Developer has to be aware that function is not perfect, it can leap back or just forward. It may not change monotonically and change rate can vary with dependency on CPU clock speed. Also, it is not as fast as many may think. On my Windows 7 machine in a single threaded test it is just about 10% faster than System.currentTimeMillis(), on multi threaded test, where number of threads is the same as number of CPUs, it is just the same. And on IBM Z400 workstation with WinXP System.nanoTime() is always approximately 8 times slower.

So, in overall, all it gives is increase in resolution, which may be important for some cases. And as a final note, even when CPU frequency is not changing, do no think that you can map that value reliably to system clock, see details here (this example describes just Windows, but more or less the same stuff is applicable to all other OSes).

Appendix

Appendix contains implementations of the function for different OSes. Source code is from OpenJDK v.7.

Solaris

// gethrtime can move backwards if read from one cpu and then a different cpu
// getTimeNanos is guaranteed to not move backward on Solaris
inline hrtime_t getTimeNanos() {
  if (VM_Version::supports_cx8()) {
    const hrtime_t now = gethrtime();
    // Use atomic long load since 32-bit x86 uses 2 registers to keep long.
    const hrtime_t prev = Atomic::load((volatile jlong*)&max_hrtime);
    if (now <= prev)  return prev;   // same or retrograde time;
    const hrtime_t obsv = Atomic::cmpxchg(now, (volatile jlong*)&max_hrtime, prev);
    assert(obsv >= prev, "invariant");   // Monotonicity
    // If the CAS succeeded then we're done and return "now".
    // If the CAS failed and the observed value "obs" is >= now then
    // we should return "obs".  If the CAS failed and now > obs > prv then
    // some other thread raced this thread and installed a new value, in which case
    // we could either (a) retry the entire operation, (b) retry trying to install now
    // or (c) just return obs.  We use (c).   No loop is required although in some cases
    // we might discard a higher "now" value in deference to a slightly lower but freshly
    // installed obs value.   That's entirely benign -- it admits no new orderings compared
    // to (a) or (b) -- and greatly reduces coherence traffic.
    // We might also condition (c) on the magnitude of the delta between obs and now.
    // Avoiding excessive CAS operations to hot RW locations is critical.
    // See http://blogs.sun.com/dave/entry/cas_and_cache_trivia_invalidate
    return (prev == obsv) ? now : obsv ;
  } else {
    return oldgetTimeNanos();
  }
}

Linux

jlong os::javaTimeNanos() {
  if (Linux::supports_monotonic_clock()) {
    struct timespec tp;
    int status = Linux::clock_gettime(CLOCK_MONOTONIC, &tp);
    assert(status == 0, "gettime error");
    jlong result = jlong(tp.tv_sec) * (1000 * 1000 * 1000) + jlong(tp.tv_nsec);
    return result;
  } else {
    timeval time;
    int status = gettimeofday(&time, NULL);
    assert(status != -1, "linux error");
    jlong usecs = jlong(time.tv_sec) * (1000 * 1000) + jlong(time.tv_usec);
    return 1000 * usecs;
  }
}

Windows

jlong os::javaTimeNanos() {
  if (!has_performance_count) {
    return javaTimeMillis() * NANOS_PER_MILLISEC; // the best we can do.
  } else {
    LARGE_INTEGER current_count;
    QueryPerformanceCounter(¤t_count);
    double current = as_long(current_count);
    double freq = performance_frequency;
    jlong time = (jlong)((current/freq) * NANOS_PER_SEC);
    return time;
  }
}

References

Inside the Hotspot VM: Clocks, Timers and Scheduling Events
Beware of QueryPerformanceCounter()
Implement a Continuously Updating, High-Resolution Time Provider for Windows
Game Timing and Multicore Processors
High Precision Event Timer (Wikipedia)
Time Stamp Counter (Wikipedia)

Monday, December 5, 2011

The magic of conditional operator

Can you guess what is going to be the output of the following piece of code?
    Object obj = false ? new Long(1) : new Integer(1);
    System.out.println(obj.getClass());
The smart once can guess that it is going to be "class java.lang.Long", otherwise there won't be a question. Now can you answer why? To be honest, I was a surprised by such an outcome, but apparently, according to JLS that is correct behaviour. Here is quotation from "15.25 Conditional Operator ? :":

The type of a conditional expression is determined as follows:
  • If the second and third operands have the same type (which may be the null type), then that is the type of the conditional expression.
  • If one of the second and third operands is of type boolean and the type of the other is of type Boolean, then the type of the conditional expression is boolean.
  • If one of the second and third operands is of the null type and the type of the other is a reference type, then the type of the conditional expression is that reference type.
  • Otherwise, if the second and third operands have types that are convertible (§5.1.8) to numeric types, then there are several cases:
    • If one of the operands is of type byte or Byte and the other is of type short or Short, then the type of the conditional expression is short.
    • If one of the operands is of type T where T is byte, short, or char, and the other operand is a constant expression of type int whose value is representable in type T, then the type of the conditional expression is T.
    • If one of the operands is of type Byte and the other operand is a constant expression of type int whose value is representable in type byte, then the type of the conditional expression is byte.
    • If one of the operands is of type Short and the other operand is a constant expression of type int whose value is representable in type short, then the type of the conditional expression is short.
    • If one of the operands is of type; Character and the other operand is a constant expression of type int whose value is representable in type char, then the type of the conditional expression is char.
    • Otherwise, binary numeric promotion (§5.6.2) is applied to the operand types, and the type of the conditional expression is the promoted type of the second and third operands. Note that binary numeric promotion performs unboxing conversion (§5.1.8) and value set conversion (§5.1.13).
  • Otherwise, the second and third operands are of types S1 and S2 respectively. Let T1 be the type that results from applying boxing conversion to S1, and let T2 be the type that results from applying boxing conversion to S2. The type of the conditional expression is the result of applying capture conversion (§5.1.10) to lub(T1, T2) (§15.12.2.7).

In the other words, it says that if second and third operands are convertible to primitive numeric types, then the result type is based on numeric promotion of converted values. Here how it looks after applying these rules:
Object obj = Long.valueOf(false ? (new Long(1L)).longValue() : (new Integer(1)).intValue());
My opinion, it all looks like a magic, and at the end, that's the price Java paid for autoboxing. Well world is not perfect and that's it's beauty, isn't it? :)

Wednesday, November 9, 2011

Java file flushing performance

There are many situations when it is required to ensure that data was written to the disk and write is also required to be fast. The most most common where it has to happen are databases, journalling, etc. Also, it is often required to update some random position in a file. I specifically what to place emphasis on random access here, as far as the rest will cover just cases where it is supported, i.e. I'm not going to mention OutputStream.flush() & related topics. Just haven't tried it, as far as that wasn't my case at the moment.

There are several way of flushing the data to disk in java. These options can be quite different in the way they implemented internally and in their performance. Here is the list of existing things you can do:
  • FileChannel.force()
  • 'rws' or 'rwd' mode of RandomAccessFile, which 'works much like the force(boolean) method of the FileChannel class' (from javadoc).
  • MappedbyteBuffer.force()
  • RandomAccessFile.getFD().sync()
  • any close() method. Here I mean seek and close stream each time when access is required. Doing tests, I actually didn't seek, as far was updating data with zero offset.
Surprisingly (the only unsurprising exception is close()) all these methods gives very different performance and it varies almost randomly on different OSes and file systems. Worth noticing that hardware can also put it's correction on the performance of any of these methods. I have also a strong feeling that performance may vary even with minor change in OS or JVM version number. Here is the table with time it takes to flush 8bytes (keep in mind, that the real amount of flushed data depends on the size of caches and going to be much more that 8byes), just to give a flavour of how different is that:

RandomAccessFile.
getFD().sync()
RandomAccessFile, rwd mode MappedbyteBuffer.force()FileChannel.force()
Windows 0.2818ms0.0125ms 0.007ms 0.139ms
Linux 0.5354ms 0.5144ms 0.4663ms 0.0093ms

Please, do not treat these numbers as any relevant result. They are here just to give an example how these things can vary.

So, what the conclusion? Conclusion is that if you would need to write high-performance application which does lots of IO, you really need to test different approached on different OSes, on different file systems and, preferably  on different JVMs. Do not expect something to be fast on Linux (Solaris, AIX, etc) production box, when it is fast on your Windows (Linux, etc) workstation and vice versa. As can be seen, the difference can be in orders of magnitude.

Monday, July 25, 2011

Embedding Jetty7

Jetty web container provides the way to build highly modularized web application. It is also as very fast a lightweight. Following example shows how to build simple web application using embedded version of Jetty7.

Wednesday, July 20, 2011

The most complete list of -XX options for Java 6 JVM

There was a wonderful page with that list (http://www.md.pp.ru/~eu/jdk6options.html), but it seems it's gone now and not available any more. Luckily there is http://web.archive.org which allows to make time flowing backwards and recover things which have passed away long time back. So, I have just get that page out of grave and copy/pasted here to return that bit of information back to life.

Tuesday, July 12, 2011

Integration of TeamCity 6.x and Fitnesse

Team city is wonderful CI and build management platform, which combines great capabilities with really simple configuration. I seriously love it for the way it is designed, configured and managed and would recommend it to anybody who cares about quality and wants to have fast feedback on build/test/integration problems. FitNess is “The fully integrated standalone wiki, and acceptance testing framework”, which I, personally, do not like, but testers (at least ones I know) say it is good. I still do not believe them, but have to live with it :)

Sunday, July 3, 2011

EhCache replication: RMI vs JGroups.

Recently, I was working on one product which required replicated caching. Caching provider was already decided - EhCache, and the remained was a question about transport. Which one is the best option? By the best option here I mean just the one which has better performance. The performance measurement was done just between two of available transports - JGroups and RMI, others were not considered, sorry.

Saturday, January 15, 2011

jakarta regexp vs java.util.regex

Have never really been thinking about which library to use for regular expression, it always was java.util.regex by default. But I found that some people prefer to use Jakarta Regexp, and usually there is not clear reason for that choice. So, I decided to spent some time trying to find out what is better and probably write a couple of tests. After some time spent in the Internet looking for the examples of tests, it appeared that there was a good guy already who have done all that work. Here is the link to the page with results of his investigation:

http://tusker.org/regex/regex_benchmark.html

The conclusion is that Jakarta Regexp doesn't look good at all. Also you may notice that there are some other libraries who are doing regexps and some have very impressive performance and seems like worth trying.

Tuesday, November 30, 2010

All you need to know about QuickSort

It would be true to say that Quicksort is one of the most popular sorting algorithms. You can find it implemented on the most of the languages and it is present in almost any core library. In Java and Go Quicksort is default sorting algorithm for some data types and it is used in in C++ STL (Introsoft which is used there begins with Quicksort). Such popularity can be explained by the fact that on average, Quicksort is one of the fastest known sorting algorithms. Interestingly that complexity of Quicksort is not less than it is for other algorithms like MergeSort or HeapSort. The best case performance is O(nlogn) and on the worst case it gives O(n^2). Latter, luckily, is exceptional case for the proper implementation. Quicksort performance is gained by the main loop which tends to make excellent usage of CPU caches. Another reason of popularity is that it doesn't need allocation of additional memory.

Personally for me Quicksort appeared as one of the most complex sorting algorithms. The basic idea is pretty simple and usually takes just a few minutes to implement. But that version, of course, if not practically usable. When it comes to details and to efficiency, it is getting more and more complicated.

Quicksort was first discovered by C.A.R. Hoare in 1962 (see "Quicksort," Computer Journal 5, 1, 1962) and in following years algorithm slightly mutated. The most known version is Three-way Quicksort. The most comprehensive of widely known ones is Dual-Pivot Quicksort. Both algorithms will be covered in that post.

Monday, October 11, 2010

JCaptcha, SecureRandom & performance

Personally, I'm not big fan of JCaptcha library and recently was lucky enough to find another problem there. The problem is related mostly to linux and it's implementation of java.security. SecureRandom, which is known to be slow and is locking on every call to it.

For some reason (I suspect that this was the reason) JCaptacha overuse java.security.SecureRandom when it generates background image using FunkyBackgroundGenerator. Number of calls to get next random float can easily reach something around 100 000 per one captcha image. It is basically called at least once for each pixel.

Lets run some quick tests to have a look how bad is that. I have wrote write simple captch engine:


public static class MyImageCaptchaEngine extends ListImageCaptchaEngine {
protected void buildInitialFactories() {
WordGenerator wgen = new RandomWordGenerator("ABCDEFGHIJKLMNOPQRSTUVWXYZ123456789");
RandomRangeColorGenerator cgen = new RandomRangeColorGenerator(
new int[] {0, 100},
new int[] {0, 100},
new int[] {0, 100});
TextPaster textPaster = new RandomTextPaster(new Integer(7), new Integer(7), cgen, true);

BackgroundGenerator backgroundGenerator = new FunkyBackgroundGenerator(new Integer(200), new Integer(100));

Font[] fontsList = new Font[] {
new Font("Arial", 0, 10),
new Font("Tahoma", 0, 10),
new Font("Verdana", 0, 10),
};

FontGenerator fontGenerator = new RandomFontGenerator(new Integer(20), new Integer(35), fontsList);

WordToImage wordToImage = new ComposedWordToImage(fontGenerator, backgroundGenerator, textPaster);
this.addFactory(new GimpyFactory(wgen, wordToImage));
}
}
And the test itself:



long begin = System.currentTimeMillis();
for(int i = 0; i != 100; ++i) {
engine.getNextCaptcha();
}
long end = System.currentTimeMillis();
System.out.println("Total time is [" + (end - begin) + "]");
now lets run it:

Total time is [10967]

Ok, so what we have now it 10967ms, which, I believe, is bad. It can be significantly improved. I'm not very big fan of high-quality random backgrounds, so I will replace SecureRandom class, used by parent of FunkyBackgroundGenerator with "pseudo" random Random class. Funs of high-quality random backgrounds can still use SecureRandom for seeding, through:


public static class MyFunkyBackgroundGenerator extends FunkyBackgroundGenerator {
public MyFunkyBackgroundGenerator(Integer width, Integer height) {
super(width, height);
try {
Field rndField = AbstractBackgroundGenerator.class.getDeclaredField("myRandom");
rndField.setAccessible(true);
rndField.set(this, new Random());
}
catch (Exception e) {
e.printStackTrace();
}
}
}
I know, that is dirty hack, but as far "myRandom" is declared with default visibility, that's the shortest way to replace it for now . And what we have now is:

Total time is [1308]

Approximately 7 time quicker. Not that bad, specially for a sample case. In real world application improvement will be even more significant because some other processes may use '/dev/(u)random' or application itself can utilize SecureRandom for other purposes.

That's not it. There is another bottleneck, which is usage of Java2D for rendering captcha images. Java2D is well known for it's problems with multi-threading and the summary of these problems is that Java2D doesn't scale well. Some details can be found here. Possibly the way to fix that problem may be removing Java2D and using direct access to image via WritableRaster instead. However, it doesn't solve all problems, as far as Java2D is still used for drawing text.