Tuesday, May 7, 2013

Embedding Jetty9 & Spring MVC

This post is a redo of one of my previous posts, which was about embedding Jetty7. This one is about the new version, Jetty9, and adds support for Spring MVC. I just thought it would be a good idea to keep something like that around as a reference. There is not much text below because the source is clear enough and doesn't need much explanation. Still, feel free to raise questions in the comments.
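
For reference, here is a minimal sketch of the idea, not necessarily identical to the full source: an embedded Jetty9 server with Spring's DispatcherServlet mounted on it. The WebConfig class is a hypothetical Spring @Configuration class, not something from the original post.

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;
import org.springframework.web.context.support.AnnotationConfigWebApplicationContext;
import org.springframework.web.servlet.DispatcherServlet;

public class EmbeddedJettyMain {
    public static void main(String[] args) throws Exception {
        // Spring MVC context driven by an annotated config class (WebConfig is hypothetical)
        AnnotationConfigWebApplicationContext context = new AnnotationConfigWebApplicationContext();
        context.register(WebConfig.class);

        // Mount Spring's DispatcherServlet on the root context
        ServletContextHandler handler = new ServletContextHandler();
        handler.setContextPath("/");
        handler.addServlet(new ServletHolder(new DispatcherServlet(context)), "/*");

        Server server = new Server(8080);
        server.setHandler(handler);
        server.start();
        server.join(); // block until the server stops
    }
}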

Wednesday, March 27, 2013

AtomicFieldUpdater vs. Atomic

Java 1.5 introduced a new family of classes (Atomic*FieldUpdater) for atomic updates of object fields, with properties similar to the Atomic* set of classes, and there seems to be slight confusion about their purpose. That confusion is understandable: the reason for their existence is not very obvious. First of all, they are in no way faster than Atomics - if you look at the source, you will see lots of access control checks. Then, they are not handy: a developer has to write more code, understand a new API, etc.

So why would you bother? There are two main use cases where Atomic*FieldUpdater can be considered as an option:

  • There is a field which is mostly read and rarely changed. In that case, the volatile field can be used for read access and Atomic*FieldUpdater for occasional updates. Though, that optimization is arguable, because there is a good chance that in the latest JVMs Atomic*.get() is intrinsic and should not be slower than a volatile read.
  • Atomics have a much higher memory overhead than primitives. In cases where memory is critical, an Atomic can be replaced with a volatile primitive plus an Atomic*FieldUpdater, as in the sketch below.
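
Here is a minimal sketch of the second case (my own illustration, not code from the discussions linked below): a counter class that keeps its state in a plain volatile long, so each instance pays for one primitive field instead of a separate AtomicLong object, while a single static updater is shared by all instances.

import java.util.concurrent.atomic.AtomicLongFieldUpdater;

public class Counter {
    // One shared updater for all Counter instances
    private static final AtomicLongFieldUpdater<Counter> UPDATER =
            AtomicLongFieldUpdater.newUpdater(Counter.class, "value");

    // Plain volatile field - no extra object per instance
    private volatile long value;

    public long get() {
        return value; // cheap volatile read
    }

    public long increment() {
        return UPDATER.incrementAndGet(this); // atomic read-modify-write
    }
}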

References:
http://concurrency.markmail.org/message/ns4c5376otat2p54?q=FieldUpdater
http://concurrency.markmail.org/message/mpoy74yhuwgi52fa?q=FieldUpdater

Tuesday, March 12, 2013

Scala: Automatic resource management

After completing the wonderful course by Martin Odersky, I have eventually had a chance to play a little with Scala and create something more useful than a "hello world" app. And even though I had some experience with the language just a few weeks before, I felt slightly frustrated. I reckon that is all because I have become too dull and lazy after spending too much time with Java :) The first surprise was realizing that this language has a compiler - with Java it almost doesn't exist: you never 'compile', you 'build', which is a very different kind of thing. With Java you are almost always certain that your code compiles, because modern IDEs (like IntelliJ) do not give you a chance to leave a compilation error in your code. Another surprise is that the Scala compiler is deadly slow; I have a strong feeling that big projects will suffer from it. So, you could say that with Scala it feels like coming back to the good old C++ days :)

OK, that was the introduction. Here is some stuff I wrote, which I am almost sure is just another reinvented wheel, but it was useful for me. After some time with the language, I realized that it doesn't have any standard resource-management construct, which is probably fine for Scala - the language is so flexible that it allows you to build your own without much effort (most of the code is stolen from this post):

  import java.io.{BufferedReader, Closeable, FileReader}

  trait Managed[T] {
    def onEnter(): T
    def onExit(t: Throwable = null)
    def attempt(block: => Unit) {
      // swallow secondary failures (e.g. from close()) so they cannot
      // mask the original exception propagated by `using`
      try { block } catch { case _: Throwable => }
    }
  }

  def using[T <: Any, R](managed: Managed[T])(block: T => R): R = {
    val resource = managed.onEnter()
    var exception = false
    try {
      block(resource)
    } catch {
      case t:Throwable => {
        exception = true
        managed.onExit(t)
        throw t
      }
    } finally {
      if (!exception) {
        managed.onExit()
      }
    }
  }

  def using[T <: Any, U <: Any, R] (managed1: Managed[T], managed2: Managed[U]) (block: T => U => R): R = {
    using[T, R](managed1) { r =>
      using[U, R](managed2) { s => block(r)(s) }
    }
  }

  class ManagedClosable[T <: Closeable](closable:T) extends Managed[T] {
    def onEnter(): T = closable
    def onExit(t:Throwable = null) {
      attempt(closable.close())
    }
  }

  implicit def closable2managed[T <: Closeable](closable:T): Managed[T] = {
    new ManagedClosable(closable)
  }
and the usage looks like this:
  def readLine() {
    using(new BufferedReader(new FileReader("file.txt"))) {
      file => {
        file.readLine()
      }
    }
  }
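
For completeness, here is a sketch of how the two-resource version might be used (my own example, assuming java.io.FileWriter is imported alongside the classes above):

  def copyFirstLine() {
    using(new BufferedReader(new FileReader("in.txt")), new FileWriter("out.txt")) {
      reader => writer => {
        val line = reader.readLine()
        if (line != null) writer.write(line) // both resources are closed automatically
      }
    }
  }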

Monday, February 4, 2013

Evil of microbenchmarking & CAS performance on Ivy Bridge

Some days back, Martin Thompson published an investigation into the results of the controversial CAS (compare-and-swap) performance test he had made a few months earlier. That investigation really impressed me - it shows how microbenchmarking can go really wrong, even when it is done by such a smart guy.

Just to recap: the test executed several threads which were hammering the CPU with CAS operations. It showed that, on average, CAS on a modern Ivy Bridge processor works significantly slower than on the older Nehalem architecture. After a few months Martin found the reason for such strange behavior, and the amazing thing about it is that the test is slower because Ivy Bridge is actually faster.
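
To give an idea of the shape of such a test, here is a minimal sketch (my own reconstruction, not Martin's actual code) of several threads hammering a single shared counter with CAS:

import java.util.concurrent.atomic.AtomicLong;

public class CasHammer {
    private static final AtomicLong counter = new AtomicLong();
    private static final long ITERATIONS = 100 * 1000 * 1000L;

    public static void main(String[] args) throws InterruptedException {
        final int threads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (long j = 0; j < ITERATIONS / threads; j++) {
                        long v;
                        do {
                            v = counter.get();
                        } while (!counter.compareAndSet(v, v + 1)); // the CAS under test
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        long elapsedMs = (System.nanoTime() - start) / (1000 * 1000);
        System.out.println(counter.get() + " increments in " + elapsedMs + " ms");
    }
}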

To understand why that happens, let's see what's going on when a CAS is executed. Generally speaking, at a high level and in relation to a CPU core, the memory that is going to be written can be in two states: the core either exclusively owns the cache line containing it, or it does not. If it owns the line, then CAS is extremely fast - the core doesn't need to notify other cores to perform the operation. If the core doesn't own it, the situation is very different: the core has to send a request to fetch the cache line in exclusive mode, and such a request requires communication with all other cores. Such negotiation is not fast, but on Ivy Bridge it is much faster than on Nehalem. And because it is faster on Ivy Bridge, a core has less time to perform a run of fast local CAS operations while it owns the cache line, therefore the total throughput is lower.

I suppose there is a very good lesson to be learned here - microbenchmarking can be very tricky and not easy to do properly. Also, the results can easily be interpreted in a wrong way. So, be careful!

Thursday, December 20, 2012

git hangs after "Resolving deltas"

Have had a funny problem with Git. I suppose it's proxy-related. Writing it down because I am sure I will hit the same problem again some time, and I also hope it will help other people who are suffering from it.

As a precondition, I have Git with the following in '.gitconfig':

[http]
proxy=http://user:password@proxy:8080

When I tried to clone a repository, I got this:

$ git clone https://code.google.com/p/caliper/
Cloning into 'caliper'...
remote: Counting objects: 3298, done.
remote: Finding sources: 100% (3298/3298), done.
remote: Total 3298 (delta 1755)
Receiving objects: 100% (3298/3298), 7.14 MiB | 1.94 MiB/s, done.
Resolving deltas: 100% (1755/1755), done.

And then nothing, it just hangs. If you go and have a look, you can see that the files have been downloaded but not unpacked. Like everyone else on the Internet, I have no idea why that happens, but eventually I found a way to get the files out of it.

When it hangs, just kill the process with Ctrl+C and run this command in the repository folder:

$ git fsck
notice: HEAD points to an unborn branch (master)
Checking object directories: 100% (256/256), done.
Checking objects: 100% (3298/3298), done.
notice: No default references
dangling commit 2916d1238ca0f4adecbda580ef4329a649fc777c

Now just merge that dangling commit:

$ git merge 2916d1238ca0f4adecbda580ef4329a649fc777c

and from now on you can enjoy the repository content in any way you want.

Thursday, December 13, 2012

File.setLastModified & File.lastModified

I have observed interesting behavior of the File.lastModified property on Linux. Basically, my problem was that I was incrementing the value of that property by 1 in one thread and monitoring the change in another thread. And apparently no change in the property's value happened - the other thread did not see the increment. After some time trying to make it work, I realized that I have to increment it by at least 1000 to make the change visible.

Wondering why that happens, I had a look at the JDK source code, and this is what I found:

JNIEXPORT jlong JNICALL
Java_java_io_UnixFileSystem_getLastModifiedTime(JNIEnv *env, jobject this,
                                                jobject file)
{
    jlong rv = 0;

    WITH_FIELD_PLATFORM_STRING(env, file, ids.path, path) {
        struct stat64 sb;
        if (stat64(path, &sb) == 0) {
            rv = 1000 * (jlong)sb.st_mtime;
        }
    } END_PLATFORM_STRING(env, path);
    return rv;
}

What happens is that on Linux File.lastModified has 1-second resolution and simply drops the milliseconds. I'm not an expert in Linux programming, so I am not sure whether there is any way to get that time with millisecond resolution on Linux. I assume it should be possible, because setLastModified seems to work as expected - it sets the modification time with millisecond resolution (you can find the source code in 'UnixFileSystem_md.c').
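
A minimal sketch that demonstrates the effect (my own illustration, assuming a Linux machine and a JDK with the implementation above):

import java.io.File;
import java.io.IOException;

public class LastModifiedDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("lastmod", ".tmp");
        long before = f.lastModified(); // already truncated to whole seconds

        f.setLastModified(before + 999);  // bump by less than a second
        System.out.println(f.lastModified() == before); // true - change is invisible

        f.setLastModified(before + 1000); // bump by a full second
        System.out.println(f.lastModified() == before); // false - change is visible

        f.delete();
    }
}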

So, just a nice thing to remember: when you work with files on Linux, you may not see a change in File.lastModified when its value is updated by less than 1000 ms.

Wednesday, October 24, 2012

Effective Concurrency by Herb Sutter

I have never written feedback on events or courses before, but this time I decided to write some. It is about the "Effective Concurrency" course by Herb Sutter. Hopefully this post will help someone get approval to attend that course :)

So, as I have already said, a few weeks back I was lucky enough to attend the "Effective Concurrency" course by Herb Sutter. He is a software architect at Microsoft, where he has been the lead designer of C++/CLI, C++/CX, C++ AMP, and other technologies. He also served for a decade as chair of the ISO C++ standards committee. Many people also know him for his books.

Tuesday, September 11, 2012

Building OpenJDK on Windows

While experimenting with some stuff, I found that it is often useful to have the JDK source code at hand to make some changes, play with it, etc. So I decided to download and compile that beast. It took me quite some time, although my initial thought was that it should be as simple as running the make command :). As you can guess, I found that it's not a trivial task, and to simplify my life in the future, it seemed useful to keep some records of what I was doing.

Saturday, May 19, 2012

Bug in Java Memory Model implementation

Just came across an amazing question on Stack Overflow:

http://stackoverflow.com/questions/10620680/why-volatile-in-java-5-doesnt-synchronize-cached-copies-of-variables-with-main

Basically, the guy there is trying to use "piggybacking" to publish a non-volatile variable, and it doesn't work. Piggybacking is a technique that uses the visibility guarantees of a volatile variable or a monitor to publish non-volatile data; for example, it is used in ConcurrentHashMap#containsValue() and ConcurrentHashMap#containsKey(). The fact that it doesn't work in that case is a bug in Oracle's Java implementation. And that is rather scary - concurrency problems are very hard to identify even on a bug-free JVM, and such bugs in the Memory Model implementation make things much worse. Hopefully that's the only bug related to the JMM and Oracle has good test coverage for such cases.
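
For readers unfamiliar with the technique, here is a minimal sketch of piggybacking (my own illustration, not the code from the question): the volatile write publishes the preceding plain write, and a volatile read on the other side makes it visible.

public class Piggyback {
    private int data;               // plain, non-volatile field
    private volatile boolean ready; // volatile field used as the "carrier"

    public void publish(int value) {
        data = value; // plain write...
        ready = true; // ...made visible by the following volatile write
    }

    public Integer tryRead() {
        if (ready) {      // volatile read; the write above happens-before it
            return data;  // must observe the published value
        }
        return null;      // not published yet
    }
}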

The good news is that this particular problem appears only with C1 (the client HotSpot compiler) and not in all cases. It doesn't happen with C2 (the server compiler, enabled with the "-server" switch). Fortunately, most people run Java on the server side, and there are few client applications which use advanced concurrency features.

For those who want to understand the case better, please follow the link I've provided at the beginning of the post. There is also a very useful thread on "concurrency-interest" with a good explanation of what is going on there: http://cs.oswego.edu/pipermail/concurrency-interest/2012-May/009449.html

Monday, February 6, 2012

What is behind System.nanoTime()?

In the Java world, System.nanoTime() has a very good reputation. There is always some guy who says that it is fast, reliable and, whenever possible, should be used for timings instead of System.currentTimeMillis(). Overall he is not exactly lying - it is not bad at all - but there are some drawbacks a developer should be aware of. Also, although the implementations have a lot in common, these drawbacks are usually platform-specific.

Windows

The functionality is implemented using the QueryPerformanceCounter API, which is known to have some issues: there is a possibility that it can leap forward, and some people report that it can be extremely slow on multiprocessor machines. I spent some time on the net trying to find out how exactly QueryPerformanceCounter works and what it does. There is no clear conclusion on that topic, but there are some posts which give a brief idea of how it works; probably the most useful ones are listed in the references below. Sure, one can find more with a little searching, but the info will be more or less the same.

So, it looks like the implementation uses HPET if it is available; if not, it uses TSC with some kind of synchronization of the value among CPUs. Interestingly, QueryPerformanceCounter promises to return a value which increases with constant frequency. It means that when it uses TSC on several CPUs, it may have difficulties not just because the CPUs may have different TSC values, but also because they may have different frequencies. Keeping all that in mind, Microsoft recommends using SetThreadAffinityMask to stick the thread which calls QueryPerformanceCounter to a single processor, which, obviously, is not happening in the JVM.

Linux

Linux is very similar to Windows, apart from the fact that it is much more transparent (I managed to download the sources :) ). The value is read using clock_gettime with the CLOCK_MONOTONIC flag (for the real men, the source is available in vclock_gettime.c from the Linux source), which uses either TSC or HPET. The only difference from Windows is that Linux does not even try to sync the TSC values read from different CPUs, it just returns them as they are. It means that the value can leap back and jump forward depending on which CPU it is read from. Also, in contrast to Windows, Linux doesn't keep the update frequency constant. On the other hand, this definitely should improve performance.

Solaris

Solaris is simple. I believe that via gethrtime it goes to more or less the same implementation of clock_gettime as Linux does. The difference is that Solaris guarantees that the counter will not leap back (which is possible on Linux), though it is possible that the same value will be returned again. That guarantee, as can be observed from the source code, is implemented using CAS, which requires synchronization with main memory and can be relatively expensive on multiprocessor machines. The same as on Linux, the change rate can vary.

Conclusion

The conclusion is kind of cloudy. A developer has to be aware that the function is not perfect: it can leap back or forward, it may not change monotonically, and its change rate can vary depending on CPU clock speed. Also, it is not as fast as many may think. On my Windows 7 machine, in a single-threaded test it is just about 10% faster than System.currentTimeMillis(); in a multi-threaded test, where the number of threads is the same as the number of CPUs, it is just the same. And on an IBM Z400 workstation with WinXP, System.nanoTime() is approximately 8 times slower.
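
For the curious, here is a minimal sketch of the kind of single-threaded test I mean (my own illustration, not the exact test; treat the numbers as rough indications only, since this is exactly the sort of microbenchmark one should be wary of):

public class NanoVsMillis {
    private static final int N = 50 * 1000 * 1000;

    public static void main(String[] args) {
        long sink = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) sink += System.currentTimeMillis();
        long millisCost = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < N; i++) sink += System.nanoTime();
        long nanoCost = System.nanoTime() - t0;

        System.out.println("currentTimeMillis: " + (millisCost / N) + " ns/call");
        System.out.println("nanoTime:          " + (nanoCost / N) + " ns/call");
        System.out.println(sink); // keep the JIT from eliminating the loops
    }
}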

So, overall, all it gives is an increase in resolution, which may be important in some cases. And as a final note: even when the CPU frequency is not changing, do not think that you can reliably map that value to the system clock, see the details here (this example describes just Windows, but more or less the same stuff is applicable to all other OSes).

Appendix

The appendix contains the implementations of the function for the different OSes. The source code is from OpenJDK 7.

Solaris

// gethrtime can move backwards if read from one cpu and then a different cpu
// getTimeNanos is guaranteed to not move backward on Solaris
inline hrtime_t getTimeNanos() {
  if (VM_Version::supports_cx8()) {
    const hrtime_t now = gethrtime();
    // Use atomic long load since 32-bit x86 uses 2 registers to keep long.
    const hrtime_t prev = Atomic::load((volatile jlong*)&max_hrtime);
    if (now <= prev)  return prev;   // same or retrograde time;
    const hrtime_t obsv = Atomic::cmpxchg(now, (volatile jlong*)&max_hrtime, prev);
    assert(obsv >= prev, "invariant");   // Monotonicity
    // If the CAS succeeded then we're done and return "now".
    // If the CAS failed and the observed value "obs" is >= now then
    // we should return "obs".  If the CAS failed and now > obs > prv then
    // some other thread raced this thread and installed a new value, in which case
    // we could either (a) retry the entire operation, (b) retry trying to install now
    // or (c) just return obs.  We use (c).   No loop is required although in some cases
    // we might discard a higher "now" value in deference to a slightly lower but freshly
    // installed obs value.   That's entirely benign -- it admits no new orderings compared
    // to (a) or (b) -- and greatly reduces coherence traffic.
    // We might also condition (c) on the magnitude of the delta between obs and now.
    // Avoiding excessive CAS operations to hot RW locations is critical.
    // See http://blogs.sun.com/dave/entry/cas_and_cache_trivia_invalidate
    return (prev == obsv) ? now : obsv ;
  } else {
    return oldgetTimeNanos();
  }
}

Linux

jlong os::javaTimeNanos() {
  if (Linux::supports_monotonic_clock()) {
    struct timespec tp;
    int status = Linux::clock_gettime(CLOCK_MONOTONIC, &tp);
    assert(status == 0, "gettime error");
    jlong result = jlong(tp.tv_sec) * (1000 * 1000 * 1000) + jlong(tp.tv_nsec);
    return result;
  } else {
    timeval time;
    int status = gettimeofday(&time, NULL);
    assert(status != -1, "linux error");
    jlong usecs = jlong(time.tv_sec) * (1000 * 1000) + jlong(time.tv_usec);
    return 1000 * usecs;
  }
}

Windows

jlong os::javaTimeNanos() {
  if (!has_performance_count) {
    return javaTimeMillis() * NANOS_PER_MILLISEC; // the best we can do.
  } else {
    LARGE_INTEGER current_count;
    QueryPerformanceCounter(&current_count);
    double current = as_long(current_count);
    double freq = performance_frequency;
    jlong time = (jlong)((current/freq) * NANOS_PER_SEC);
    return time;
  }
}

References

Inside the Hotspot VM: Clocks, Timers and Scheduling Events
Beware of QueryPerformanceCounter()
Implement a Continuously Updating, High-Resolution Time Provider for Windows
Game Timing and Multicore Processors
High Precision Event Timer (Wikipedia)
Time Stamp Counter (Wikipedia)