The Java(tm) Specialists' Newsletter: Sorting Lists

Datetime:2016-08-23 00:36:19          Topic: Java8           Share

The Java Specialists' Newsletter

Issue 2392016-07-19Category:Tips and Tricks Java version: 8

Sorting Lists

by Dr. Heinz M. Kabutz

Abstract:

List has a new method sort(Comparable). This is handy, as it allows implementations to specialize how to sort their internal data structures. Vector only synchronizes once per list, not once per element. ArrayList avoids copying the array unnecessarily. CopyOnWriteArrayList works.

Welcome to the 239th edition of The Java(tm) Specialists' Newsletter , sent to you from the beautiful Island of Crete. I am sitting on my balcony under the full moon, listening to the rural sounds of our sheep "Darling" wandering around ringing her bell, no doubt searching for the last green tuft of grass. She has a name. This means we cannot eat her or her offspring up to the tenth generation.

Last Friday we had a watermelon festival in our village of Chorafakia. The concept is brilliant. For 5 EUR, you get a small bottle of tsikoudia (grappa) to share between 4 and as much watermelon as you can eat. And of course you can buy drinks and grilled meats. Live Cretan music, dancers and a jolly good time. Fantastic idea and I will definitely go again next year. I might even volunteer to help with the braaing. After all, I might be Chorafakianos now, but I was born and raised South African.

NEW: Please see our new "Extreme Java" course, combining concurrency, a little bit of performance and Java 8. Extreme Java - Concurrency & Performance for Java 8 .

Sorting Lists

In older versions of Java, we typically used Collections.sort(List) to sort a list of objects. Some lists implement the RandomAccess interface, indicating that we can access any member of the list at a random position in constant time. An example of such a list is ArrayList. LinkedList, on the other hand, is not. A lookup has O(n) time complexity. The Collections.sort(List) method didn't consider whether a List implemented RandomAccess, but instead always converted it to an Array, then sorted it with Arrays.sort(Object[]) and later wrote it back into the list with the set() method on the ListIterator. This way we could sort lists that had O(n) time complexity on lookup:

public static <T> void sort(List<T> list) {
  Object[] a = list.toArray();
  Arrays.sort(a);
  ListIterator i = list.listIterator();
  for (int j=0; j<a.length; j++) {
    i.next();
    i.set(a[j]);
  }
}

Internally, the Arrays.sort() method also creates temporary arrays in relation to the size of the input data. We thus have two places where temporary arrays are made: Collections.sort() and Arrays.sort(). Prior to Java 7 we had Merge Sort, which created an array exactly the size of the input array. Since Java 7 we have Tim Sort, which creates some additional temporary arrays and thus bumps up the total garbage created a little bit.

In Java 8, the List interface now contains the method sort(Comparator). The default implementation looks the same as the Java 7 method Collections.sort(List, Comparator). However, since it is in the interface, we are now able to specialize it in the List implementations. This means that ArrayList no longer needs to copy its contents into an array, but can sort directly. CopyOnWriteArrayList can sort its contents without throwing an UnsupportedOperation exception on the set(). Even Vector has been enhanced to synchronize across the entire sort() and not on every set() method. Life is good.

However, an astute computer scientist would quickly realize that object creation is not the bottleneck in the sort() method. Rather, merging and doing the sorting takes the bulk of the time. We can observe this by looking at the object creation rate. On my machine, I can create upwards of 4 GB per second. Here we are creating only about 12 MB per second. We will observe the object creation only out of curiosity.

I'd like to give a shout-out to our subscriber Michael Inden, who wrote a nice book Java 8 - Die Neuerungen (in German). I knew that List had a sort() method, but learned through his book about the different specializations in the List implementations. Clever.

In order to test the object creation, I simplified theByteWatcher:

import javax.management.*;
import java.lang.management.*;

/**
 * Reduced version of the ByteWatcher, described here:
 * http://www.javaspecialists.eu/archive/Issue232.html
 */
public class ByteWatcher {
  private static final String GET_THREAD_ALLOCATED_BYTES =
      "getThreadAllocatedBytes";
  private static final String[] SIGNATURE =
      new String[]{long.class.getName()};
  private static final MBeanServer mBeanServer;
  private static final ObjectName name;

  private final Object[] PARAMS;
  private final long MEASURING_COST_IN_BYTES; // usually 336
  private final long tid;

  private long allocated = 0;

  static {
    try {
      name = new ObjectName(
          ManagementFactory.THREAD_MXBEAN_NAME);
      mBeanServer = ManagementFactory.getPlatformMBeanServer();
    } catch (MalformedObjectNameException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  public ByteWatcher() {
    this.tid = Thread.currentThread().getId();
    PARAMS = new Object[]{tid};

    long calibrate = threadAllocatedBytes();
    // calibrate
    for (int repeats = 0; repeats < 10; repeats++) {
      for (int i = 0; i < 10_000; i++) {
        // run a few loops to allow for startup anomalies
        calibrate = threadAllocatedBytes();
      }
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    MEASURING_COST_IN_BYTES = threadAllocatedBytes() - calibrate;
    reset();
  }

  public void reset() {
    allocated = threadAllocatedBytes();
  }

  private long threadAllocatedBytes() {
    try {
      return (long) mBeanServer.invoke(
          name,
          GET_THREAD_ALLOCATED_BYTES,
          PARAMS,
          SIGNATURE
      );
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /**
   * Calculates the number of bytes allocated since the last
   * reset().
   */
  public long calculateAllocations() {
    long mark1 = ((threadAllocatedBytes() -
        MEASURING_COST_IN_BYTES) - allocated);
    return mark1;
  }
}

Next I grabbed a little utility Memory that I wrote for the exercises in my Extreme Java - Advanced Topics Course . It is similar to the TimeUnit enum in structure. We use it for formatting the bytes allocated into numbers that are easier to read:

public enum Memory {
  BYTES {
    public double toBytes(double d) {
      return d;
    }

    public double toKiloBytes(double d) {
      return toBytes(d) / 1024;
    }

    public double toMegaBytes(double d) {
      return toKiloBytes(d) / 1024;
    }

    public double toGigaBytes(double d) {
      return toMegaBytes(d) / 1024;
    }

    public double toTeraBytes(double d) {
      return toGigaBytes(d) / 1024;
    }
  },
  KILOBYTES {
    public double toBytes(double d) {
      return toKiloBytes(d) * 1024;
    }

    public double toKiloBytes(double d) {
      return d;
    }

    public double toMegaBytes(double d) {
      return toKiloBytes(d) / 1024;
    }

    public double toGigaBytes(double d) {
      return toMegaBytes(d) / 1024;
    }

    public double toTeraBytes(double d) {
      return toGigaBytes(d) / 1024;
    }
  },
  MEGABYTES {
    public double toBytes(double d) {
      return toKiloBytes(d) * 1024;
    }

    public double toKiloBytes(double d) {
      return toMegaBytes(d) * 1024;
    }

    public double toMegaBytes(double d) {
      return d;
    }

    public double toGigaBytes(double d) {
      return toMegaBytes(d) / 1024;
    }

    public double toTeraBytes(double d) {
      return toGigaBytes(d) / 1024;
    }
  },
  GIGABYTES {
    public double toBytes(double d) {
      return toKiloBytes(d) * 1024;
    }

    public double toKiloBytes(double d) {
      return toMegaBytes(d) * 1024;
    }

    public double toMegaBytes(double d) {
      return toGigaBytes(d) * 1024;
    }

    public double toGigaBytes(double d) {
      return d;
    }

    public double toTeraBytes(double d) {
      return toGigaBytes(d) / 1024;
    }
  },
  TERABYTES {
    public double toBytes(double d) {
      return toKiloBytes(d) * 1024;
    }

    public double toKiloBytes(double d) {
      return toMegaBytes(d) * 1024;
    }

    public double toMegaBytes(double d) {
      return toGigaBytes(d) * 1024;
    }

    public double toGigaBytes(double d) {
      return toTeraBytes(d) * 1024;
    }

    public double toTeraBytes(double d) {
      return d;
    }
  };

  public abstract double toBytes(double d);

  public abstract double toKiloBytes(double d);

  public abstract double toMegaBytes(double d);

  public abstract double toGigaBytes(double d);

  public abstract double toTeraBytes(double d);

  public static String format(double d, Memory unit,
                              int decimals) {
    String unitStr;
    double val;
    double bytes = unit.toBytes(d);
    if (bytes < 1024) {
      val = bytes;
      unitStr = "B";
    } else if (bytes < 1024 * 1024) {
      val = BYTES.toKiloBytes(bytes);
      unitStr = "KB";
    } else if (bytes < 1024 * 1024 * 1024) {
      val = BYTES.toMegaBytes(bytes);
      unitStr = "MB";
    } else if (bytes < 1024 * 1024 * 1024 * 1024L) {
      val = BYTES.toGigaBytes(bytes);
      unitStr = "GB";
    } else {
      val = BYTES.toTeraBytes(bytes);
      unitStr = "TB";
    }
    return String.format("%." + decimals + "f%s", val, unitStr);
  }
}

Lastly, the actual test class called ListSorting. We are trying out 5 different List implementations: ArrayList, LinkedList, Vector, CopyOnWriteArrayList and the list that is returned by Arrays.asList(...). Each of the lists contains Double objects. We start by generating a stream of primitive doubles using ThreadLocalRandom, box them and collect them into a List. This becomes the list that we want to sort.

The test() method takes as a parameter a way to construct the list (e.g. ArrayList::new or LinkedList::new). Each list has to have a constructor that accepts a list as a parameter. We also provide the jumbled list to the test() method. This way, our test() method can try out different ways of sorting the lists, old and new.

import java.io.*;
import java.lang.management.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.function.*;
import java.util.stream.*;

public class ListSorting {
  private static final ByteWatcher byteWatcher =
      new ByteWatcher();

  public static void main(String... args) throws IOException {
    for (int i = 0; i < 10; i++) {
      testAll();
      System.out.println();
    }
  }

  private static void testAll() {
    for (int size = 100_000; size <= 10_000_000; size *= 10) {
      List<Double> jumble =
          ThreadLocalRandom.current()
              .doubles(size)
              .boxed()
              .collect(Collectors.toList());
      test(ArrayList::new, jumble);
      test(LinkedList::new, jumble);
      test(Vector::new, jumble);
      test(CopyOnWriteArrayList::new, jumble);
      test(doubles ->
          Arrays.asList(
              jumble.stream().toArray(Double[]::new)
          ), jumble);
    }
  }

  private static void test(
      UnaryOperator<List<Double>> listConstr,
      List<Double> list) {
    sortOld(listConstr.apply(list));
    sortNew(listConstr.apply(list));
  }

  private static void sortOld(List<Double> list) {
    measureSort("Old", list, () -> sort(list));
  }

  private static void sortNew(List<Double> list) {
    measureSort("New", list, () -> list.sort(null));
  }

  private final static ThreadMXBean tmbean =
      ManagementFactory.getThreadMXBean();

  private static void measureSort(String type,
                                  List<Double> list,
                                  Runnable sortJob) {
    try {
      long time = tmbean.getCurrentThreadUserTime();
      byteWatcher.reset();
      sortJob.run();
      long bytes = byteWatcher.calculateAllocations();
      time = tmbean.getCurrentThreadUserTime() - time;
      time = TimeUnit.MILLISECONDS.convert(
          time, TimeUnit.NANOSECONDS);
      System.out.printf(
          "%s sort %s %,3d in %dms and bytes %s%n",
          type,
          list.getClass().getName(),
          list.size(),
          time,
          Memory.format(bytes, Memory.BYTES, 2));
    } catch (UnsupportedOperationException ex) {
      System.out.println("Old sort: Cannot sort " +
          list.getClass().getName() + " " + ex);
    }
  }

  /**
   * {@linkplain java.util.Collections#sort Copied from Java 7}
   */
  @SuppressWarnings({"unchecked", "rawtypes"})
  public static <E> void sort(List<E> list) {
    Object[] a = list.toArray();
    Arrays.sort(a);
    ListIterator<E> i = list.listIterator();
    for (Object e : a) {
      i.next();
      i.set((E) e);
    }
  }
}

LinkedList uses the default sorting method from Java 7, which now lives in the default sort() method inside the List interface. Most of the List implementations, such as ArrayList, Arrays.ArrayList, Vector and CopyOnWriteArrayList, have overridden this with more efficient versions. When you run the code, you will notice that ArrayList allocates less bytes than before, although there isn't a huge difference in CPU time. You will also see that CopyOnWriteArrayList can now be sorted, whereas before you got an UnsupportedOperationException.

Sorting Like a Boss in Java 8

In Java 8, we have an Arrays.parallelSort(), but no analogous Collections.parallelSort(). But we can achieve our aim of sorting nirvana with parallel streams. In my parallelSort() method, we specify a list and a Function that will be used to create the correct type of List. R is the result type, for example ArrayList<E> or LinkedList<E>. It has to be a subtype of List<E>. We then convert the list to a parallel stream, sort that, and collect it to a list. The result is an ArrayList, which we then convert with the listConstructor Function.

import java.util.*;
import java.util.function.*;
import java.util.stream.*;

public class SortingLikeABoss {
  public static <E, R extends List<E>> R parallelSort(
      List<E> list,
      Function<List<E>, R> listConstructor) {
    return listConstructor.apply(
        list.parallelStream()
            .sorted()
            .collect(Collectors.toList()));
  }
}

Here is an example where we generate three different types of lists using our parallelSort() method:

import org.junit.Test;

import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

import static org.junit.Assert.assertEquals;

public class SortingLikeABossTest {
  @Test
  public void testBossSorting() {
    List<Double> jumble =
        ThreadLocalRandom.current()
            .doubles(1_000_000)
            .parallel()
            .boxed()
            .collect(Collectors.toList());

    List<Double> sorted = new ArrayList<>(jumble);
    Collections.sort(sorted);

    ArrayList<Double> al = SortingLikeABoss.parallelSort(
        jumble, ArrayList::new
    );
    assertEquals(sorted, al);

    LinkedList<Double> ll = SortingLikeABoss.parallelSort(
        jumble, LinkedList::new
    );
    assertEquals(sorted, ll);

    CopyOnWriteArrayList<Double> cowal =
        SortingLikeABoss.parallelSort(
            jumble, CopyOnWriteArrayList::new
        );
    assertEquals(sorted, cowal);
  }
}

There is a catch. The current Java 8 implementation of parallel sorting only works if at least one thread in the common fork join pool is available. It won't default to just using the submitting thread to do the sorting, like you would see with other parallel tasks. I think this is a bug, not a feature :-) I've submitted it, but I don't think it will be fixed soon. It is not serious. We are not supposed to block up all the common threads anyway ...

Thanks for reading my newsletter. I hope you had as much fun as I had writing it :-)

Kind regards

Heinz





About List