Streams in Java – a Hands-on Approach

Datetime: 2016-08-23 05:13:49          Topic: Java, Functional Programming

In this blog post, I will discuss the Streams feature in Java 8 (and above) from a pragmatic viewpoint. I will describe it and its uses as I have experienced them myself.

What is this Stream business after all?

Some people new to Java often get confused by the term “stream”. Java (before version 8) already used “streams” to mean essentially the same thing that C and C++ used the term for – a conceptual stream of data – whether it be bytes or characters. This is why there is a plethora of interfaces and classes in the java.io package that have “stream” in their names. However, this is not what the new Streams API is all about.

In Java 8, a number of features were introduced that enhanced support for Functional Programming in Java – lambdas, method references, and streams. A Stream in this new context is simply a way of streaming data through a set of operations that process the data at every stage. It does not really matter how the data came to be in the first place – it might have been read from a file, a socket, a String, a database, etc. What matters is that the data can be passed through different functional operations and then collected into another data structure, or printed to some output (console, file, database, another stream). The important part is that the data at every stage is preserved – there is no modification of the original data structure (as will be demonstrated through examples later on). This is why streams support functional programming concepts and make for very terse and (for those familiar with the style) readable code. Of course, there are pros and cons to this new feature.

Streams therefore operate at a much higher conceptual level than merely processing data through methods.
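
To make that concrete, here is a minimal sketch of such a pipeline (the class and method names here are mine, purely for illustration): data flows through filter and map stages and is collected into a new list, while the source list remains untouched.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Keep the even numbers, double each survivor, collect into a new list.
    static List<Integer> doubledEvens(List<Integer> source) {
        return source.stream()
                     .filter(n -> n % 2 == 0) // keep even numbers only
                     .map(n -> n * 2)         // double each one
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(doubledEvens(data)); // [4, 8]
        System.out.println(data);               // [1, 2, 3, 4, 5] -- source unchanged
    }
}
```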

The Hitchhiker’s Guide to the Streams API in Java

The Stream API is specified in the java.util.stream package. There are a few very important interfaces defined here:

  • Stream
  • IntStream
  • LongStream
  • DoubleStream
  • Collector

Of these, we don’t really make use of Stream and Collector all that much directly. Instead, we use them through classes that implement or supply these interfaces. Streaming support is available on all the major linear collection interfaces – List, Set, Queue, and Deque. Note that this support is provided by virtue of the fact that the Collection interface has a “default method”, default Stream<E> stream(), along with its parallel counterpart, default Stream<E> parallelStream(). The reason (as mentioned in an earlier post) that the Collection interface itself doesn’t extend the Stream interface is to ensure that legacy code is not broken. By their very nature, streams are not implemented for non-linear data structures such as Map. Only linear data structures have a well-defined way of streaming their data for processing by various operations.
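
As a side note, while Map itself exposes no stream() method, its linear collection views (entrySet, keySet, values) do – which is how map data is usually streamed in practice. A small sketch (the class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class MapStreamSketch {
    // Stream a Map through its entrySet view: keep entries whose value
    // exceeds the threshold, then collect the surviving keys into a Set.
    static Set<String> keysAbove(Map<String, Integer> scores, int threshold) {
        return scores.entrySet().stream()
                     .filter(e -> e.getValue() > threshold)
                     .map(Map.Entry::getKey)
                     .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = new HashMap<>();
        scores.put("Pat", 42);
        scores.put("Mike", 17);
        System.out.println(keysAbove(scores, 20)); // [Pat]
    }
}
```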

The available classes in this package are:

  • Collectors
  • StreamSupport

Of these two, we will only ever be concerned with Collectors. StreamSupport provides low-level APIs that library writers can utilise to create their own versions of Streams.

The Collectors class is an extremely important class whose methods return Collector objects. Further processing is then done by methods of the Stream interface, such as “collect”. This will become much clearer when we get down to examples.
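
As a quick preview (class and method names here are illustrative), a Collector produced by a Collectors factory method is handed to the stream’s terminal collect() method, which uses it to assemble the final result:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CollectorsSketch {
    // Collectors.joining returns a Collector that concatenates the
    // stream's strings with a delimiter, prefix, and suffix.
    static String joined(List<String> words) {
        return words.stream().collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(joined(Arrays.asList("a", "b", "c"))); // [a, b, c]
    }
}
```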

Examples of Streams in action

So let’s get down and dirty with it! First off, let’s start up a JShell session and get some of the basic imports that we need for the demo out of the way:

jshell> import java.util.function.*

jshell> import java.util.stream.*

jshell> /imports
|    import java.io.*
|    import java.math.*
|    import java.net.*
|    import java.util.concurrent.*
|    import java.util.prefs.*
|    import java.util.regex.*
|    import java.util.function.*
|    import java.util.stream.*
|    import java.util.*

As we can see, many of the most commonly used packages are already imported by default in JShell. However, the java.util.function and java.util.stream packages need to be explicitly imported.

Now that we’ve got that out of the way, let’s start off with some simple examples.

  • Let’s create a list of names, keep only those names which are longer than 3 characters, convert them to upper case, collect them into a set (removing duplicates), and then print them all out:

    jshell> List<String> names = 
    Arrays.asList("Pat", "Mike", "Sally", "Robert", "Sam", "Matt", "Timmy", "Gennady", "Petr", "Slava", "Zach", "Paula", "Meg", "Matt", "Mike");
    names ==> [Pat, Mike, Sally, Robert, Sam, Matt, Timmy, Gennady, Petr, Slava, Zach, Pau ...
    
    jshell> names.stream().
       ...> filter((s) -> s.length() > 3).
       ...> map(String::toUpperCase).
       ...> collect(Collectors.toSet()).
       ...> forEach(System.out::println)
    GENNADY
    MIKE
    ZACH
    TIMMY
    PETR
    SLAVA
    ROBERT
    MATT
    SALLY
    PAULA

    Note that String::toUpperCase is what is known as a “method reference”. Method references can be used wherever an instance of a compatible Functional Interface is expected. In this case, String::toUpperCase, as used in the context of the “map” operation, is equivalent to the following lambda expression:

    map((s) -> s.toUpperCase())

    It is simply more convenient and terse to use method references instead of lambda expressions wherever possible, especially for well-known methods such as this one.
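
    A small sketch of this equivalence (the class and helper names are mine, for illustration): both forms can be assigned to the same java.util.function.Function target and behave identically.

```java
import java.util.function.Function;

public class MethodRefSketch {
    // A tiny helper that applies whichever Function it is handed.
    static String applyTo(Function<String, String> f, String s) {
        return f.apply(s);
    }

    public static void main(String[] args) {
        Function<String, String> viaLambda = s -> s.toUpperCase();
        Function<String, String> viaRef    = String::toUpperCase;
        System.out.println(applyTo(viaLambda, "stream")); // STREAM
        System.out.println(applyTo(viaRef, "stream"));    // STREAM
    }
}
```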

    And just to emphasise the point that the actual data structure itself has not been modified:

    jshell> names.forEach(System.out::println)
    Pat
    Mike
    Sally
    Robert
    Sam
    Matt
    Timmy
    Gennady
    Petr
    Slava
    Zach
    Paula
    Meg
    Matt
    Mike

    The same code before Java 8 might look something like so:

    jshell> interface FilterPredicate<T> {
       ...>          boolean test(T t);
       ...>      }
    |  created interface FilterPredicate
    
    jshell>      
    
    jshell>      interface MapFunction<T, R> {
       ...>          R apply(T t);
       ...>      }
    |  created interface MapFunction
    
    jshell>
    
    jshell> List<String> filterNames(List<String> names, FilterPredicate<String> predicate) {
       ...>          List<String> filteredNames = new ArrayList<>();
       ...>          
       ...>          for (String name : names) {
       ...>              if (predicate.test(name))
       ...>                 filteredNames.add(name);
       ...>          }
       ...>          return filteredNames;
       ...>      }
    |  created method filterNames(List<String>,FilterPredicate<String>)
    
    
    
    jshell>
    
    jshell>      List<String> mapNames(List<String> names, MapFunction<String, String> mapper) {
       ...>          List<String> mappedNames = new ArrayList<>();
       ...>          
       ...>          for (String name : names) {
       ...>              mappedNames.add(mapper.apply(name));
       ...>          }
       ...>          
       ...>          return mappedNames;
       ...>      }
    |  created method mapNames(List<String>,MapFunction<String, String>)
    
    jshell>      
    
    jshell>      Set<String> collectNames(List<String> names) {
       ...>          Set<String> uniqueNames = new HashSet<>();
       ...>          
       ...>          for (String name : names) {
       ...>              uniqueNames.add(name);
       ...>          }
       ...>          
       ...>          return uniqueNames;
       ...>      }
    |  created method collectNames(List<String>)
    
    jshell> Set<String> processedNames = 
       ...>             collectNames(mapNames(filterNames(names, 
       ...>                                new FilterPredicate<String>() {
       ...>                     @Override
       ...>                     public boolean test(String s) {
       ...>                         return s.length() > 3;
       ...>                     }}), new MapFunction<String, String>() {
       ...>                         @Override
       ...>                         public String apply(String name) {
       ...>                             return name.toUpperCase();
       ...>                         }
       ...>                     }));
    processedNames ==> [GENNADY, MIKE, ZACH, TIMMY, PETR, SLAVA, ROBERT, MATT, SALLY, PAULA]
    
    jshell>                     
    
    jshell>      for (String name : processedNames) {
       ...>         System.out.println(name);
       ...>      } 
    GENNADY
    MIKE
    ZACH
    TIMMY
    PETR
    SLAVA
    ROBERT
    MATT
    SALLY
    PAULA

    Wow! So much code (and a lot of it boilerplate, especially the repetitive for-each looping) to achieve what took essentially one line using Java Streams! Also, from a readability perspective, I would argue that the Streams-based version is much more readable and understandable. The boilerplate in the non-Streams version disrupts one’s flow when reading the code – instead of focusing on the task at hand, one is distracted by all the language-specific cruft. Also note that the pre-Java 8 version of the code was written in a somewhat different way compared to what a lot of Java developers would do. If we had written the code in a purely imperative manner, it would be at least twice as long, with a lot more cognitive overhead to boot.

  • For the second example, let us generate an infinite stream of natural numbers, take the first 100 numbers, keep only the even ones, double each of them, and finally generate their sum.

    jshell> IntStream.iterate(1, (n) -> n+1).
       ...> limit(100).
       ...> filter((d) -> d%2 == 0).
       ...> map((i) -> i*2).
       ...> sum()
    $29 ==> 5100

    How much code would that take without using Streams? Heh. Some explanation of the code: we use the “iterate” method of the IntStream interface to generate an infinite stream of natural numbers, using 1 as the seed value and the second parameter (the lambda expression) as a sort of generator function. Then we use “limit” to take the first 100 values, “filter” to keep only the even numbers, “map” each surviving value to double its value, and finally call the terminal operation “sum” to produce the sum of the processed stream of numbers.

    At this juncture, it would probably be appropriate to note that there are two types of operations when it comes to streams – non-terminal operations (which take values and produce processed values), and terminal operations (which simply consume values and generate a final result). “limit”, “filter”, and “map” are examples of the former, whereas “sum” is, as noted, a terminal operation. Knowing this distinction can save a lot of headache when code seemingly doesn’t work as expected.
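
    One consequence of this distinction is that non-terminal operations are lazy: no element flows through the pipeline until a terminal operation is invoked. A minimal sketch of this (class and method names are mine, for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazySketch {
    // Count how many times the filter predicate actually runs,
    // depending on whether a terminal operation is invoked.
    static int countFilterCalls(boolean runTerminalOp) {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pipeline = Stream.of(1, 2, 3)
            .filter(n -> { calls.incrementAndGet(); return n > 1; });
        if (runTerminalOp) {
            pipeline.forEach(n -> { }); // terminal op triggers the traversal
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println(countFilterCalls(false)); // 0 -- nothing has run yet
        System.out.println(countFilterCalls(true));  // 3 -- every element visited
    }
}
```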

  • For the final example, let us group together a bunch of students by grade:

    jshell> enum Grade {
       ...>         A, B, C, D, E, F;
       ...>     }
    |  created enum Grade
    
    jshell>
    
    jshell>     class Student {
       ...>         private int id;
       ...>         private String name;
       ...>         private Grade grade;
       ...>         
       ...>         public Student(int id, String name, Grade grade) {
       ...>             this.id = id;
       ...>             this.name = name;
       ...>             this.grade = grade;
       ...>         }
       ...>         
       ...>         public int getId() { return this.id; }
       ...>         public String getName() { return this.name; }
       ...>         public Grade getGrade() { return this.grade; }
       ...>         
       ...>         @Override
       ...>         public String toString() {
       ...>             return "{ " + this.id + ", " + this.name + ", " + this.grade + " }"; 
       ...>         }
       ...>     }
    |  created class Student
    
    jshell>     
    
    jshell>     List<Student> students = 
       ...>     Arrays.asList(new Student(1, "Rich", Grade.F),
       ...>                   new Student(2, "Peter", Grade.A),
       ...>                   new Student(3, "Sally", Grade.B),
       ...>                   new Student(4, "Slava", Grade.B),
       ...>                   new Student(5, "Megan", Grade.C),
       ...>                   new Student(6, "Edward", Grade.D),
       ...>                   new Student(7, "Amanda", Grade.A),
       ...>                   new Student(8, "Petr", Grade.B),
       ...>                   new Student(9, "Susan", Grade.F),
       ...>                   new Student(10, "Arnold", Grade.E));
    students ==> [{ 1, Rich, F }, { 2, Peter, A }, { 3, Sally, B }, { 4, Slava, B }, { 5, Meg …
    
    jshell>
    
    jshell> Map<Grade, List<Student>> studentData = 
       ...>             students.
       ...>             parallelStream().
       ...>             collect(Collectors.
       ...>                     groupingBy(Student::getGrade))
    studentData ==> {A=[{ 2, Peter, A }, { 7, Amanda, A }], E=[{ 10, Arnold, E }], B=[{ 3, Sally ...
    
    jshell>             
    
    jshell> for (Map.Entry<Grade, List<Student>> entry : studentData.entrySet()) {
       ...>          System.out.println("Grade: " + 
       ...>                             entry.getKey() + 
       ...>                             ", Students: " + 
       ...>                             entry.getValue());
       ...>     }
    Grade: A, Students: [{ 2, Peter, A }, { 7, Amanda, A }]
    Grade: E, Students: [{ 10, Arnold, E }]
    Grade: B, Students: [{ 3, Sally, B }, { 4, Slava, B }, { 8, Petr, B }]
    Grade: F, Students: [{ 1, Rich, F }, { 9, Susan, F }]
    Grade: C, Students: [{ 5, Megan, C }]
    Grade: D, Students: [{ 6, Edward, D }]

    As can be seen, the Collectors class comes bundled with extremely useful methods to perform almost any conceivable aggregation of data. In this case, we’re simply grouping the students by their grades.

    Also note the use of “parallelStream”. Java supports parallel streams for operations that can be parallelised (i.e. have no real dependencies on one another). Since we are processing a bunch of student data and aggregating the students into groups by grade, this is precisely the kind of situation where we can achieve performance boosts using parallel streams, especially as the data sets grow in size.
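
    As a small sketch of this idea (class and method names are illustrative): because addition is associative and each element is independent, the same reduction can run sequentially or in parallel and produce an identical result.

```java
import java.util.stream.LongStream;

public class ParallelSketch {
    // Sum 1..n on a single thread.
    static long sumSequential(long n) {
        return LongStream.rangeClosed(1, n).sum();
    }

    // Sum 1..n using the common fork-join pool; safe because
    // summation has no dependencies between elements.
    static long sumParallel(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        System.out.println(sumSequential(1_000_000)); // 500000500000
        System.out.println(sumParallel(1_000_000));   // 500000500000
    }
}
```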

Some things to watch out for

  • Order of operations:

    Since streams provide a very high-level way to process data, a lot of the inner details get hidden from the developer. However, understanding some of these low-level details is crucial when working with streams. The most important part is knowing how the flow of processing takes place in a stream.

    Suppose we want to find the sum of all odd numbers (doubled) between 1 and 10^6. We could do it like this:

    jshell> OptionalLong sum = 
       ...>             LongStream.iterate(1, (n) -> n+1).
       ...>                       limit(1_000_000).
       ...>                       map((b) -> b*2).
       ...>                       filter((i) -> i%2 != 0).
       ...>                       reduce((one, two) -> one + two);
    sum ==> OptionalLong.empty
    
    jshell> sum.getAsLong()
    |  java.util.NoSuchElementException thrown: No value present
    |        at OptionalLong.getAsLong (OptionalLong.java:119)
    |        at (#28:1)

    Why does this not work? Well, it’s because we map each number to double its value first, thereby ensuring that all the numbers are even before the filter ever sees them! Let’s swap map and filter around to see that it then works:

    jshell> OptionalLong sum = 
       ...>             LongStream.iterate(1, (n) -> n+1).
       ...>                       limit(1_000_000).
       ...>                       filter((i) -> i%2 != 0).
       ...>                       map((b) -> b*2).
       ...>                       reduce((one, two) -> one + two);
    sum ==> OptionalLong[500000000000]
    
    jshell> sum.getAsLong()
    $30 ==> 500000000000

    Et voilà! This shows how important it is to get the ordering of non-terminal operations right, since mistakes result in logical errors which cannot be caught by the compiler.

    Now, let’s flip the same example around: suppose we want to calculate the sum of even numbers (incremented by 2) from 1 to 10^9. Note that in this case, it doesn’t matter whether we map first and then filter, or filter first and then map. The result is the same (adding 2 to an odd number always produces another odd number, and likewise for even numbers). What about performance? Let’s see:

    Let’s map and then filter:

    jshell>     void sumOfEvenNumbersSlow() {
       ...>         long start = System.currentTimeMillis();
       ...>         
       ...>         OptionalLong sum = 
       ...>             LongStream.iterate(1, (n) -> n+1).
       ...>                       limit(1_000_000_000).
       ...>                       map((b) -> b+2).
       ...>                       filter((i) -> i%2 == 0).
       ...>                       reduce((one, two) -> one + two);
       ...>         
       ...>         long end = System.currentTimeMillis();
       ...>         System.out.format("Sum: %d, time taken = %.3fs\n", sum.getAsLong(), (double)(end-start)/1000);
       ...>     }
    |  created method sumOfEvenNumbersSlow()
    
    jshell>
    
    jshell>     sumOfEvenNumbersSlow()
    Sum: 250000001500000000, time taken = 24.378s

    Now let’s filter first and then map:

    jshell>     void sumOfEvenNumbersFast() {
       ...>         long start = System.currentTimeMillis();
       ...>         
       ...>         OptionalLong sum = 
       ...>             LongStream.iterate(1, (n) -> n+1).
       ...>                       limit(1_000_000_000).
       ...>                       filter((i) -> i%2 == 0).
       ...>                       map((b) -> b+2).
       ...>                       reduce((one, two) -> one + two);
       ...>         
       ...>         long end = System.currentTimeMillis();
       ...>         System.out.format("Sum: %d, time taken = %.3fs\n", sum.getAsLong(), (double)(end-start)/1000);
       ...>     }
    |  created method sumOfEvenNumbersFast()
    
    jshell>
    
    jshell> sumOfEvenNumbersFast()
    Sum: 250000001500000000, time taken = 22.012s

    Notice the performance difference? Though modest in this run, the gap will only grow as the data set size increases. This illustrates the importance of properly ordering the operations when dealing with streams. Always filter before mapping whenever the order doesn’t matter – don’t waste precious cycles on operations whose results are going to be dropped anyway.

    Finally, another important point to note: non-terminal operations such as filter and map generate a value and pass it down the pipeline, repeating this process until the entire data stream has been exhausted. How then does a terminal operation like reduce or sum handle that? Such operations have internal mechanisms to keep collating intermediate results; they then produce the final result in one go and return that value to the caller.
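
    Conceptually, a terminal reduce is equivalent to a loop that keeps an accumulator and combines it with each incoming element, as in this illustrative sketch (the class and method names are mine):

```java
import java.util.stream.LongStream;

public class ReduceSketch {
    // The stream version: reduce with an identity value and an
    // accumulator function that folds each element in.
    static long reduceViaStream(long[] values) {
        return LongStream.of(values).reduce(0L, (acc, v) -> acc + v);
    }

    // The hand-written equivalent of what reduce does internally.
    static long reduceViaLoop(long[] values) {
        long acc = 0L;          // intermediate result
        for (long v : values) {
            acc = acc + v;      // collate each element into it
        }
        return acc;             // final value returned in one go
    }

    public static void main(String[] args) {
        long[] data = {1, 2, 3, 4, 5};
        System.out.println(reduceViaStream(data)); // 15
        System.out.println(reduceViaLoop(data));   // 15
    }
}
```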

  • Debugging streams:
    In most cases, each individual operation in a long chain of stream operations is small enough that we can easily weed out bugs – logical or otherwise. When the body of each operation starts growing, though, debugging becomes much more difficult. The Java compiler helps us with static type issues, but it cannot always be relied upon to pinpoint the exact issue in large bodies of code. The sane thing to do is to keep each operation limited to a single conceptual abstraction, and to ensure that that abstraction can be expressed in a line or two of code at most.
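
    One practical aid here is the peek operation, which lets us observe elements between stages without altering the pipeline – handy when hunting for logical bugs. A minimal sketch (class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PeekSketch {
    // peek prints each element before and after the filter stage,
    // letting us see exactly which elements survive.
    static List<String> shortNamesUpper(List<String> names) {
        return names.stream()
                    .peek(n -> System.out.println("before filter: " + n))
                    .filter(n -> n.length() <= 4)
                    .peek(n -> System.out.println("after filter:  " + n))
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(shortNamesUpper(Arrays.asList("Pat", "Robert", "Meg")));
        // The final line printed is [PAT, MEG]
    }
}
```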
  • Using parallel streams:

    As could be seen in the last example, we can use parallel streams in those cases where the operations are essentially independent of one another. The onus for determining and ensuring this, however, is on the developer. Java will not (and cannot) check whether the operations are independent of one another. This means that, unless the pipeline is well thought through, parallel streams can lead to a lot of head-scratching and confusion.

    Another potential problem is that debugging (which is already hard enough with parallel code) becomes even tougher with parallel streams when things don’t go as planned. This goes right back to making sure the whole sequence of operations is well thought through. Nothing can substitute for good planning.

    Finally, the performance boost might not be very noticeable in many cases, especially for small data sets. This becomes even truer on single-core machines (which are, to be fair, becoming rarer by the day). In any case, one should carefully weigh the overheads associated with parallel streams against the purported performance benefits from using them.

Conclusion

Streams are, without doubt, my favourite feature in Java 8. As the world moves more and more towards Functional Programming as a core paradigm (and one which is orthogonal to Object Orientation, I must add), it is increasingly important for developers to know at least the fundamentals of Functional Programming – using pure functions wherever possible, avoiding side-effects (especially the modification of data structures), using higher-order functions such as filter, map, and reduce, and, of course, moving towards more declarative rather than imperative code.

Functional Programming is a very old concept that is being rediscovered as the world moves into a massively multi-core environment where we just cannot afford the headaches associated with mutable state and side-effects. Of course, side-effects are essential to getting anything practical done (imagine a world without any IO!), and some languages handle that problem quite elegantly (Haskell) while others focus on providing efficient immutable data structures (Clojure). But one guideline that is bound to be useful, whatever your language’s support for Functional Programming might be (or not!), is to clearly separate the Functional and Non-Functional parts of the code and provide a clear and simple form of interaction between them. We will discuss Functional Programming and its core concepts further in the next few posts.
