DEV Community

Nils


Beyond Stream.distinct()

A Stream of objects

Starting with Java 8, you can use a Stream to process the elements of a Collection (typically a List or Set implementation) one by one: you define filtering and mapping steps lazily and then "consume" the Stream with a terminal operation.

You can transform the elements in a Stream for further processing by extracting values from objects (e.g. getting the city name from a person or calculating the length of a string) and filter elements (e.g. only persons older than x years or cities starting with "D") before consuming the resulting Stream of elements.
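As a minimal sketch of such a pipeline (the class name, method name and city list are made up for illustration), this filters city names starting with "D" and maps each one to its length:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class StreamBasics {
    // Returns the lengths of all city names that start with "D".
    static List<Integer> lengthsOfCitiesStartingWithD(List<String> cities) {
        return cities.stream()
            .filter(city -> city.startsWith("D")) // lazy: nothing runs yet
            .map(String::length)                  // lazy as well
            .collect(Collectors.toList());        // terminal operation, consumes the Stream
    }

    public static void main(String[] args) {
        List<String> cities = Arrays.asList("Dresden", "Berlin", "Dortmund", "Hamburg");
        System.out.println(lengthsOfCitiesStartingWithD(cities)); // [7, 8]
    }
}
```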

What defines "distinct"?

One special way of filtering elements in a Stream is distinct(), which returns only the unique elements. Uniqueness is checked via the equals() method of the processed elements, so you need to implement both equals() and hashCode() consistently to be able to use distinct(). While simple classes like String, Long or LocalDate work as expected, it can be difficult to implement those methods correctly for custom objects. It also limits how custom objects can be processed, because a class can only implement one definition of "equal".
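To illustrate that limitation, here is a small sketch with a hypothetical two-field Person stand-in (not the class shown later in this post) whose equals() and hashCode() look at the id only. distinct() then removes duplicate ids, but it cannot be told to compare last names instead:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class DistinctWithEquals {
    // Hypothetical stand-in: two persons are "equal" when their ids match.
    static final class Person {
        private final int id;
        private final String lastname;
        Person(int id, String lastname) { this.id = id; this.lastname = lastname; }
        int getId() { return id; }
        String getLastname() { return lastname; }

        @Override public boolean equals(Object o) {
            return o instanceof Person && ((Person) o).id == id;
        }
        @Override public int hashCode() { return Integer.hashCode(id); }
    }

    // distinct() relies on the equals()/hashCode() pair defined above.
    static List<Person> unique(List<Person> people) {
        return people.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(1, "Smith"), new Person(2, "Smith"), new Person(1, "Smith"));
        // Ids 1 and 2 remain, although both persons are named Smith.
        System.out.println(unique(people).size()); // 2
    }
}
```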

A more flexible, yet easy solution

First, you need to define a function that extracts the field that should be unique from your custom object.

For example, take this class representing a very simple person entry:

public class Person {
    private int id;
    private String firstname;
    private String lastname;

    public Person() { }

    public void setId(int id) {
        this.id = id;
    }
    public int getId() {
        return this.id;
    }
    public void setFirstname(String firstname) {
        this.firstname = firstname;
    }
    public String getFirstname() {
        return this.firstname;
    }
    public void setLastname(String lastname) {
        this.lastname = lastname;
    }
    public String getLastname() {
        return this.lastname;
    }
}

You can get the ID by calling p.getId() on a Person object p or, expressed as a method reference, Person::getId.
The same goes for Person::getFirstname and Person::getLastname.

Our list l (e.g. an ArrayList<Person>) contains some person entries and we want to get unique entries from that list, based on the lastname property.

For this we can use Collectors.groupingBy(...) with a reference to the property's access function and a second Collector that "remembers" the first non-null person element:

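import java.util.Map;
import java.util.stream.Collectors;

Map<String, Person> m = l.stream()
    .collect(Collectors.groupingBy(
        Person::getLastname,  // key: the property that should be unique
        // value: keep the first person seen for each last name
        Collectors.reducing(null, (a, b) -> (a != null) ? a : b)
    ));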

As a result, we get a map with all distinct last names as keys and the first person element with that last name as the corresponding value.
If needed, the reduction collector can be changed to return the last person instead: Collectors.reducing(null, (a, b) -> (b != null) ? b : a)
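Here is a self-contained sketch of both variants. To keep it short, it uses a compact Person stand-in with a constructor instead of setters; the class and method names are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DistinctByLastname {
    // Compact stand-in for the Person class above (constructor instead of setters).
    static final class Person {
        private final int id;
        private final String firstname;
        private final String lastname;
        Person(int id, String firstname, String lastname) {
            this.id = id; this.firstname = firstname; this.lastname = lastname;
        }
        int getId() { return id; }
        String getFirstname() { return firstname; }
        String getLastname() { return lastname; }
    }

    // Keeps the first person encountered for each last name.
    static Map<String, Person> firstPerLastname(List<Person> people) {
        return people.stream().collect(Collectors.groupingBy(
            Person::getLastname,
            Collectors.reducing(null, (a, b) -> (a != null) ? a : b)));
    }

    // Keeps the last person encountered for each last name.
    static Map<String, Person> lastPerLastname(List<Person> people) {
        return people.stream().collect(Collectors.groupingBy(
            Person::getLastname,
            Collectors.reducing(null, (a, b) -> (b != null) ? b : a)));
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(1, "Anna", "Smith"),
            new Person(2, "Bob", "Jones"),
            new Person(3, "Carl", "Smith"));
        System.out.println(firstPerLastname(people).get("Smith").getFirstname()); // Anna
        System.out.println(lastPerLastname(people).get("Smith").getFirstname());  // Carl
    }
}
```

Note that groupingBy(...) does not accept null keys, so a person without a last name would cause a NullPointerException here.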

Optionally, you can insert LinkedHashMap::new between the access function and the reduction collector. This specifies the type of map that will be used to return the grouped data. With a LinkedHashMap, the keys appear in the order in which the elements were processed. However, this only helps if the given stream has a defined encounter order: a stream from a List has one, a stream from a HashSet does not, so the results may vary.
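A sketch of the three-argument form, again with a compact Person stand-in and made-up class and method names:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class OrderedDistinctByLastname {
    // Compact stand-in for the Person class above.
    static final class Person {
        private final int id;
        private final String lastname;
        Person(int id, String lastname) { this.id = id; this.lastname = lastname; }
        int getId() { return id; }
        String getLastname() { return lastname; }
    }

    // Same grouping as before, but the result map keeps insertion order.
    static Map<String, Person> firstPerLastnameOrdered(List<Person> people) {
        return people.stream().collect(Collectors.groupingBy(
            Person::getLastname,
            LinkedHashMap::new, // map factory: decides the concrete Map type
            Collectors.reducing(null, (a, b) -> (a != null) ? a : b)));
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(1, "Smith"), new Person(2, "Jones"), new Person(3, "Smith"));
        // The source is a List, so the keys follow the list order.
        System.out.println(firstPerLastnameOrdered(people).keySet()); // [Smith, Jones]
    }
}
```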

This also works for simple element types, so you can use this logic for both custom and simple objects. For simple objects the access function is just (obj) -> obj or Function.identity().
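For a list of strings, a sketch could look like this (class and method names are made up; a LinkedHashMap is used so that the result order is predictable):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class DistinctSimpleTypes {
    // Groups each word by itself, so every key maps to its first occurrence.
    static List<String> uniqueWords(List<String> words) {
        Map<String, String> byValue = words.stream().collect(Collectors.groupingBy(
            Function.identity(),
            LinkedHashMap::new,
            Collectors.reducing(null, (a, b) -> (a != null) ? a : b)));
        return new ArrayList<>(byValue.values());
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords(Arrays.asList("b", "a", "b", "c", "a"))); // [b, a, c]
    }
}
```

For simple types this gives the same elements as distinct(), so it is mainly useful when you want one code path for both simple and custom objects.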

Downsides and alternatives

The main downside of this method is that Stream.collect(...) is a terminal operation. This means that calling collect(...) consumes the stream and executes all the filter and map operations that are defined on the Stream. It then returns the final result of the collect operation, in our case the result map.
Of course, you can get a new Stream on the values or key set of that map for further processing, but the source Stream has been consumed at that point.
In contrast, distinct() is a stateful intermediate operation: it returns a new Stream, and nothing is executed on the source Stream at that moment.
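Continuing with a new Stream on the map values might look like the following sketch (compact Person stand-in, made-up class and method names):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ContinueAfterCollect {
    // Compact stand-in for the Person class above.
    static final class Person {
        private final String firstname;
        private final String lastname;
        Person(String firstname, String lastname) {
            this.firstname = firstname; this.lastname = lastname;
        }
        String getFirstname() { return firstname; }
        String getLastname() { return lastname; }
    }

    static List<String> firstnamesOfDistinctLastnames(List<Person> people) {
        // Step 1: terminal operation, the source Stream is consumed here.
        Map<String, Person> byLastname = people.stream().collect(Collectors.groupingBy(
            Person::getLastname,
            LinkedHashMap::new,
            Collectors.reducing(null, (a, b) -> (a != null) ? a : b)));
        // Step 2: a second, independent Stream over the distinct persons.
        return byLastname.values().stream()
            .map(Person::getFirstname)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person("Carl", "Smith"), new Person("Bob", "Jones"), new Person("Anna", "Smith"));
        System.out.println(firstnamesOfDistinctLastnames(people)); // [Bob, Carl]
    }
}
```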

Some open source projects like Eclipse Collections and Vavr provide alternatives to lists, sets, maps and streams which also include the option to define an access function to an object property when collecting distinct elements.

There are also some other clever alternatives described in the answers to this Stack Overflow question. By the way, my personal favorite there is returning a Predicate via a helper method distinctByKey(...).
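One way such a helper might look is sketched below. This is an illustration of the idea, not a copy of any particular answer; the surrounding class and the Person stand-in are assumptions. The Predicate remembers every key it has seen and only lets the first element per key pass, so it can be used with filter(...) as an intermediate operation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class DistinctByKey {
    // Compact stand-in for the Person class above.
    static final class Person {
        private final int id;
        private final String lastname;
        Person(int id, String lastname) { this.id = id; this.lastname = lastname; }
        int getId() { return id; }
        String getLastname() { return lastname; }
    }

    // Returns a stateful Predicate that is true only for the first element per key.
    // Create a fresh Predicate for every Stream; null keys are not supported
    // because the backing concurrent set rejects null.
    static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        return t -> seen.add(keyExtractor.apply(t));
    }

    static List<Integer> idsOfDistinctLastnames(List<Person> people) {
        return people.stream()
            .filter(distinctByKey(Person::getLastname)) // still an intermediate operation
            .map(Person::getId)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(1, "Smith"), new Person(2, "Jones"), new Person(3, "Smith"));
        System.out.println(idsOfDistinctLastnames(people)); // [1, 2]
    }
}
```

On a parallel Stream, which element "wins" for a given key is not deterministic with this approach.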

Conclusion

The standard method Stream.distinct() works well for simple types and for custom objects that provide a fixed way to determine object equality.
If you need more flexibility in defining which property (or even which combination of properties) makes an object unique, and want to filter data accordingly, you can use the provided Collector implementations, build custom Predicates or Collectors, or use other open source collection frameworks.
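For a combination of properties, one possible sketch is to use a List of the relevant values as the grouping key, since List already implements equals() and hashCode() element by element. The class name, method name and Person stand-in are again assumptions for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DistinctByFullName {
    // Compact stand-in for the Person class above.
    static final class Person {
        private final int id;
        private final String firstname;
        private final String lastname;
        Person(int id, String firstname, String lastname) {
            this.id = id; this.firstname = firstname; this.lastname = lastname;
        }
        int getId() { return id; }
        String getFirstname() { return firstname; }
        String getLastname() { return lastname; }
    }

    // Key is the pair (firstname, lastname); the first person per pair is kept.
    static Map<List<String>, Person> firstPerFullName(List<Person> people) {
        return people.stream().collect(Collectors.groupingBy(
            p -> Arrays.asList(p.getFirstname(), p.getLastname()),
            Collectors.reducing(null, (a, b) -> (a != null) ? a : b)));
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(1, "Anna", "Smith"),
            new Person(2, "Anna", "Jones"),
            new Person(3, "Anna", "Smith"));
        System.out.println(firstPerFullName(people).size()); // 2
    }
}
```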
