Notes on Clojure Sequence Implementations

21 Sep 2011

These are notes from a talk I gave at Clojure.mn on September 7, 2011. The goal was to explore Clojure sequences in more depth than the standard introductory books and tutorials. As you will see, our investigations quickly led us to the depths of Clojure's Java implementation.

What is a sequence?

A sequence is a logical list that, unlike a standard Lisp cons cells, is not tied to a particular implementation. A seq is an object that implements the seq interface on a collection, allowing you to access it like a sequence. Rich Hickey says, "A seq is like a logical cursor," and that, unlike iterators, they are immutable.

At the implementation level, sequences are defined by the clojure.lang.ISeq Java interface. When this interface is implemented on a collection, it becomes accessible as a sequence.

Collections that can be "viewed as sequences" are called seq-able. Programming Clojure provides a nice overview such collections:

  • Clojure collections (every aggregate data structure)
  • Java collections
  • Java arrays and strings
  • Regular expression matches
  • Directory structures
  • I/O streams
  • XML trees

So if you are familiar with sequences, you can manipulate most of the data structures you'll use in Clojure.

Some examples

Let's look at some collections to see what is and isn't a sequence. We'll use the seq? predicate to check if a collection implements the ISeq interface.

user> (seq? '(1 2))      ;=> true
user> (seq? [1 2])       ;=> false
user> (seq? {:foo :bar}) ;=> false
user> (seq? #{1 2})      ;=> false
user> (seq? ())          ;=> true

So lists (even empty ones) are sequences, and vectors, maps, and sets are not. But aren't all Clojure collections sequences? Technically, they are all "viewable as sequences," which means that they can implement the ISeq interface, and return a seq object that can be treated like a sequence. To make them do that, you use the the seq function, which implements the ISeq interface on the collection passed to it. Let's try it out on the collections we just tested.

user> (seq '(1 2))      ;=> (1 2)
user> (seq [1 2])       ;=> (1 2)
user> (seq {:foo :bar}) ;=> ([:foo :bar])
user> (seq #{1 2})      ;=> (1 2)
user> (seq ())          ;=> nil

We see that calling seq on the list, vector, and set all printed the same output, while the map was turned into something else, and the empty set returned nil. But this is just how the REPL represents sequences. To see what's really happening under the hood, let's look at the classes that are returned.

user> (class '(1 2))
;=> clojure.lang.PersistentList
user> (class (seq '(1 2)))
;=> clojure.lang.PersistentList

So calling seq on a list returns a list. This is unsurprising, since the seq? predicate returns true for lists, meaning they implement ISeq out of the box. Another interpretation is that since the argument to seq is already a sequence, it returns the original sequence rather than a seq object which accesses the argument.

user> (class [1 2])
;=> clojure.lang.PersistentVector
user> (class (seq [1 2]))
;=> clojure.lang.PersistentVector$ChunkedSeq

Here we see that clojure.lang.PersisitentVector is the vector class, and clojure.lang.PersistentVector$ChunkedSeq is a seq object that accesses vectors. We see similar seq objects returned for maps (clojure.lang.PersistentArrayMap) and sets (clojure.lang.PersistentHashSet).

user> (class (seq {:foo :bar}))
;=> clojure.lang.PersistentArrayMap$Seq
user> (class (seq #{1 2}))
;=> clojure.lang.APersistentMap$KeySeq

The point is not to dwell on the specific implementation details, but to see that while they all look like (1 2) in the REPL, each is actually a different collection, and each collection returns a unique seq object to access it as a sequence.

Laziness

Laziness means you can specify a computation that would theoretically take forever to complete, and evaluate as much or as little of it as you want. An example of this is the primes sequence taken from clojure.contrib.lazy-seqs.

It's said that "most" Clojure sequences are lazy. So why can't we simply say they're all lazy? One reason is that you can always define your own non-lazy sequence. There may also be implementation details that make some sequences less lazy than others, such as with those collections that are implemented as chunked sequences.

A simple lazy sequence can be created using the range function.

user> (range 10)
;=> (0 1 2 3 4 5 6 7 8 9)
user> (class (range 10))
;=> clojure.lang.LazySeq

While the REPL evaluates the entire sequence and prints it out as part of it's P step, a program using this structure would not count from 0 to 9 until it was told to. One way to keep the REPL from evaluating the whole sequence is to use the take function:

user> (take 5 (range 10))
;=> (0 1 2 3 4)
user> (class (take 5 (range 10)))
;=> clojure.lang.LazySeq

We see that take returns a sequence that is a subset of its input sequence. As a consequence, only that subsequence is evaluated in the REPL.

So how do we force the evaluation of an entire sequence, say, to realize its side effects? You can use the dorun and doall functions (or the doseq macro, which works like a list comprehension), as follows:

user> (dorun (range 10))
;=> nil
user> (doall (range 10))
;=> (0 1 2 3 4 5 6 7 8 9)

This illustrates the difference between these two functions: doall returns the sequence after evaluating it; dorun does not. Both produce all of the side effects resulting from evaluating the sequence.

first returns items

The first function evaluates the first element of a sequence and returns it. So the output is of the same class as the elements of the sequence. If its argument isn't a sequence, it calls seq on it first.

user> (first '(1 2))                 ;=> 1
user> (first [1 2])                  ;=> 1
user> (first {:foo :bar, :baz :bot}) ;=> [:foo :bar]
user> (first #{1 2})                 ;=> 1
user> (first ())                     ;=> nil

The behavior is pretty straightforward for standard Clojure collections. Notice that

  • the "first" element of a vector is the one with the 0 index,
  • the "first" element of a map is the leftmost pair defined in the map literal,
  • and the "first" element of a set is the leftmost element defined in the set literal.

Calling first on the empty set returns nil, as expected. And what's the class of the first map pair?

user> (class (first {:foo :bar, :baz :bot}))
;=> clojure.lang.MapEntry

So calling seq on a map returns a sequence of clojure.lang.MapEntrys, which are printed like vectors in the REPL.

rest and cons always return sequences

The rest function returns the remaining items in a sequence. The result is always a sequence, regardless of it's input. So with vectors we see that rest outputs a sequence:

user> (seq? [1 2 3])    ;=> false
user> (rest [1 2 3])    ;=> (2 3)

The cons function constructs new sequences from old sequences by appending the first argument to the front of the second.

user> (cons 1 '(1 2))         ;=> (1 1 2)
user> (class (cons 1 '(1 2))) ;=> clojure.lang.Cons

This clojure.lang.Cons class is a sequence inasmuch as it implements ISeq, but it does not implement all the features of a standard list (PersistentList).

consing a vector also produces a clojure.lang.Cons sequence.

user> (cons 1 [1 2])         ;=> (1 1 2)
user> (class (cons 1 [1 2])) ;=> clojure.lang.Cons

Sets operate similarly, but since the second argument has seq called on it before appending the first argument to it, it no longer acts like a set, which can only have unique elements.

user> (cons 1 #{1 2})        ;=> (1 1 2)

Since cons always outputs clojure.lang.Conss, I'll omit any further class checking of its output.

consing maps is not as straightforward, since a map is a set of key/value pairs and cons will simply call seq on its input and append an item to the front of it. So consing a keyword will produce a sequence with a keyword followed by a map entry.

user> (cons :baz {:foo :bar}) ;=> (:baz [:foo :bar])

consing another map will produce a sequence whose first entry is a map, followed by a map entry, which is a pretty odd thing.

user> (cons {:baz :bot} {:foo :bar})
;=> ({:baz :bot} [:foo :bar])

The preferred method is to cons a two-element vector, which is interpreted as a map entry, onto a map.

user> (cons [:baz :bot] {:foo :bar})
;=> ([:baz :bot] [:foo :bar])

The result is a sequence of map entries — the standard way Clojure converts maps to sequences.

rest versus next

Both rest and next return the remaining items in the sequence, but they deal with empty sequences differently. rest returns a sequence no matter what, and never returns nil. next returns nil instead of an empty sequence, by calling seq on the result. So you can think of (next aseq) as equivalent to (seq (rest aseq)).

user> (rest '(1)) ;=> ()
user> (seq ())    ;=> nil
user> (next '(1)) ;=> nil

So next is less lazy than rest because it looks ahead at the returned sequence to see if it's empty via the seq function. This makes rest the preferred method.

cons versus conj

The main difference between cons and conj is that cons returns sequences (of type clojure.lang.Cons) regardless of its input, while conj returns the type of collection that was passed to it. Also, conj can add multiple items to a collection, whereas cons can only add one at a time. This means the collection argument comes first in conj, so that there can be a variable number of items added to the collection.

A more subtle difference is that conj operates on the collection's underlying data structure, rather than on the ISeq interface, so it can be more efficient. As a consequence, conj doesn't always append items to the beginning of collections. Here are some illustrations.

user> (conj '(1 2) 1)         ;=> (1 1 2)
user> (class (conj '(1 2) 1)) ;=> clojure.lang.PersistentList

So for lists, conj appends to the beginning, like cons. But it returns a PersistentList rather than a Cons sequence. On a vector, conj appends to the end, and returns a vector:

user> (conj [1 2] 1)         ;=> [1 2 1]
user> (class (conj [1 2] 1)) ;=> clojure.lang.PersistentVector

Sets are unordered, so there's no concept of beginning or end, and they contain only unique elements.

user> (conj #{1 2} 1)         ;=> #{1 2}
user> (class (conj #{1 2} 1)) ;=> clojure.lang.PersistentHashSet

So conjing 1 to a set that already contains 1 produces an identical set. In this way, conj operates like a set union operation. It's worth noting that consing 1 to #{1 2} not only outputs a different data structure, but that data structure also contains the element 1 twice, since sequences don't preserve the uniqueness constraint of sets.

conjing a map pair onto a map produces a new map, which is unordered.

user> (conj {:foo :bar} [:baz :bot])
;=> {:baz :bot, :foo :bar}
user> (class (conj {:foo :bar} [:baz :bot]))
;=> clojure.lang.PersistentArrayMap

Unlike with cons, we can also append maps to maps.

user> (conj {:foo :bar} {:baz :bot})
;=> {:baz :bot, :foo :bar}

Conclusion

There are many other interesting features of sequences, but our focus here has been on their implementations, rather than their day to day use. Much more could be said about these implementations but, as we've seen, the further you go down the implementation rabbit hole, the more your inquiry begins to resemble a Java exercise rather than a Clojure one. Maybe its best, next time you're wondering what's under the hood of your favorite sequence, to remind yourself that you're here to program Clojure, not Java.

Update: (October 1, 2011) Fixed some typos and made more precise use of the terms seq and sequence, following the conventions in The Joy of Clojure, §5.1.2.