The Slogan Handbook - Composite Data Types

The preceding chapter discussed basic data types that serve as building blocks or atoms of large Slogan programs. This chapter is about composite data types, which are molecules created by combining the basic data types in various ways. Some of the data types we discuss here are pairs, lists, arrays, hash tables, sets and records. Pairs, lists and arrays are also known as aggregate types because they are concatenations of other values.¹

5.1 Pairs

Suppose you are writing a program that deals with GPS coordinates. A GPS coordinate consists of two real numbers – one for latitude and the other for longitude. You can always use two separate variables for these, but your program will lack the ability to express the concept of location as a concrete entity. Without this ability, functions in your program will have to accept and "interpret" two distinct numbers as representing a single location.

Let us tackle the problem of giving the GPS coordinate a concrete representation. We need a mechanism to glue together two real numbers into a single value. Slogan's pair data type is ideal for this purpose. A pair can be constructed either by calling the pair function or by using the colon (:) operator². Here are the coordinates of two cities of the world, encoded as decimal degrees:


let new_york_coords = pair(40.7166638, -74.0)
let bangalore_coords = 12.97160:77.59456

new_york_coords
// 40.7166638:-74.0
bangalore_coords
// 12.9716:77.59456

The values in a pair can be accessed in two ways. The first method is to use the head and tail functions.


head(bangalore_coords)
// 12.9716
tail(bangalore_coords)
// 77.59456

If you want to bind both members in a pair to new variables, it is more convenient to use the data destructuring mechanism built-into the let statement. The variable part of a let can be a "pattern". If the internal structure of the value on the right-hand side of the assignment matches this pattern, values inside the structure will be bound to variables in the pattern.


let ny_lat:ny_long = new_york_coords
ny_lat
// 40.7166638
ny_long
// -74.0

If you want to bind only certain elements in the structure, just replace the variable name with an underscore (_) in the pattern:


let _:bng_long = bangalore_coords
bng_long
// 77.59456

Now that we have a suitable representation for GPS coordinates, it is a good idea to hide the details of this representation under suitable functions. This will enable the rest of the code to deal with location or coordinates as an abstract idea. Basically we need three functions – a constructor to build coordinates out of two real numbers and two selectors for accessing the individual parts of the location. If we ever decide to use a different internal representation, only these three functions need to change. This is because the rest of the code create and access coordinates through these functions and does not make any assumptions about how they are really represented in memory. This way we add a thin layer of data abstraction to our program.


function make_coords(lat, long) lat:long
function latitude(coord) head(coord)
function longitude(coord) tail(coord)

As the constructors and selectors are not doing nothing much on their own, the following definitions are also possible, in which they become just aliases for pair, head and tail:


let make_coords = pair
let latitude = head
let longitude = tail


let new_york_coords = make_coords(40.7166638, -74.0)
let bangalore_coords = make_coords(12.97160, 77.59456)
latitude(new_york_coords)
// 40.7166638
longitude(bangalore_coords)
// 77.59456

Any new functions that we write need not be bothered about the fact that we represent coordinates with pairs. As an example, we will write a function that prints the DMS value of a coordinate object.


function show_dms(coord)
{ show_dms_component(latitude(coord))
  show("  ")
  show_dms_component(longitude(coord))
  newline() }

function show_dms_component(dd_component)
  show(floor(dd_component), "d ",
       floor(mod(dd_component * 60, 60)), "m ",
       mod(abs(dd_component) * 3600, 60), "s ")
      
// Usage:
show_dms(bangalore_coords)
//> 12.0d 58.0m 17.760000000002037s   77.0d 35.0m 40.41600000002654s 
show_dms(new_york_coords)
//> 40.0d 42.0m 59.98967999999877s   -74.0d 0.0m 0.0s

5.2 Lists

The values glued together by a pair need not be primitives. They can be other pairs, for instance. This enables us to build hierarchical data structures like the one shown below:


let tree = (1:(2:3:(4:5)))
head(tree)
// 1
head(tail(tree))
// 2
tail(tail(tree))
// 3:4:5

The basic sequence obtained by chaining together pairs are known as a list. A proper list will be terminated by a value that represents the empty sequence. In Slogan this value is represented by []. There are various ways you could build a proper list. Some of these methods are illustrated here:


// pairs terminated by an empty list ([]), is a proper list.
1:2:3:4:5:[]  
// [1, 2, 3, 4, 5]

// a list literal can also be constructed by enclosing comma-separated values in []:
[\a, \e, \i, \o, \u]
// [\a, \e, \i, \o, \u]

// another way to build a proper list is to call the `list` function:
list(1, 2, "hello", 3)
// [1, 2, hello, 3]

A list of pairs can be used as a table to lookup information. The lookup function can treat the head of each pair as the key to find the associated data.


let price_list = ['orange:80, 'apple:120, 'grapes:72]

assq('apple, price_list)
// apple:120
assq('mango, price_list);
// false

function calculate_price(fruit, kg)
{ let entry = assq(fruit, price_list)
  when (entry) tail(entry) * kg }

calculate_price('apple, 2)
// 240
calculate_price('grapes, 5)
// 360
calculate_price('mango, 5)
// false

If the lookup-keys are of a complex type like string, list or large integers, assq won't work. We need a function that inspects the structure of the key for equality. The function assoc is defined for this purpose. Assoc uses the is_equal predicate, which is mapped to the == operator, to do the equality check.


let person = ["name": "nemo", "age": 1]
assoc("name", person)
// name:nemo
assoc("age", person)
// age:1
assq("age", person)
// false

The next program demonstrates some useful functions that can be applied on lists:


let xs = [10, 3, 45, 8, 9]
length(xs)
// 5
length(xs) == count(xs)
// true
at(xs, 2)
// 45
xs[2]
// 45
xs[1:3]
// [3, 45]
xs[:3]
// [10, 3, 45]
reverse(xs)
// [9, 8, 45, 3, 10]
sort(xs)
// [3, 8, 9, 10, 45]

// comparisons
[1, 2, 3] == [1, 2, 3]
// true
[1, 2, 3] <> [4, 5, 6]
// true
[1, 2, 3] < [4, 5, 6]
// true
[1, 2, 3] > [4, 5, 6]
// false
[1, 2, 3] >= [1, 2, 3]
// true

// membership checks
memq(3, xs)
// [3, 45, 8, 9]
memq(10, xs)
// [10, 3, 45, 8, 9]
memq(1, xs)
// false

let ys = ["a", "list", "of", "strings", ["and", "lists"]]
// memq won't work because it uses `is_eq`
memq("of", ys)
// false

// use `member` instead
member("of", ys)
// [of, strings, [and, lists]]
member(["and", "lists"], ys)
// [[and, lists]]
member(["and", "list"], ys)
// false

function is_vowel(c)
  memq(c, [\a, \e, \i, \o, \u])

is_vowel(\a)
// [\a, \e, \i, \o, \u]
is_vowel(\o)
// [\o, \u]
is_vowel(\k)
// false

5.2.1 List Comprehensions

A list comprehension is a notational convenience for constructing lists from other lists. It has the following general syntax:


[out_expr | var_expr <- input_list where filter_expr, ...]

Out_expr constructs each element in the output list. Var_expr assign values to variables used in out_expr. Each value is "extracted" from an input list. The where_clause is optional and is used to filter values extracted from the input list.


[x * x | x <- [1, 2, 3, 4, 5]]
// [1, 4, 9, 16, 25]

[i : j | i <- range(1, 5), j <- range(i, 5) where is_even(i)]
// [2:2, 2:3, 2:4, 2:5, 4:4, 4:5]

function triads(n)
{ let elems = range(1, n);
  [[x, y, z] | x <- elems,
               y <- elems,
               z <- elems where x * x + y * y == z * z] }

triads(15)
// [[3, 4, 5], [4, 3, 5], [5, 12, 13], [6, 8, 10],
    [8, 6, 10], [9, 12, 15], [12, 5, 13], [12, 9, 15]]

function concat(xss) [x | xs <- xss, x <- xs]
concat([[1, 2, 3], [4, 5, 6]])
// [1, 2, 3, 4, 5, 6]

5.3 Arrays

Arrays are fixed-length sequences that provide constant-time, position-based lookup for elements. If fast lookups are required you should always prefer an array over a list because a list can only provide sequential access to its members. Just like for lists, there are multiple ways to create and initialize arrays:


#[1, 2, 3]
// #[1, 2, 3]
array("hello", "world")
// #[hello, world]
let xs = make_array(5)
xs
// #[false, false, false, false, false]
array_set(xs, 0, 120)
array_set(xs, 2, "hi")
xs
// #[120, false, hi, false, false]
xs[0]
// 120
let ys = #[1, 2, 3, 4, 5]
ys[2:4]
// #[3, 4]

5.3.1 Type specific arrays

Slogan provide arrays for storing and accessing specific numeric types. For example, the byte-array is optimized for bytes. There are also arrays for 16bit/32bit/64bit signed/unsigned integers and 32bit/64bit floating-point numbers. The type of an array literal is specified by an identifier after the # sign. For example #u8 means an array of unsigned bytes and #s32 means an array of signed 32bit integers.


#u8[1, 34, 250][1]
// 34

// precision of elements may vary based on architecture:
#f32[1.0, 34.114, 250.12][1]
// 34.11399841308594

Another useful type-specific array are bit_arrays. They are designed to efficiently store and retrieve bit-encoded information.


let flags = #b[1, 0, 1, 1]
flags[0]
// true
flags[1]
// false
bit_array_clear(flags, 2)
flags
// #b[1, 0, 0, 1]

Exercise 5.1. Read about the 16 bit color encoding scheme, where the red and blue components are encoded using 5 bits and the green component is encoded in 6 bits. Implement a function, make_color that takes the red, green and blue components as arguments and return the encoded color value as a bit-array. Also write selectors for decoding the color object into individual red, green and blue values.

Bloom Filter

A bloom filter is a data structure that can quickly test if an element is a member of a set. A bloom filter is basically a large bit-array. An element is added to the bloom filter by first converting it into a bunch of integers called hashes and then using those as indices to be turned-on in the bit-array. Membership check also happens similarly - if all the bits at the hashes of the element are on, it is a member of the set. Bloom filters are space efficient because elements are reduced to a few bits and stored. They are ideal when fast lookups against a huge set is required and a false positive result is not catastrophic. An example application is in the domain of crawling and indexing web pages. A crawler has to retrieve and index millions or even billion of pages. When it encounters a new URL, it has to quickly figure out if that URL was already crawled or not. Bloom filter is an ideal data structure here because it is space efficient, fast and rarely re-crawling a page need not be a big deal.

Let us go straight into the implementation of the bloom filter. Note that we make use of the hash functions built into Slogan. A production quality bloom filter will require better hashing techniques.


function make_bloom_filter(size)
  make_bit_array(size)

/* Return two hashes for the string `entry`.
   The first is generated using the built-in `string_hash`
   function. The second hash is generated by converting
   `entry` into a list of integers and then hashing that list. 
*/
function hash_entry(entry, size)
{ let h1 = string_hash(entry)
  // we haven't talked about `map` yet, but we will soon!
  let h2 = equal_hash(map(char_to_integer, string_to_list(entry)))
  remainder(h1, size):remainder(h2, size) }

/* Add an entry to the bloom filter.
   The bits at the positions identified by the hashes
   are turned on. 
*/
function bloom_filter_set(b, entry)
{ let h1:h2 = hash_entry(entry, bit_array_length(b))
  bit_array_set(b, h1)
  bit_array_set(b, h2) }

/* Return true if `entry` is a member of the bloom filter.
   Both bits identified by the hashes must be on.
*/
function bloom_filter_test(b, entry)
{ let h1:h2 = hash_entry(entry, bit_array_length(b))
  b[h1] && b[h2] }


let b = make_bloom_filter(1000)
bloom_filter_set(b, "hello")
bloom_filter_set(b, "helLO")
bloom_filter_set(b, "hello world")

bloom_filter_test(b, "hello")
// true
bloom_filter_test(b, "helLO")
// true
bloom_filter_test(b, "hello world")
// true
bloom_filter_test(b, "hello, world")
// false
bloom_filter_test(b, "HelLO")
// false

5.4 Hash Tables

The hash table is one of the most ingenious and versatile of all data structures. It is an unordered collection of key/value pairs in which all the keys are distinct, and the value associated with a given key can be retrieved, updated, or removed using a constant number of key comparisons on the average, no matter how large the hash table.

The simplest way to create a hash table is to write down it as pairs enclosed in #{}. The head of a pair is treated as key and the tail becomes the associated value.


let ages = #{"alice":10, "bob":8, "eve":12}
ages["alice"]
// 10
ages["eve"] = ages["eve"] + 2
ages["eve"]
// 14
ages["olivia"]
// false

// return a default value for a missing key
hashtable_at(ages, "olivia", 7)
// 7
hashtable_keys(ages)
// #[alice, eve, bob]
hashtable_values(ages)
// #[10, 14, 8]

5.5 Sets

A set stores an unordered sequence of objects without duplicates. It is an implementation of the mathematical concept of finite sets. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set. A set literal is written by enclosing the objects in #().


let s1 = #(1, 2, 3, 4)
let s2 = #(3, 4, 5, 6)
is_set_member(s1, 2)
// true
is_set_member(s1, 5)
// false
set_difference(s1, s2)
// #(1, 2)
set_difference(s2, s1)
// #(5, 6)
set_intersection(s1, s2)
// #(3, 4)
set_union(s1, s2)
// #(1, 2, 3, 4, 5, 6)
is_subset(#(1, 2), #(1, 2, 3, 4))
// true
is_superset(#(1, 2), #(1, 2, 3, 4))
// false
is_superset(#(1, 2, 3, 4), #(1, 2))
// true

5.6 Records

Records are a means for defining new, distinct types. The record statement is used to introduce a new custom type. Its general syntax is shown below:


record <name> (<member01> where <pre-condition>, <member02> ...)

For a new record type, Slogan automatically generates a constructor and selector/modifier functions for accessing and updating its members.


record employee(name, salary, dept)
let e1 = employee(name = "alice", salary = 3400, dept = "ENG")
let e2 = employee(name = "bob", salary = 4500, dept = "FIN")

e1
// #<employee #4 name: "alice" salary: 3400 dept: "ENG">
e2
// #<employee #5 name: "bob" salary: 4500 dept: "FIN">

employee_name(e1)
// alice
employee_dept(e2)
// FIN
employee_set_salary(e1, 3600)
e1
// #<employee #4 name: "alice" salary: 3600 dept: "ENG">

One problem with the auto-generated constructor is that it won't do any data integrity checks. For instance, you are allowed to create an employee with an invalid salary:


employee(name = "nemo", salary = "#@@@#@@$", dept = "ENG")
#<employee #6 name: "nemo" salary: "#@@@#@@$" dept: "ENG">

The optional where clause allows us to specify data validation rules for record values. Let us redefine the employee record with some condition checks.


record employee(name   where is_string(name),
                salary where is_integer(salary)
                             && salary > 1500
                             && salary < 10000,
                dept   where is_string(dept))

employee(name = "nemo", salary = "#@@@#@@$", dept = "ENG")
//> error: precondition_failed, #@@@#@@$
employee(name = "nemo", salary = 230, dept = "ENG")
//> error: precondition_failed, 230
employee(name = "nemo", salary = 2300, dept = "ENG")
// #<employee #7 name: "nemo" salary: 2300 dept: "ENG">

5.7 Patterns of Data

Slogan has the ability to take apart data structures and do pattern matching on them. A pattern match expression has the following general form:


match (value)
  pattern_1 -> result_1
| pattern_2 -> result_2          
| ...

If value does not match any of the listed patterns, a no_match_found error is raised.

Let us begin our exploration of pattern matching with the help of a few simple examples. Later we will see how this facility can lead to the clean and concise specification of a non-trivial algorithm.

Our first example re-implements a function from Slogan core - the length function that return the number of elements in a list.


function length(xs)
  match(xs)
    [] -> 0        
  | h:t -> 1 + length(t)

Our definition of length does a pattern destructuring on its argument. If the pattern matches an empty list, 0 is returned. If the pattern matches a head and a tail pair, the length is 1 added to the length of tail.

We can write this function more compactly, by eliminating the explicit declaration of match. We also do not need to bind the h variable because we don't use it. This can be replaced by the _ wildcard character.


function length(xs)
| [] -> 0        
| _:t -> 1 + length(t)

Pattern matching can be done on any data type with a literal representation - numbers, strings, lists, arrays, hash tables, sets and so on. A few more examples follows:


function factorial(n)
| 0 -> 1
| _ -> n * factorial(n-1)

factorial(10)
// 3628800
factorial(3)
//6

// Evaluate arithmetic expressions in the
// format #{operation: [expr1, expr2]}, where `operation`
// is one of the four symbols - 'add, 'sub, 'mul and 'div.
// An expression can also be a numeric literal.  
function calculate(expr)
| #{'add: [e1, e2]}    -> calculate(e1) + calculate(e2)
| #{'sub: [e1, e2]}    -> calculate(e1) - calculate(e2)
| #{'mul: [e1, e2]}    -> calculate(e1) * calculate(e2)
| #{'div: [e1, e2]}    -> calculate(e1) / calculate(e2)
| e where is_number(e) -> e

calculate(#{'add: [#{'mul: [10, 20]}, 500]})
// 700
calculate(#{'sub: [100, 20]})
// 80
calculate(#{'sub: [100, "ok"]})
//> error: no_match_found

The calculate function makes use of the where guard in the last pattern to make sure that the value that gets bound to the variable is a number.

In addition to the built-in data structures, user defined records can also be destructured:³


record circle(radius)
record rectangle(width, length)
    
function area(shape)
| circle(radius) -> 3.14159 * (radius * radius)
| rectangle(width, length) -> width * length

area(circle(radius = 10.34))
// 335.88497980399995
area(rectangle(width=20, length=52.78))
// 1055.6

A record pattern match can refer members by position. This is achieved by prefixing the member name by the @ character. (This means record members are not allowed to start with the @ character). Let us rewrite the area function by destructuring the record members by position:


function area(shape)
| circle(@r) -> 3.14159 * (@r * @r)
| rectangle(@w, @l) -> @w * @l

If a function with an implicit match takes more than one parameter, all the arguments should be packaged into a list and passed to the pattern matcher:


function f(a, b)
| [1, b] -> b * 10
| [2, b] -> b * 100

f(1, 2)
// 20
f(2, 2)
// 200

Slogan support or-patterns, which is a feature that allows us to collapse multiple clauses with identical right-hand sides into a single clause:


function f(xs)
| [a, b]
| #[a, b]
| #(a, b) -> a * b

f([10, 20])
// 200

f(#[10, 20])
// 200
          
f(#(10, 20))
// 200

Repetition of the same pattern can be avoided by using the special pattern variable %, which always refer to the previous pattern checked.


match ([1, 2, 3]) 
  [_, b, _] where b >= 10 -> 'hi
| %         where b >= 1  -> 'hello
// hello

5.7.1 A self-balancing search tree

Now that we have covered the basics, let us write some code that exploits the true expressive power of pattern matching! We are going to implement a data structure that is often flagged as "advanced" in text books. This is the red-black tree - one of the most popular of all balanced binary trees.

In a red-black tree every node is colored either red or black and it satisfies the following two balance invariants:

These invariants guarantee that the longest possible path in the tree is not longer than the shortest possible path times two. (The longest path has alternating red and black nodes and the shortest path has only black nodes.)

These invariants are enforced while inserting a new node, using a balance function. This function re-configures all possible black-red-red paths into a red-black-black path. The black-red-red paths can occur in four configurations, depending on whether the red node is a left or right child. The rewrite required is the same in all cases. Pattern matching makes it possible to write the balance function in a compact, declarative style:


function balance(color, t, z, d) 
| ['b, ['r,['r,a,x,b],y,c], z, d] 
| ['b, ['r,a,x,['r,b,y,c]], z, d] 
| ['b, a, x, ['r,['r,b,y,c],z,d]] 
| ['b, a, x, ['r,b,y,['r,c,z,d]]]
    -> ['r,['b,a,x,b],y,['b,c,z,d]]
| _ -> [color, t, z, d]

The complete code for the red-black tree is available for download. For a detailed description of a red-black tree structure similar to the one presented here, please see the book "Purely functional data structures" by Chris Okasaki.

¹By that definition, a string is also an aggregate data type because it is essentially an array of characters. We treated it as a basic data type because of its importance in performing many useful tasks, like searching for patterns in textual data.

²Pairs can be made out of any expressions. For e.g, the expression 1 + 2:100 will create the pair 3:100. Keep in mind that function call expressions tends to bind tightly to the right. So if you want to glue the two expressions 1 + inc(1) and 100 into a pair, the first expression must be enclosed in parenthesis, as in, (1 + inc(1)):100. This will result in the pair 3:100 as expected.

³Pattern based destructuring can be performed on user-defined composite types as well. Slogan allows you to define your own composites that behave like lists and hash tables. This will be the subject of a later chapter.