Intermediate Exercism • haskell

Hamming

Challenge Overview

Calculate the Hamming difference between two DNA strands.

Visit the hamming exercise on Exercism to read the full instructions and download the exercise files.

Dig Deeper

zipWith

`zipWith`

distance :: String -> String -> Maybe Int
distance xs ys
  | length xs /= length ys = Nothing
  | otherwise = Just $ length $ filter id (zipWith (/=) xs ys)

The most straightforward way of solving this problem is to

first check that the lengths are equal, and then
iterate over both inputs together to count their differences.

Higher-order functions

Higher-order functions are functions that take functions as arguments. Examples well-known even outside of Haskell are map and filter.

map applies a function to all elements of a list. It returns a list of all the results.
```
-- >>> map (* 2) [1 .. 5]
-- [2,4,6,8,10]
```
filter applies a predicate to all elements of a list. It returns a list with only those elements for which predicate returned True.
```
-- >>> filter odd [1 .. 5]
-- [1,3,5]
```

Another example that is extremely common in Haskell is the function composition operator (.), which ‘chains’ two functions into one.

reverseSort :: [Int] -> [Int]
reverseSort = reverse . sort  -- first sort, then reverse

-- >>> reverseSort [3, 1, 4, 1, 5, 9]
-- [9,5,4,3,1,1]

Still other examples include

zipWith, which combines two lists into one using a element-combining function.
```
-- >>> zipWith (+) [10, 20, 30] [1, 2, 3]
-- [11,22,33]
```
uncurry, which turns a function of two arguments into one that accepts a tuple.
```
tupleMinus = uncurry (-)

-- >>> tupleMinus (23, 4)
-- 19
```
fmap, which does the same as map except it also works for some types that aren’t lists.
```
-- >>> fmap (+ 1) [1 .. 5]
-- [2,3,4,5,6]

-- >>> fmap (+ 1) (Just 7)
-- Just 8
```

In this approach

After making sure that the lengths are equal, we count the number of places in which the inputs differ.

The /= operator compares two values and returns True precisely when they are unequal.

-- >>> 4 /= 5
-- True

-- >>> 2 /= 2
-- False

We use zipWith to walk both input lists simultaneously, marking unequal pairs using (/=) as we go.

comparisons =
  zipWith (/=)
    [3, 2, 6]
    [5, 2, 4, 7]

-- >>> comparisons
-- [True,False,True]

zipWith stops as soon as one of its argument lists stops. This is the reason we need to check the lengths separately.

Now, the number of differences between the two inputs is exactly the number of Trues in the list produced by zipWith. We count them using filter and length.

We need to give filter a predicate to filter by. In this case, this predicate should return True when given True and False for False. A function that does this already exists in the Prelude: id is the function that returns its argument unchanged.

We could also have zipped the two lists using the simpler zip function.

pairs =
  zip
    [3, 2, 6]
    [5, 2, 4, 7]

-- >>> pairs
-- [(3,5),(2,2),(6,4)]

In that case we would have needed to count the pairs that have the same value in both places. This can still be done using filter, but the required predicate is more complicated. uncurry (/=) is a function that takes a tuple and returns True when the tuple has equal values in both places.

distance :: String -> String -> Maybe Int
distance xs ys
  | length xs /= length ys = Nothing
  | otherwise = Just $ length $ filter (uncurry (/=)) (zip xs ys)

Considerations on this approach

This style of solution is very easy to understand. This is a very important quality for code to have! Code is primarily meant for humans to reason about, after all.

On the other hand, this solution suffers an inefficiency. Haskell’s lists are linked lists. Therefore, length needs to walk its argument entirely. This can take a lot of time for long lists.

length is called three times in this approach, resulting in three separate walks over the inputs. The explicit recursion and worker–wrapper approaches avoid this and instead walk the input exactly once.

Recursion

distance :: String -> String -> Maybe Int
distance [] [] = Just 0
distance (x : xs) (y : ys) =
  (if x /= y then (1 +) else id) <$> distance xs ys
distance _ _ = Nothing

Recursion

Recursion in Haskell is the phenomenon of functions or values being defined in terms of themselves. For example, here is a recursive definition of an (infinite) list:

-- A recursive (constant) value
ones :: [Integer]
ones = 1 : ones
--          👆
--      recursion

-- `ones` is infinite, so take care to
-- ask for only finitely many elements

-- >>> take 5 ones
-- [1,1,1,1,1]

And here is a recursive definition of the well-known factorial function:

-- A recursive function
factorial :: Integer -> Integer
factorial n
  | n <= 0    = 1
  | otherwise = n * factorial (n - 1)
--                      👆
--                  recursion

-- >>> factorial 5 == 5 * 4 * 3 * 2 * 1 * 1
-- True

This factorial function will always produce an output, because

it contains a base case that directly results in output, and
it only ever recursively applies itself to ‘smaller’ values than its own argument, and
any series of ever ‘smaller’ values is guaranteed to eventually hit one of the base cases.

In the case of factorial 4 above we have

factorial 4
 = (4 * factorial 3)
 = (4 * (3 * factorial 2))
 = (4 * (3 * (2 * factorial 1)))
 = (4 * (3 * (2 * (1 * factorial 0))))
 = (4 * (3 * (2 * (1 * 1))))
 = (4 * (3 * (2 * 1)))
 = (4 * (3 * 2))
 = (4 * 6)
 = 24

Haskell does not provide any dedicated loop devices such as many other languages’ for and while constructs. Instead, all ‘loopiness’ in Haskell is produced through recursion – if not by you then under the hood of some of the functions you use.

In this approach

If both inputs are empty, then they are of equal length and they do not have any differing elements, so Just 0 should be returned.

If one input is empty but the other isn’t, then the inputs certainly are of unequal length, and Nothing should be returned.

If both inputs are nonempty, we recursively determine the hamming distance between their tails. We also compare their heads. If the heads are unequal we should add 1 to the distance between the tails; otherwise we should leave it as is.

The recursive call produces a Maybe Int rather than just an Int, so we cannot just increment it. Nothing that a bit of pattern matching won’t fix, however:

distance (x : xs) (y : ys) =
  case distance xs ys of
    Nothing -> Nothing
    Just d -> Just (if x /= y then 1 + d else d)

This pattern of mapping Nothing to Nothing and Just to Just is very common. The function fmap was invented to do it for us:

-- >>> fmap (+1) Nothing
-- Nothing

-- >>> fmap (+1) (Just 6)
-- Just 7

It makes the code a lot more compact.

distance (x : xs) (y : ys) =
  fmap (\d -> if x /= y then 1 + d else d) distance xs ys

The solution highlighted at the top is slightly different in two ways.

First, it uses the <$> operator. Being a synonym of fmap, it does exactly the same thing: fmap f xs == f <$> xs always. Sometimes using the operator results in more readable code, but it might take some getting used to.

distance (x : xs) (y : ys) =
  (\d -> if x /= y then 1 + d else d) <$> distance xs ys

Second, it uses an if that produces one of two functions instead of one function that produces one of two integers.

Because functions are ordinary values like any other in Haskell, we can have them be produced by if expressions just like other types of values:

someFunction b = if b then (* 2) else (+ 3)

-- >>> someFunction True 4
-- 8

-- >>> someFunction False 4
-- 7

The highlighted solution chooses what to do to with the result of the recursive call based on the comparison of the heads. If they are unequal, it applies an increment ((1 +)). Otherwise, it applies the do-nothing function (id).

distance :: String -> String -> Maybe Int
distance [] [] = Just 0
distance (x : xs) (y : ys) =
  (if x /= y then (1 +) else id) <$> distance xs ys
distance _ _ = Nothing

Considerations on this approach

In a way, this solution is elegant. However, it suffers an inefficiency.

Suppose the inputs are

xs = [0, 0, 0, 0]
ys = [0, 1, 0, 1]

Then evaluation of distance xs ys might go as follows.

distance xs ys
  = distance [0, 0, 0, 0] [0, 1, 0, 1]
  = id <$> ( distance [0, 0, 0] [1, 0, 1] )
  = id <$> ( (1 +) <$> ( distance [0, 0] [0, 1] ) )
  = id <$> ( (1 +) <$> ( id <$> ( distance [0] [1] ) ) )
  = id <$> ( (1 +) <$> ( id <$> ( (1 +) <$> distance [] [] ) ) )
  = id <$> ( (1 +) <$> ( id <$> ( (1 +) <$> Just 0 ) ) )
  = id <$> ( (1 +) <$> ( id <$> Just 1 ) )
  = id <$> ( (1 +) <$> Just 1  )
  = id <$> Just 2
  = Just 2

A tower of nested <$> applications forms! For every pair of elements in the inputs a <$> layer is added. For long lists this can take a lot of memory.

It is impossible to collapse the tower early. <$> needs to know whether its right operand is a Nothing or a Just, but this does not become clear until the end of at least one of the lists has been reached.

For an example of how to avoid this problem, see the worker–wrapper approach.

Worker–wrapper

distance :: String -> String -> Maybe Int
distance = go 0
  where
    go !n (x : xs) (y : ys) = go (if x == y then n else n + 1) xs ys
    go !n [] [] = Just n
    go _ _ _ = Nothing

This approach uses an inner worker function to simultaneously walk both lists and keep track of the number of encountered differences. It also uses a bang pattern to force intermediate evaluation, to guarantee decent space efficiency.

The worker–wrapper construct

Sometimes, when solving a problem, it is more convenient or efficient to keep track of some kind of state. Many other languages use local variables for this. However, variables do not exist in Haskell: all bindings are constants.

In lieu of local variables we can add extra parameters to our functions. These will hold our state. ‘Changing’ the state is done by calling the same functions (recursively) with different arguments.

Functions that uses such additional parameters to represent state are often called workers, and functions that use workers to solve a problem in a stateful way are often called wrappers.

As an example, consider the following possible definition of length.

length :: [a] -> Int
length []     = 0
length (_:xs) = 1 + length xs

One problem with this function is that it builds large computations that take up potentially a lot of memory:

_ =  length "abcde"
  == 1 + length "bcde"
  == 1 + (1 + length "cde")
  == 1 + (1 + (1 + length "de"))
  == 1 + (1 + (1 + (1 + length "e")))
  == 1 + (1 + (1 + (1 + (1 + length ""))))
  -- this computation is as large as the list was long!
  == 1 + (1 + (1 + (1 + (1 + 0))))
  == 1 + (1 + (1 + (1 + 1)))
  == 1 + (1 + (1 + 2))
  == 1 + (1 + 3)
  == 1 + 4
  == 5

It would be more efficient to keep track of a running total number of encountered elements. To do this we can use a variant of length with an additional parameter:

length' :: Int -> [a] -> Int
length' count []     = count
length' count (_:xs) = length' (1 + count) xs

Here the parameter count is used to keep track of the number of previously encountered elements. It might evaluate as follows.

_ =  length' 0 "abcde"
  == length' 1 "bcde"
  == length' 2 "cde"
  == length' 3 "de"
  == length' 4 "e"
  == length' 5 ""
  == 5

The computation stays small at every step – much more efficient!

To obtain an efficient function length with the familiar [a] -> Int type – instead of the more involved Int -> [a] -> Int – we can ‘wrap’ length' in a function that provides it with initial arguments:

length :: [a] -> Int
length xs = length' 0 xs
  where
    length' count []     = count
    length' count (_:xs) = length' (1 + count) xs

Here length is known as the wrapper, and length' as the worker. Hence, this is an example of the worker–wrapper construct.

The worker is commonly called go, loop, or f' when the wrapper is called f.

The worker–wrapper pattern is more general than demonstrated here. For a few more more examples, see the relevant wiki article, and for yet more examples and an explanation of the general pattern see the paper «The worker/wrapper transformation» (pdf).

Bang patterns

The above implementation of length' might evaluate as efficiently as illustrated. However, depending on which optimization opportunities are spotted by the compiler, it also might not! The definition of length' allows evaluation to proceed as follows as well.

_ =  length' 0 "abcde"
  == length' (1 + 0) "bcde"
  == length' (1 + (1 + 0)) "cde"
  == length' (1 + (1 + (1 + 0))) "de"
  == length' (1 + (1 + (1 + (1 + 0)))) "e"
  == length' (1 + (1 + (1 + (1 + (1 + 0))))) ""
  == 1 + (1 + (1 + (1 + (1 + 0))))
  == 1 + (1 + (1 + (1 + 1)))
  == 1 + (1 + (1 + 2))
  == 1 + (1 + 3)
  == 1 + 4
  == 5

Still a computation as large as the list was long is built. The situation is not quite as bad as before though.

In the case of the naive implementation of length, the built up expression has the form

_ = 1 + (1 + (⋯(1 + length …)⋯))

This expression cannot be simplified before length … has been evaluated. That is, simplification cannot begin until the last recursive step has been evaluated. As a consequence, the computation must grow as big as the original list.

The computation built up by length' on the other hand has the form

_ = 1 + (1 + (⋯(1 + 0)⋯))

This can be simplified! In fact, as illustrated in the previous section, it can be simplified at every individual step.

Haskell offers various ways to force intermediate evaluation. One convenient tool is the bang pattern, enabled by the BangPatterns language extension.

{-# LANGUAGE BangPatterns #-}  -- 👈 at the top of the file

length'' :: Int -> [a] -> Int
length'' !count []     = count
length'' !count (_:xs) = length'' (1 + count) xs
--       👆
--  bang patterns

Bang patterns force evaluation of arguments at function invocation. Consequently, this length'' (at worst) evaluates as follows.

_ =  length'' 0 "abcde"
  == length'' (1 + 0) "bcde"
  == length'' (1 + 1) "cde"
  == length'' (1 + 2) "de"
  == length'' (1 + 3) "e"
  == length'' (1 + 4) ""
  == 1 + 4
  == 5

Here, at each step the (1 + …) argument is reduced to a single number first, before again 1 is added to it.

We can eliminate even that last recurring + using another strictness management tool: the seq function.

length''' :: Int -> [a] -> Int
length''' count []     = count
length''' count (_:xs) =
  let count' = 1 + count
   in count' `seq` length''' count' xs

x `seq` y always returns y, but additionally guarantees that x is evaluated if y is evaluated. Here, length''' uses it to guarantee that count' is evaluated ‘before’ it is passed to the recursive length''' call. As a result, length''' evaluates optimally.

A full discussion of strictness in Haskell here would take up too much space. Please consult your other learning resources. The track docs include an article on Haskell learning resources.

Community Solutions

View Community Solutions

The following are the top 3 community solutions by stars:

Solution 1 by @JEphron (⭐ 5) — View on Exercism
Solution 2 by @jeffwindsor (⭐ 0) — View on Exercism
Solution 3 by @trezona-lecomte (⭐ 0) — View on Exercism

Source: Exercism haskell/hamming

💡 View Solution

module Hamming (distance) where

distance :: String -> String -> Maybe Int
distance xs ys 
    | length xs /= length ys = Nothing
    | otherwise = Just $ length $ filter id $ zipWith (/=) xs ys