Visit the protein-translation exercise on Exercism to read the full instructions and download the exercise files.
Dig Deeper
Recursion
Recursion
module ProteinTranslation
let proteins (rna: string): string list =
let rec doProteins (rna: string) (proteins: string list): string list =
match rna[0..2] with
| "AUG" -> doProteins rna[3..] ("Methionine" :: proteins)
| "UUC" | "UUU" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" | "UUG" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UCU" | "UCC" | "UCA" | "UCG" -> doProteins rna[3..] ("Serine" :: proteins)
| "UAU" | "UAC" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UGU" | "UGC" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGG" -> doProteins rna[3..] ("Tryptophan" :: proteins)
| "UAA" | "UAG" | "UGA" | "" -> List.rev proteins
| _ -> failwith "Unknown coding"
doProteins rna []
In this approach, we’ll define a recursive function that will recursively process the RNA sequence and keep track of the translated proteins.
Recursive translation
To use (tail call) recursion to translate the RNA to proteins, we’ll introduce a helper function: doProteins.
This function takes the remaining, unprocessed RNA and a list of translated proteins (the accumulator value):
let rec doProteins (rna: string) (proteins: string list): string list
We’ll define this function inside the proteins function (also known as a nested function), but it could just as well have been defined outside the proteins function.
That said, its implementation is merely a helper to the proteins function and is thus tied to that function, so to have it be close to where it is called often makes sense (it signals to the reader that the function should only be used within its parent function).
To allow a function to recursively call itself, the `rec` modified must be added.
In other words: by default, functions cannot call themselves.
Translating
As each codon is three letters long, the doProteins function looks at the first three letters of its codons parameter.
For each translateable codon, we recursively call the doProteins function, with the remainder of the codons (skipping the first three letters) and the codon’s protein added to the proteins accumulator value as arguments.
match rna[0..2] with
| "AUG" -> doProteins rna[3..] ("Methionine" :: proteins)
| "UUC" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUU" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UUG" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UCU" -> doProteins rna[3..] ("Serine" :: proteins)
| "UCC" -> doProteins rna[3..] ("Serine" :: proteins)
| "UCA" -> doProteins rna[3..] ("Serine" :: proteins)
| "UCG" -> doProteins rna[3..] ("Serine" :: proteins)
| "UAU" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UAC" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UGU" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGC" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGG" -> doProteins rna[3..] ("Tryptophan" :: proteins)
Stopping
Next up is to handle the “STOP” proteins.
We’ll add branches for each of the three “STOP” proteins, which stop the recursion and instead return the (reversed) accumulator value:
| "UAA" -> List.rev proteins
| "UAG" -> List.rev proteins
| "UGA" -> List.rev proteins
There is one additional case we need to process, and that is when there are no codons left to process:
| "" -> List.rev proteins
We need to use [`List.rev`](https://fsharp.github.io/fsharp-core-docs/reference/fsharp-collections-listmodule.html#rev) to reverse the proteins, as translated proteins are added at the head of the list (in the front).
Prepending an element to a list is _much_ faster than appending an element.
In fact, it is so much faster that the penalty of having to reverse the list ends up being well worth it.
At this point, the F# compiler lets us know that we haven’t handled all possible cases, which is true.
The easiest thing to do is to throw an error if none of the previous branches matched:
| _ -> failwith "Unknown coding"
Combining patterns
You might have noticed that many of the branches end up doing the exact same thing.
F# allows one to “chain” multiple patterns to reduce the duplication:
match rna[0..2] with
| "AUG" -> doProteins rna[3..] ("Methionine" :: proteins)
| "UUC" | "UUU" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" | "UUG" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UCU" | "UCC" | "UCA" | "UCG" -> doProteins rna[3..] ("Serine" :: proteins)
| "UAU" | "UAC" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UGU" | "UGC" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGG" -> doProteins rna[3..] ("Tryptophan" :: proteins)
| "UAA" | "UAG" | "UGA" | "" -> List.rev proteins
| _ -> failwith "Unknown coding"
Alignment
While definitely not needed, aligning the code vertically makes it more clear that the codon patterns all end up doing basically the same thing, but with a different protein:
match rna[0..2] with
| "AUG" -> doProteins rna[3..] ("Methionine" :: proteins)
| "UUC" | "UUU" -> doProteins rna[3..] ("Phenylalanine" :: proteins)
| "UUA" | "UUG" -> doProteins rna[3..] ("Leucine" :: proteins)
| "UCU" | "UCC" | "UCA" | "UCG" -> doProteins rna[3..] ("Serine" :: proteins)
| "UAU" | "UAC" -> doProteins rna[3..] ("Tyrosine" :: proteins)
| "UGU" | "UGC" -> doProteins rna[3..] ("Cysteine" :: proteins)
| "UGG" -> doProteins rna[3..] ("Tryptophan" :: proteins)
| "UAA" | "UAG" | "UGA" | "" -> List.rev proteins
| _ -> failwith "Unknown coding"
A downside of vertical alignment is that changes to the code require more work, as you'll need to ensure everything is still aligned.
For this particular case, it isn't really an issue, as the codons are fixed and the code is thus unlikely to change.
Putting it all together
The final step is to call our recursive helper function:
doProteins rna []
And with that, we have a working, tail recursive implementation that translates the RNA to proteins.
Tail recursion prevents stack overflows when a recursive function is called many times.
While the exercise does not have large test cases that would cause a stack overflow, it is good practice to always use using tail recursion when implementing a recursive functions.
If you'd like to read more about tail recursion, [this MSDN article](https://blogs.msdn.microsoft.com/fsharpteam/2011/07/08/tail-calls-in-f/) goes into more detail.
Another good resource on tail recursion is [this blog post](http://blog.ploeh.dk/2015/12/22/tail-recurse/).
Unfold
Unfold
module ProteinTranslation
let proteins (rna: string): string list =
let doProteins (rna: string): (string * string) option =
match rna[0..2] with
| "AUG" -> Some ("Methionine", rna[3..])
| "UUC" | "UUU" -> Some ("Phenylalanine", rna[3..])
| "UUA" | "UUG" -> Some ("Leucine", rna[3..])
| "UCU" | "UCC" | "UCA" | "UCG" -> Some ("Serine", rna[3..])
| "UAU" | "UAC" -> Some ("Tyrosine", rna[3..])
| "UGU" | "UGC" -> Some ("Cysteine", rna[3..])
| "UGG" -> Some ("Tryptophan", rna[3..])
| "UAA" | "UAG" | "UGA" | "" -> None
| _ -> failwith "Unknown coding"
List.unfold doProteins rna
The protein translation algorithm is basically a while loop that accumulates values and terminates when the “STOP” protein is reached or when there are no more codons left to translate.
This is a common enough pattern, that F# has a built-in function for this use case: List.unfold, which we’ll use in this approach.
List.unfold
The List.unfold function takes two arguments:
- A function that takes a value and returns a value pair wrapped in an
Option<T>
- The initial value
The function can return either:
Some (returnValue, newValue): continue executing, whilst adding the first value (returnValue) to the list of values to return, and use newValue as the value for the next call to the lambda
None: stop execution
Once the function returns None, the accumulated return values are returned.
Unfolding the proteins
Now that we know how List.unfold works, let’s use it to translate our codons.
Let’s define a doProteins function that takes a string parameter representing the codons left to translate, and returns a (string * string) option.
The string pair consists of the translated protein and the remaining, unprocessed RNA:
let doProteins (rna: string): (string * string) option
We’ll define this function inside the proteins function (also known as a nested function), but it could just as well have been defined outside the proteins function.
That said, its implementation is tied to the List.unfold call, so have it be close to that often makes sense (it signals to the reader that the function should only be used within its parent function).
Translating
As each codon is three letters long, the doProteins function looks at the first three letters of its codons parameter.
For each translateable codon, we return a pair of strings, the first being its protein translation and the second being the remainder of the codons (skipping the first three letters).
We’ll wrap the pair in Some to signal List.unfold to continue processing:
match rna[0..2] with
| "AUG" -> Some ("Methionine", rna[3..])
| "UUC" -> Some ("Phenylalanine", rna[3..])
| "UUU" -> Some ("Phenylalanine", rna[3..])
| "UUA" -> Some ("Leucine", rna[3..])
| "UUG" -> Some ("Leucine", rna[3..])
| "UCU" -> Some ("Serine", rna[3..])
| "UCC" -> Some ("Serine", rna[3..])
| "UCA" -> Some ("Serine", rna[3..])
| "UCG" -> Some ("Serine", rna[3..])
| "UAU" -> Some ("Tyrosine", rna[3..])
| "UAC" -> Some ("Tyrosine", rna[3..])
| "UGU" -> Some ("Cysteine", rna[3..])
| "UGC" -> Some ("Cysteine", rna[3..])
| "UGG" -> Some ("Tryptophan", rna[3..])
Stopping
Next up is to handle the “STOP” proteins.
We’ll add branches for each of the three “STOP” proteins, which stop execution by returning None:
| "UAA" -> None
| "UAG" -> None
| "UGA" -> None
There is one additional case we need to process, and that is when there are no codons left to process:
| "" -> None
At this point, the F# compiler lets us know that we haven’t handled all possible cases, which is true.
The easiest thing to do is to throw an error if none of the previous branches matched:
| _ -> failwith "Unknown coding"
Combining patterns
You might have noticed that many of the branches end up doing the exact same thing.
F# allows one to “chain” multiple patterns to reduce the duplication:
match rna[0..2] with
| "AUG" -> Some ("Methionine", rna[3..])
| "UUC" | "UUU" -> Some ("Phenylalanine", rna[3..])
| "UUA" | "UUG" -> Some ("Leucine", rna[3..])
| "UCU" | "UCC" | "UCA" | "UCG" -> Some ("Serine", rna[3..])
| "UAU" | "UAC" -> Some ("Tyrosine", rna[3..])
| "UGU" | "UGC" -> Some ("Cysteine", rna[3..])
| "UGG" -> Some ("Tryptophan", rna[3..])
| "UAA" | "UAG" | "UGA" | "" -> None
| _ -> failwith "Unknown coding"
Alignment
While definitely not needed, aligning the code vertically makes it more clear that the codon patterns all end up doing basically the same thing, but with a different protein:
match rna[0..2] with
| "AUG" -> Some ("Methionine" , rna[3..])
| "UUC" | "UUU" -> Some ("Phenylalanine", rna[3..])
| "UUA" | "UUG" -> Some ("Leucine" , rna[3..])
| "UCU" | "UCC" | "UCA" | "UCG" -> Some ("Serine" , rna[3..])
| "UAU" | "UAC" -> Some ("Tyrosine" , rna[3..])
| "UGU" | "UGC" -> Some ("Cysteine" , rna[3..])
| "UGG" -> Some ("Tryptophan" , rna[3..])
| "UAA" | "UAG" | "UGA" | "" -> None
| _ -> failwith "Unknown coding"
A downside of vertical alignment is that changes to the code require more work, as you'll need to ensure everything is still aligned.
For this particular case, it isn't really an issue, as the codons are fixed and the code is thus unlikely to change.
Step-by-step execution
Let’s run through the List.unfold calls to get a better feel for it working as intended:
| Remaining RNA | Lambda return | Return values |
|---|
"AUGUUUUGG" | Some ("Methionine", "UUUUGG") | ["Methionine"] |
"UUUUGG" | Some ("Phenylalanine", "UGG") | ["Methionine"; "Phenylalanine"] |
"UGG" | Some ("Tryptophan", "") | ["Methionine"; "Phenylalanine"; "Tryptophan"] |
"" | None | ["Methionine"; "Phenylalanine"; "Tryptophan"] |
You can see that we process the codons step by step, slowly building up the return values and returning them once we’ve processed them all.
Putting it all together
Finally, we can put it all to together by piping the RNA into List.unfold with our doProteins function as its first argument:
List.unfold doProteins rna
We now have a working implementation that translates the RNA to proteins.
Seq module
Seq module
module ProteinTranslation
let private codonToProtein (codon: string): string =
match codon with
| "AUG" -> "Methionine"
| "UUC" | "UUU" -> "Phenylalanine"
| "UUA" | "UUG" -> "Leucine"
| "UCU" | "UCC" | "UCA" | "UCG" -> "Serine"
| "UAU" | "UAC" -> "Tyrosine"
| "UGU" | "UGC" -> "Cysteine"
| "UGG" -> "Tryptophan"
| "UAA" | "UAG" | "UGA" -> "STOP"
| _ -> failwith "Invalid codon"
let proteins (rna: string): string list =
rna
|> Seq.chunkBySize 3
|> Seq.map System.String
|> Seq.map codonToProtein
|> Seq.takeWhile (fun protein -> protein <> "STOP")
|> Seq.toList
This approach combines a number of functions from the Seq module to build up a pipeline to translate the RNA to proteins.
Translating
Let’s define a codonToProtein function that takes a string parameter representing the codon and returns the translated protein, also as a string:
let private codonToProtein (codon: string): string
We could have defined this function as a nested function within the `proteins` function, but as it could reasonably be used _outside_ the `proteins` function, we chose not to.
Within the function, we simply pattern match on the codon to translate it to its corresponding protein:
match codon with
| "AUG" -> "Methionine"
| "UUC" -> "Phenylalanine"
| "UUU" -> "Phenylalanine"
| "UUA" -> "Leucine"
| "UUG" -> "Leucine"
| "UCU" -> "Serine"
| "UCC" -> "Serine"
| "UCA" -> "Serine"
| "UCG" -> "Serine"
| "UAU" -> "Tyrosine"
| "UAC" -> "Tyrosine"
| "UGU" -> "Cysteine"
| "UGC" -> "Cysteine"
| "UGG" -> "Tryptophan"
| "UAA" -> "STOP"
| "UAG" -> "STOP"
| "UGA" -> "STOP"
| _ -> failwith "Invalid codon"
Combining patterns
You might have noticed that many of the branches end up doing the exact same thing.
F# allows one to “chain” multiple patterns to reduce the duplication:
match codon with
| "AUG" -> "Methionine"
| "UUC" | "UUU" -> "Phenylalanine"
| "UUA" | "UUG" -> "Leucine"
| "UCU" | "UCC" | "UCA" | "UCG" -> "Serine"
| "UAU" | "UAC" -> "Tyrosine"
| "UGU" | "UGC" -> "Cysteine"
| "UGG" -> "Tryptophan"
| "UAA" | "UAG" | "UGA" -> "STOP"
| _ -> failwith "Unknown coding"
Alignment
While definitely not needed, aligning the code vertically makes it more clear that the codon patterns all end up doing basically the same thing, but with a different protein:
match codon with
| "AUG" -> "Methionine"
| "UUC" | "UUU" -> "Phenylalanine"
| "UUA" | "UUG" -> "Leucine"
| "UCU" | "UCC" | "UCA" | "UCG" -> "Serine"
| "UAU" | "UAC" -> "Tyrosine"
| "UGU" | "UGC" -> "Cysteine"
| "UGG" -> "Tryptophan"
| "UAA" | "UAG" | "UGA" -> "STOP"
| _ -> failwith "Unknown coding"
Using function instead of match
As the codonToProtein function has one argument on which the function then immediately starts pattern matching, one could rewrite it using the function keyword:
let private codonToProtein =
function
| "AUG" -> "Methionine"
| "UUC" | "UUU" -> "Phenylalanine"
| "UUA" | "UUG" -> "Leucine"
| "UCU" | "UCC" | "UCA" | "UCG" -> "Serine"
| "UAU" | "UAC" -> "Tyrosine"
| "UGU" | "UGC" -> "Cysteine"
| "UGG" -> "Tryptophan"
| "UAA" | "UAG" | "UGA" -> "STOP"
| _ -> failwith "Invalid codon"
What function does is to return a one-parameter function that pattern matches on that parameter.
The following definitions are thus functional equivalent:
let private codonToProtein codon =
match codon with
| "AUG" -> "Methionine"
let private codonToProtein =
function
| "AUG" -> "Methionine"
let private codonToProtein =
fun codon ->
match codon with
| "AUG" -> "Methionine"
Whilst they are all valid, the first option (function with parameter and explicit match) is preferred for readability.
Putting it all together
Now that we can translated RNA to proteins, let’s build up a working solution.
Split RNA sequence into codons
First, let’s split our rna parameter into chunks of three letters:
rna
|> Seq.chunkBySize 3
|> Seq.map System.String
The Seq.chunkBySize function will split the string into a sequence of three letters.
We then use Seq.map to convert those three letter sequences into strings (basically: concatenating the letters), which represent the codons.
Converting to proteins
Next up is converting the codons to proteins, once again using Seq.map:
|> Seq.map codonToProtein
Stopping
Then we’ll need to stop processing when we encounter a “STOP” protein.
For that we can use Seq.takeWhile, where we pass in a lambda function that checks if the protein is not the “STOP” protein.
It is isn’t the element is preserved, and the next element is checked, until either there are no elements or a “STOP” protein is found (this protein is not included in the results).
|> Seq.takeWhile (fun protein -> protein <> "STOP")
One could also write the above as:
```fsharp
|> Seq.takeWhile ((<>) "STOP")
```
However, this is arguably less readable.
Converting to a list
Finally, we convert the sequence to a list via Seq.toList:
|> Seq.toList
Combining it all
This gives us the following pipeline:
let proteins (rna: string): string list =
rna
|> Seq.chunkBySize 3
|> Seq.map System.String
|> Seq.map codonToProtein
|> Seq.takeWhile (fun protein -> protein <> "STOP")
|> Seq.toList
We now have a working implementation that translates the RNA to proteins.
Source: Exercism fsharp/protein-translation