Instructions
You have identified a gap in the social media market for very, very short posts.
Now that Twitter allows 280 character posts, people wanting quick social media updates aren’t being served.
You decide to create your own social media network.
To make your product noteworthy, you make it extreme and only allow posts of 5 or fewer characters.
Any posts of more than 5 characters should be truncated to 5.
To allow your users to express themselves fully, you allow Emoji and other Unicode.
The task is to truncate input strings to 5 characters.
Text Encodings
Text stored digitally has to be converted to a series of bytes.
Three ways of mapping characters to bytes are in common use.
- ASCII can encode English language characters.
All characters are precisely 1 byte long.
- UTF-8 is a Unicode text encoding.
Characters take between 1 and 4 bytes.
- UTF-16 is a Unicode text encoding.
Characters are either 2 or 4 bytes long.
UTF-8 and UTF-16 are both Unicode encodings which means they’re capable of representing a massive range of characters including:
- Text in most of the world’s languages and scripts
- Historic text
- Emoji
UTF-8 and UTF-16 are both variable-length encodings, which means that different characters take up different amounts of space.
Consider the letter ‘a’ and the emoji ‘😛’.
In UTF-16 the letter takes 2 bytes but the emoji takes 4 bytes.
The trick to this exercise is to use APIs designed around Unicode characters (code points) instead of Unicode code units.
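As a small illustration of the difference, the sketch below compares the letter and the emoji from the example above. It assumes an environment where `TextEncoder` is available as a global (modern browsers and recent Node.js versions); the variable names are only illustrative.

```javascript
const letter = 'a';
const emoji = '😛'; // U+1F61B, outside the Basic Multilingual Plane

// String.prototype.length counts UTF-16 code units (2 bytes each).
const letterUnits = letter.length; // 1 code unit, so 2 bytes in UTF-16
const emojiUnits = emoji.length; // 2 code units, so 4 bytes in UTF-16

// Spreading a string iterates by code point instead of by code unit.
const emojiCodePoints = [...emoji].length; // 1

// TextEncoder always encodes to UTF-8.
const utf8 = new TextEncoder();
const letterUtf8Bytes = utf8.encode(letter).length; // 1
const emojiUtf8Bytes = utf8.encode(emoji).length; // 4

console.log({ letterUnits, emojiUnits, emojiCodePoints, letterUtf8Bytes, emojiUtf8Bytes });
```

Because `length` and index-based methods such as `slice` count code units, calling `emoji.slice(0, 1)` would cut the emoji in half.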
Dig Deeper
Regex
// The family emoji is four emoji joined by zero-width joiners (U+200D).
let string = '👨\u200D👨\u200D👧\u200D👧💜🤧🤒🏥😀';
let string2 = 'The quick brown fox jumps over the lazy dog. It barked.';
const splitWithRegEx = (s) => s.match(/.{0,5}/gu)[0];
console.log(splitWithRegEx(string)); // will be 👨, U+200D, 👨, U+200D, 👧 (a cut-off family) - incorrect
console.log(splitWithRegEx(string2)); // will be "The q"
This solution:
- Uses the String.match() method with a supplied RegEx
- The RegEx supplied matches any character (.) between 0 and 5 times ({0,5}). The u flag enables Unicode support.
- With the u flag set, characters are matched by code point rather than by UTF-16 code unit.
- Then it returns the first match as the output string.
This approach will not yield the correct result when applied to grapheme clusters that are made of multiple
code points but are meant to represent a single visual unit, such as some emoji.
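To see why the result is wrong, it can help to list the code points that make up the family emoji. The sketch below writes the emoji with explicit escapes so that the zero-width joiners (U+200D) are visible; the variable names are illustrative only.

```javascript
// Man, ZWJ, man, ZWJ, girl, ZWJ, girl: seven code points, one visual unit.
const family = '\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F467}';

// Format each code point as U+XXXX.
const codePoints = Array.from(
  family,
  (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
);
console.log(codePoints);

// Matching at most five code points stops after the first girl,
// which leaves an incomplete family behind.
const firstFive = family.match(/.{0,5}/u)[0];
console.log([...firstFive].length); // 5
```

The joiners count towards the limit of five, so the match ends in the middle of what a reader would see as a single character.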
Iterators
// The family emoji is four emoji joined by zero-width joiners (U+200D).
let string = '👨\u200D👨\u200D👧\u200D👧💜🤧🤒🏥😀';
let string2 = 'The quick brown fox jumps over the lazy dog. It barked.';
const splitWithIterator = (s) => [...s].slice(0, 5).join('');
console.log(splitWithIterator(string)); // will be 👨, U+200D, 👨, U+200D, 👧 (a cut-off family) - incorrect
console.log(splitWithIterator(string2)); // will be "The q"
This solution:
- Uses spread syntax to unpack the string into an array of its characters.
- Internally, the spread syntax relies on the string iterator, which separates the string by its code points.
- Then it separates the first 5 characters (code points).
- Finally, it joins them back into a string.
This approach will not yield the correct result when applied to grapheme clusters that are made of multiple
code points but are meant to represent a single visual unit, such as some emoji.
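The same string iterator can also be consumed with a for...of loop, which makes it possible to stop as soon as the limit is reached instead of unpacking the whole string first. The following is a minimal sketch of that variant; `truncateByCodePoints` and its `limit` parameter are hypothetical names, and it shares the code point limitation described above.

```javascript
// Hypothetical helper: truncate by code points without building a full array.
const truncateByCodePoints = (s, limit = 5) => {
  let result = '';
  let count = 0;
  // for...of uses the string iterator, so each step yields one code point.
  for (const codePoint of s) {
    if (count === limit) break;
    result += codePoint;
    count += 1;
  }
  return result;
};

console.log(truncateByCodePoints('The quick brown fox')); // "The q"
```

For posts this short the difference is negligible, but the early exit avoids walking a very long input in full.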
Intl.Segmenter
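// The family emoji is four emoji joined by zero-width joiners (U+200D).
let string = '👨\u200D👨\u200D👧\u200D👧💜🤧🤒🏥😀';
const splitWithSegmenter = (s) =>
  Array.from(new Intl.Segmenter().segment(String(s)), (x) => x.segment)
    .slice(0, 5)
    .join('');
console.log(splitWithSegmenter(string)); // will be the whole family emoji followed by 💜🤧🤒🏥 - correct, yay!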
This solution:
- Uses the Intl.Segmenter object to split the string by graphemes and form an array from the result.
- Then it separates the first 5 graphemes.
- Finally, it joins them back into a string.
At the time of writing (February 2024) this method is not fully supported by the stable release of the Mozilla Firefox browser.
However, support for the Intl.Segmenter object is being worked on in the Nightly release of the browser.
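Given the uneven support, one option is to detect Intl.Segmenter at runtime and fall back to code points where it is missing. The sketch below is one possible way to do this, not part of the original solutions; `truncatePost` and its `limit` parameter are hypothetical names.

```javascript
// Hypothetical helper: prefer grapheme segmentation, fall back to code points.
const truncatePost = (s, limit = 5) => {
  const text = String(s);
  const hasSegmenter =
    typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function';

  const units = hasSegmenter
    ? Array.from(
        new Intl.Segmenter(undefined, { granularity: 'grapheme' }).segment(text),
        (x) => x.segment
      )
    : [...text]; // fallback: code points, which can split joined emoji

  return units.slice(0, limit).join('');
};

console.log(truncatePost('The quick brown fox')); // "The q"
```

In environments without Intl.Segmenter the fallback still handles most text correctly, but it has the same limitation as the iterator approach for emoji built from several code points.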
Source: Exercism javascript/micro-blog