Getting an array of "characters" in Clojure

I'm continuing to noodle around in Clojure and needed to split a string into its component characters. If you know your string will be plain-old-ASCII, you can do:

(clojure.string/split "hello" #"")
;; ["h" "e" "l" "l" "o"]

But this gets funny if you want to split up a string that has emoji in it:

(def ghosties "👻👻👻")
(clojure.string/split ghosties #"")
;; ["?" "?" "?" "?" "?" "?"]

This is because emoji like this one are represented under the hood by two 16-bit UTF-16 code units, called a surrogate pair. When you split a string using the empty regex, it splits on each code unit, not on each character. This is a problem in JavaScript as well.
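To make the surrogate pair visible, here is a small sketch that compares the two ways of counting. The name `sample` is just a placeholder for illustration, and the type hints are only there to avoid reflection:

```clojure
;; Three ghost emoji; each is U+1F47B, which lies outside the Basic Multilingual Plane.
(def sample "👻👻👻")

;; count reports UTF-16 code units (Java chars): two per emoji.
(count sample)
;; => 6

;; codePointCount reports Unicode code points: one per emoji.
(.codePointCount ^String sample 0 (count sample))
;; => 3

;; The first char on its own is only the high half of a surrogate pair.
(Character/isHighSurrogate (.charAt ^String sample 0))
;; => true
```

The mismatch between 6 and 3 is exactly what the empty-regex split trips over.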

To solve this problem in Clojure, we need to reach into the Java APIs for manipulating strings and characters:

;; For a given integer code point, return its string representation
(defn codepoint-to-str [cpt]
  (-> (StringBuilder.)
      (.appendCodePoint cpt)
      (.toString)))

;; For a given string, split it into code points and generate a string
;; for each individual code point. The ^String hint avoids reflection.
(defn to-string-array [^String s]
  (map codepoint-to-str
       (iterator-seq (.iterator (.codePoints s)))))

(to-string-array ghosties)
;; ("👻" "👻" "👻")
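
If you would rather get a vector back and skip the StringBuilder, the same idea fits in one function. This is only a sketch: the name `code-point-strings` is made up for illustration, and it uses `Character/toChars` in place of `appendCodePoint`:

```clojure
;; Hypothetical single-function variant that returns a vector of strings.
(defn code-point-strings [^String s]
  (mapv (fn [cp]
          ;; Character/toChars yields one char for BMP code points
          ;; and a surrogate pair (two chars) for everything else.
          (String. (Character/toChars (int cp))))
        (iterator-seq (.iterator (.codePoints s)))))

(code-point-strings "a👻b")
;; => ["a" "👻" "b"]
```

It handles mixed ASCII and emoji input the same way, since every code point goes through the same conversion.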