Sunday, February 16, 2020

Mini Regex


Hacker News linked to Brian Kenighan's post about a small regex matcher, which seemed like a good bite-sized problem to try implementing -- both in general and for practicing Haskell, particularly with this being a pattern matching problem.

Following Kernighan's specifications, the regex language is limited to:

    c    matches any literal character c
    .    matches any single character
    ^    matches the beginning of the input string
    $    matches the end of the input string
    *    matches zero or more occurrences of the previous character


To make the exercise a bit more stateful, I chose to implement find (return the left-most occurrence of a pattern that matches the regex, if it exists), instead of just indicating the presence of a match. There is minimum lookahead required -- just one character to detect boundaries for the Kleene star.

I first tried to implement find with a lazy map over the tails of the input (returning the head as the left-most match), but ran into the problem of inputs with no matches. Since I couldn't find an out-of-box function, mapUntil implements a map that stops when a non-default value is found, and otherwise returns the default value. This also has the benefit of using an empty string to indicate no match, which is more direct here than, say, using Maybe. Always wrapping and unwrapping the input with the start and end sentinels seemed like a reasonable price to avoid introducing special cases later on.


find :: Regex -> String -> Match
find r s = unwrap $ mapUntil "" (findWord r) (tails $ wrap s)
  where wrap xs = ('^':xs) ++ "$"
        unwrap xs = [c | c <- xs, c `notElem` ['^', '$']]

mapUntil :: Match -> (String -> Match -> Match) -> [String] -> Match
mapUntil d _ [] = d
mapUntil d f (x:xs) = if m /= d then m else mapUntil d f xs
  where m = f x ""


The core matching function covers four patterns: non-star regex, star regex, the regex is consumed (success), or otherwise failure.


findWord :: Regex -> String -> Match -> Match
findWord (x:'*':xs) s@(y:ys) ms = findWord xs ys' (reverse ms'++ ms)
  where (ms', ys') = if not (match x y) then ("", s) else splitWhile (matchStar x xs) s
findWord (x:xs) (y:ys) ms = if match x y then findWord xs ys (y:ms) else ""
findWord "" _ ms = reverse ms
findWord _ _ ms = ""

Match is very simple. MatchStar needs a bit of extra state, since '.*' matching should stop if the next part of the regex matches input.

match :: Char -> Char -> Bool
match x y = x == '.' || x == y

matchStar :: Char -> String -> Char -> Bool
matchStar x [] y = match x y
matchStar x (x':xs) y = (x == '.' && x' /= y) || x == y


SplitWhile seems like it should exist... it calls both takeWhile and dropWhile, so could be rewritten more efficiently to require only single pass over the input.


splitWhile :: (a -> Bool) -> [a] -> ([a], [a])
splitWhile f xs = (takeWhile f xs, dropWhile f xs)


There are a few reverses and concatenations, which weren't immediately obvious to me to refactor into more efficient patterns. Code with some hand-written test cases: https://gist.github.com/tkuriyama/f60210a0e6dfa5bc0872494afaa0030b