Friday, November 13, 2020

Elm Codegen Quirks with Servant-Elm and Persistent

Developing further on the previous post, building some slightly less trivial examples with the combination of servant-elm and persistent (with Sqlite backend) yields some interesting problems in the generated Elm code. 

1. Phantom Types for Keys

Due to the way persistent handles database primary keys, servant-elm (via the underlying elm-bridge) currently generates Elm code with types such as Key World instead of WorldId, where, for Sqlite, ID variables are of type Int, used for database primary keys. 

But what is Key World in terms of Elm types? And what happens when you try to sequence a JSON encoder / decoder for Key (which doesn't exist!), along with a JSON encoder / ecoder for World -- something altogether different from the desired Int

type alias User  =
   { userEnv: (Key World)
   , userName: String

    [...]

jsonEncUser : User -> Value
jsonEncUser  val =
   Json.Encode.object
   [ ("userEnv", (jsonEncKey (jsonEncWorld)) val.userEnv)
   , ("userName", Json.Encode.string val.userName)
 
    [...]


There's some discussion of handling phantom types in elm-bridge, as well as some existing support for custom type replacements in the elm-bridge libraryTo get it working, I ended up needing the custom type replacements, as well as some direct manipulation of the generated Elm code. (I also found some fairly extensive replacement ideas here).


That takes care of the instances of Key World and the like in Elm type definitions, but not in the API call functions, and also introduces a new problem of missing definitions for Id type encoders/decoders. 

type alias User  =
   { userEnv: WorldId
   , userName: String
   [...]

jsonEncUser : User -> Value
jsonEncUser  val =
   Json.Encode.object
   [ ("userEnv", jsonEncWorldId val.userEnv)
   , ("userName", Json.Encode.string val.userName)
   [...]

getWorldByWid : (Key World) -> (Result Http.Error  ((Maybe World))  -> msg) -> Cmd msg
getWorldByWid capture_wid toMsg =
    [...]

There may be some way to do replacements in a more integral way with the Template Haskell codegen, but I ended up just doing appends and naive text replacements on the final Elm output: https://github.com/tkuriyama/spacemonkey/blob/master/spacemonkey/scripts/PostCodeGen.hs


2. Missing URL String Encoders

I also couldn't get servant-elm to generate correct URL string encoders, despite defining instances of toHttpApiData, which seems like it should be the thing to do

 
        Http.request
            [...]
            , url =
                Url.Builder.crossOrigin "http://localhost:8080"
                    [ "cellColor"
                    , (capture_worldid |> String.fromInt)
                    , (capture_x |> String.fromInt)
                    , (capture_y |> String.fromInt)
                    , (capture_color |> String.fromInt)
                    ]
             [...

... where Color is defined on the Haskell side as:

data Color
  = White
  | Yellow
  | Red
  | Green
  | Blue
  | Grey
  deriving (Show, Read, Eq, Generic, Enum, Bounded)

instance FromHttpApiData Color where
  parseUrlPiece :: T.Text -> Either T.Text Color
  parseUrlPiece = myParse

instance ToHttpApiData Color where
  toUrlPiece = T.pack . show
  toQueryParam = T.pack . show

So for each sum type instance, the post-code-gen script from above also takes care of the string encoders for use in URL building, appending the functions to the generated Elm code.

 

Wrapping Up

With the above edits, the Elm code gen + persistence layer stack works again. There was a lot more research and work(arounds) required this time, and I needed some help from friends at the Recurse Center, but it still (mostly!) satisfies the property of writing data definitions only once for the whole stack. It's not hard to see that some of the type mapping issues across languages and data boundaries are non-trivial to resolve, so I hope more usage and development goes into these kinds of type-safe stacks. 


Monday, November 9, 2020

Hello Server + Persistent

 The previous post covered setting up a Servant server in Haskell, with automatic codegen for Elm bindings and using the acid-state package for persistence. 

An obvious benefit of the stack is that one only ever needs to think in native Haskell data structures. It's relatively easy to use and achieves the goal of minimizing work across data boundaries (in this case, between the server and persistence layer). It's also self-contained, insofar as it doesn't require an external database server / process / engine.

On the other hand... many external database servers have been developed and refined for years -- if not decades -- yielding stability, tools, and all sorts of best practices and optimizations. 

Yesod's answer to the problem is persistent, which provides a type-safe and unified way to interact with a variety of database backends (primarily Sqlite, MySQL, Postgres, and MongoDB). Although the library was designed for Yesod, it is not tightly coupled and seems to work for Servant.

This file contains the data definitions for a revised version of Hello Server.  persist also uses Template Haskell, so we adapt the data declarations accordingly:

share
[mkPersist sqlSettings , mkMigrate "migrateAll"]
[persistLowerCase|
ServerState
state String
counter Int
deriving Eq Show Generic
|]

Note that the servant-elm template invocations come later and works as expected -- i.e. first, persistent generates the ServerState data type and associate code it needs; then, servant-elm uses the generated ServerState data type for its purposes. 

This file contains the server instance, following the example in the Servant repo. 

And, with a few small changes on the Elm client side to accommodate API tweaks, it works!

There are a number of fundamental things I still haven't figured out (like... shouldn't the Sqlite connection pool be explicitly closed in the bracket pattern?). Overall, there are more things to think about compared to working with acid-state -- which seems reasonabl, since it corresponds to the relative power of the database engine (or at least, I assume so). Leveraging the "migrations" feature from persistent, though, I've still only written a single data definition in Haskell for the entire stack!

The repo with all examples: https://github.com/tkuriyama/spacemonkey





Thursday, November 5, 2020

Hello Server + Acid-State

 The previous post covered defining a Servant server with automatic code generation for Elm bindings.

Are there libraries that allow for type-safe persistence in Haskell?

Again referring to the state of the Haskell ecosystem, one interesting library is acid-state, which provides ACID guarantees for de/serializing (serializable) Haskell data structures directly, with no external DBMS.

acid-state also uses Template Haskell, and it's relatively straightforward to augment the prior Hello Server data definitions. We define some simple operations we want (query, write), as well as the template invocations:

    $(deriveSafeCopy 0 'base ''ServerState)

    $(makeAcidic ''ServerState ['writeState, 'queryState])

On the server side,  this file has the updated logic to interact with the data store. Since acid-state uses monadic state, we can use ReaderT and again follow the Servant tutorial to thread the data state through the server calls. Although it doesn't seem strictly necessary given the acid-state guarantees, we can use the bracket pattern from Control.Exception to ensure that the data store is closed normally in the event of unexpected server failure.

That's pretty much it! As promised by the stack, generating the Elm code is just a matter of changing some filename configs in the code. In this case, the REST interface didn't change, so changing the backend had no impact on the rest of the client code. 

I haven't thought too much yet about the implications of this approach vs using some typed-SQL library like persistent, which I intend the try next.



Hello Server: Haskell + Elm Across Data Boundaries

 I started with the thought that it would be fun to write a small, private social network app in Elm, for family and friends to share. This immediately begs the question: what should the backend be? 

Setting aside for the moment the class of problems associated with running infrastructure with real users and data on the internet (e.g. security, availability, load, etc), there is the problem of type consistency and safety.

Maintaining safe, static data typing across data boundaries (say, client and server, or server and persistence layer) seems inherently tricky. Ideally, one wants to define data types (or data structures) somewhere,  exactly once, and have some abstractions and / or tools that propagate the type guarantees throughout the stack. That would safe effort, minimize boilerplate, and prevent typ(e)o-related errors!

The Elm ecosystem actually has Lamdera, whose the promise is that the server and persistence layers are abstracted away, leaving the developer to reason only about Elm data structures and what data maintain logically in the front vs back end. That's pretty much exactly what I want in this instance! Sadly, it remains in alpha, and doesn't appear to be open source for the moment.

Browsing through the State of the Haskell ecosystem, there are a variety of options for running servers and handling persistence in Haskell (the most obvious fit with Elm). Some of them look like they need project status updates (like Spock... people want to know the status, as it's been dormant for ~three years!). But of course I'm not the first to encounter this problem... so there are a few promising libraries. 


Haskell Server + Elm code gen + no persistence

The first is servant-elm, which wraps the elm-bridge library for use with the Servant webserver. 

Following Servant's tutorial and the servant-elm README on Github, we can:  

  1. Define ADTs of our choice in Haskell, with corresponding typed REST APIs for Servant
  2. auto-generate Elm code for corresponding data, including JSON serializers / deserializers that match the REST API (!)
This file defines the Haskell ADTs and API endpoints, with the Template Haskell line that servant-elm needs: deriveBoth defaultOptions ''ServerState

This file then defines the server logic, with an IORef to maintain some simple in-memory state.

Finally, this file actually generates the Elm module (well, the binary that generates the Elm module).

The resulting Elm code works as expected, albeit with some long function names!

To actually use the Elm client with localhost, I needed to start a temporary version of Chrome with some web security options disabled to account for CORS (borrowed from here).


Next post... adding persistence.





Monday, November 2, 2020

Parser Combinator for SQL Subset

Some more records from RC... a fellow Recurser and I presented on our toy SQL parser combinator at the Friday 5-minute talks (talk slides).

We actually ended up writing parser combinators from scratch twice during RC (first for CIS194 and again for fp-course). The two had almost identical primitives like satisfy :: (Char -> Bool) -> Parser Char, zeroOrMore, oneOrMore, and building up from there... which makes me wonder if there's just a standard recipe for parser combinators? 

For the toy SQL parser, we ended up referencing how Stephen Diehl uses Parsec in Kaleidescope once we defined the syntax. It turns our that Parsec provides a bunch of primitives we previously wrote ourselves, like brackets, parens, commaSep... But I think it would have taken us much longer to figure out how to use the package, were it not for Diehl's examples.

https://github.com/tkuriyama/toydb/blob/master/src/Database/Parser.hs


*Database.Parser> process "Insert Into myTable [1, True, \"Jonathan\"];"

Insert "myTable" [FvInt 1,FvBool True,FvText "Jonathan"]


*Database.Parser> process "Insert Into myTable [1, True, \"Jonathan\"]"

(line 1, column 42):

unexpected end of input

expecting ";"


*Database.Parser> process "Insert myTable [1, True];"

(line 1, column 8):

unexpected "m"

expecting "Into"


12 Weeks of RC

 Summary of my 12 weeks at Recurse Center (https://www.recurse.com/).


It was, unequivocally, a fun and intellectually rewarding experience that I'd recommend to folks interested in programming -- at pretty much any stage of life or career. It's free to attend, for the moment virtual, and available in 1, 6, and 12 week formats).


  • Organized a daily study group for functional programming (mainly Haskell)
  • Attended weekly study groups for systems programming and type theory
  • Learned about Elm and wrote some visualizations (sudokurisk)
  • Learned about networking primitives and wrote some buggy toy servers
  • Contributed to a friend's Haskell game (networked and concurrent!), soon to be internet-playable
  • Spent some time with property-based testing, test coverage, and TCR == (Test && Commit) || Revert
  • Procrastinated on writing computer-generated music (small example sonifying the Collatz sequence)
  • Started a group project to write a toy database in Haskell (with a toy SQL parser combinator and B-Trees)
  • Attended 6-12 technical and non-technical a week (most of them 5 mins); gave five 5-min talks
  • Attended a great talk on Turing's '36 paper on computation; presented on Dijkstra's '74 essay on scientific thought
  • Wrote a few thousand lines of code -- the below table is an inaccurate though representative summary (using tokei)
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                       8         1169          935           31          203
 C Header                1          229           64          122           43
 Elm                    12         1322         1061           43          218
 Haskell                49         5231         2232         2327          672
 Org                     3          236          155            0           81
 Python                  9          743          585           23          135
 Shell                   6           16           13            1            2
 Zig                     3           56           41            0           15
 Zsh                     1           26           20            1            5
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             4          191            0          134           57
 |- Python               4          722          623           21           78
 (Total)                            913          623          155          135
===============================================================================
 Total                  96         9941         5729         2703         1509
===============================================================================

Tuesday, October 20, 2020

Network Programming

We decided to a bit of socket-level network programming as part of the systems programming study group at Recurse Center. The canonical text would appear to be the Unix Network Programming book, but Beej's guide also came highly recommended -- with the added benefit of being quicker to get started and freely available online: 
It turns out that following the correct order of syscalls per Beej's incantations is relatively straightforward... yielding very simple toy servers in short order:
On the other hand, some fundamentals of C programming still elude me, e.g. I couldn't work out how to refactor the creation of sockets into a standalone function due to some (presumably basic) pointer issues...

Monday, October 19, 2020

fp-course

Following the Learn Haskell recommendation, the FP study group at Recurse Center  worked through the fp-course exercises after finishing CIS 194. 

https://github.com/bitemyapp/learnhaskell

The exercises seem to provide good practice, with modules like State and StateT requiring some mental gymnastics training..! 

The repo unfortunately has some incomplete tests files, so I transcribed a few from the comments into HSpec format for Parser.hs and MoreParser.hs: https://github.com/tkuriyama/puzzles/tree/master/fp-course

The test cases were tedious but turned out to be worthwhile, since JsonParser.hs is comprised of more abstract parser combinators. (Interestingly, the parser combinator in CIS 194 is almost identical to fp-course in its building blocks... a coincidence, or a common design pattern?).

Anagrams and Cheque remain to be completed, but we're moving on to writing a toy database in Haskell (with the goal to complete in the last two weeks of the Fall 1 batch!).


Sunday, October 11, 2020

Game of Risk Elm Visualization

Building on the previous post... an interactive(ish) Elm visualization of the game's probability trees: 

https://tarokuriyama.com/projects/risk.php


Thursday, September 17, 2020

Game of Risk

The last assignment in the CIS194 Haskell class is to simulate dice-throwing battles in the game of Risk. The combat rules, taken from assignment specs, are:

  • There is an attacking army (containing some number of units) and a defending army (containing some number of units). 
  • The attacking player may attack with up to three units at a time. However, they must always leave at least one unit behind. That is, if they only have three total units in their army they may only attack with two, and so on. 
  • The defending player may defend with up to two units (or only one if that is all they have). 
  • To determine the outcome of a single battle, the attacking and defending players each roll one six-sided die for every unit they have attacking or defending. So the attacking player rolls one, two, or three dice, and the defending player rolls one or two dice. 
  • The attacking player sorts their dice rolls in descending order. The defending player does the same.
  • The dice are then matched up in pairs, starting with the highest roll of each player, then the second-highest.
  • For each pair, if the attacking player’s roll is higher, then one of the defending player’s units die. If there is a tie, or the defending player’s roll is higher, then one of the attacking player’s units die.
We assume it is optimal to always attack or defend with as many units as possible.


Given an initial number of attacking and defending units, what is the expected outcome (say, the probability that the attacker wins)? 

It's not obvious that there is a closed-form expression, but we can find programmatically the exact probabilities. 

First, from the game specs, we know that there are only a few primitive battle patterns (A, D), where A is attacked and D is defender: (1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2). For each pattern, we can generate all possible permutations of die values. Each permutation represent a battle, with an outcome ("loss scenario") in which A and D both lose some number of units (which can be zero units, if they win all dice rolls).  

For example, for (A,D) of (1,1), both the attacker and defender role one die each, yielding 36 pairs of possible die rolls: (1,1), (1,2), (1,3) ... (6,5), (6,6). Tallying up the proportion of outcomes: A loses 1 unit 21 / 36; D loses 1 unit 15 / 36.

(Incidentally, the closed-form expression for 1 vs 1 is relatively simple: for each N that the defender rolls, with 1/6 probability each, the attacker loses in N/6 scenarios. For example, if the defender rolls 4, the attacker loses for all die values <= 4. So we can evaluate the expression (1/6) * (sum [1..6]) / 6, which equals 21/36. See these notes for a more detailed explanation.)

Once we have the loss scenarios with proportions for each primitive battle pattern, finding the exact win probability for the attacker is recursive: for any given starting pattern (A, D), lookup the relevant primitive battle pattern. For each loss scenario in which the attacker does not lose the game, update the battlefield with the losses and recursively lookup the next loss scenario, chaining the proportions (probabilities) until the game ends for each branch. 

The win probabilities for the attacker for patterns up to (10, 10) are:



Using Rational numbers in Haskell allows the answers to be expressed exactly. For example for (4,2), the win probability for A is 6610505 % 10077696 (see the rightmost column here). 



Friday, September 11, 2020

Trying TCR

A fellow at the Recurse Center hosted a workshop on TCR (see Kent Beck's post for an introduction to the idea), using these templates

The short of it is: a watcher will automatically run tests upon file save; if any tests fail, it will revert to the prior commit; else, it will make a new commit. 

Despite some initial mental (and practical workflow) resistance, I've warmed very rapidly to the idea. 

By far the biggest change is the revert part of test && commit || revert. Some initial complaints:

  • I made a typo, and it revert to the last commit
  • I just lost a whole block of code
  • I can't just try this simple thing because it keeps failing a test and getting deleted 
The first one maybe retains some legitimacy, but in general, the idea is: 
  • slow down
  • write fragments that you understand
  • write smaller fragments if necessary 
The process also highlights the fact that tests may evolve with the code; as an interface changes, so too do the tests need to change (or TCR won't let you make the commits!).


A few implementation notes for macOS users, building on the Python example from these templates

This gist contains the edits discussed below.

watch.sh
  • Instead of inotifywait, on macOS fswatch can be installed from Homebrew (brew install fswatch)
  • fswatch should be called with the --one-event flag, since it's already called in an infinite loop (otherwise it blocks the next line)
commit.sh
  • By default, TCR will auto-commit with the same message; as one possible alternative, call AppleScript and ask for a commit message via a prompt
    

test.sh
  • The default example has some assertions in the main Python file; modify this to call a testing framework instead (in my case, py.test)


There's a good video here of doing TCR-style testing in Elm. An interesting aspect they mention is squashing a bunch of TCR-generated commits ("what changed") at a later point into more of an explanatory commit ("why the changes").


P.S. for best results with editors, autosave features should be disabled, and autoreload features should be enabled, e.g. in emacs: 

(global-auto-revert-mode t)
(setq auto-save-default nil)



Friday, August 28, 2020

Sudoku Visualization

The Sudoku solver visualization in Elm is done! 

https://tarokuriyama.com/projects/sudoku.php

Writing Elm feels similar to writing Haskell, with a bit of syntax that's more like F# (e.g. the pipeline operators <| and |>). Between the familiar format of Learn You and Elm and the examples in Beginning Elm, it's relatively easy to pick up given a bit of functional programming background.

All of the code for the solver and visualization is on GitHub.








 

Wednesday, August 26, 2020

Elm JSON Decoder

In the vein of functional programming, I've been learning Elm to visualize the progress of a Sudoku solver.

The JSON decoding tutorials are clear, and the JSON.Decode library is well-documented once you know how to read it (in general, the library documentation in Elm seems to be on the terse side). 

Two things were not very clear from initial read of the docs: how to parse lists to tuples, and how to parse strings to custom types. 

Leaning on SO for the below example,  tuples apparently used to be supported as first-class decoders, but in the current version of Json.Decoder, index seems to be the official way to construct decoders for tuples. 

arrayAsTuple2 a b c =

    Decode.index 0 a

        |> Decode.andThen (\ aVal -> Decode.index 1 b

            |> Decode.andThen (\ bVal -> Decode.succeed (aVal, bVal)))            

boardDecoder : Decoder Board

boardDecoder = Decode.list <| arrayAsTuple2 Decode.int Decode.int 

Elm 0.19 only supports up to three-value tuples (somewhat surprising coming from Haskell, though with records and custom data types, the decision to limit unnamed tuples has a certain logic...). So the helper construction isn't too bad.


Parsing strings to custom data types turns out to be more straightforward, though a bit repetitive (especially if you need to later get the string representation back out of custom data type -- there is no deriving Show in Elm). First decode the string, then chain a pattern match using andThen:

data Transform = Rows | Cols | Boxes

transformDecoder : Decoder Transform
transformDecoder =
    Decode.string
    |> Decode.andThen
       (\ s ->
            case s of
                "Rows" -> Decode.succeed Rows
                "Cols" -> Decode.succeed Cols
                "Boxes" -> Decode.succeed Boxes
                _ -> Decode.fail <| "Unknown transformation.")

Saturday, August 22, 2020

Recurse Center

 I've wanted to attend the Recurse Center (formerly known as Hacker School) or a while, and have finally found the time to do so. It's very consistent with the free education path (it's free!), and so far has been a great source of exposure to new topics and ideas, as well as of inspiration to work on different classes of problems. They've put a lot of effort into fostering a healthy, supportive community, and the nature of the program is such that self-motivated, inquisitive learners more or less self-select themeselves into attendance. Some topics I was exposed to or participated in so far -- audio programming, gaming, lower-level / systems programming languages (eg Zig), functional programming (eg Haskell, Elm), procedural generetion, Web Assembly compilation... There are folks of all walks of life, skills, and interests in the community, making it something that I expect will continue to yield rich experiences for the remainder of the three-month batch. 

Wednesday, July 29, 2020

Stochastic Writing

Bang Bang Con 2020 features a number of pithy, unexpected talks.

Following Andrew Yoon's talk about his work on BML (a "stochastic markup language"), here's a short poem that changes under the reader's eyes... 





Sunday, June 21, 2020

What goes on in a computer in a millisecond?

In a 2001 Q&A session (around 1h30m) Knuth is asked about unsolved challenges in the field of compute science, à la Hilbert's problems. Knuth's responds by referring to a prompt he posed to the World Computer Congress in 1989, in a keynote titled "Theory and Practice, IV" (Arxiv has a version, though this version appears to have the slides as well):

"Make a thorough analysis of everything your computer does during one second of computation"

Does the prompt require adjustment for context, some thirty years later? Knuth noted the practical challenges associated with monitoring and interpreting the computer's activities, whose difficulty would appear to remain on the same order of magnitude today. On the other hand, Knuth associated 1 second with approximately 250,000 instructions. Given that CPU clock speeds have jumped from standard units in MHz to GHz (wikipedia notes clock speeds in 1989 were in the 16-35 MHz range, in line with Knuth's approximation), maybe today's equivalent of the question is:

"What goes on in a computer in a millisecond"?

In the same keynote, Knuth provides a set of guiding questions, including the correctness of the observed activities and their correspondence to theory -- that is, whether they reflect existing theory, and whether new theory would yield practical improvements.

Despite the daunting scope of the problem, the fundamental idea of mapping theory to some atomic observations of computing practice seems widely applicable and useful for students and practitioners alike.

Sunday, June 7, 2020

Compilers, Project Euler 100

It was sad to see Stanford retire Lagunita (apparently due to a decision to retire their Open edX implementation). Many of their classes appear to be slated for addition to edX, including the Compilers class -- at least according to an email sent to Lagunita participants. I seem to recall the content originally being available on Coursera. It certainly has the look, feel, and substances of the original batch of very high quality courses (like Tim Roughgarden's Algorithms courses, Andrew Ng's AI course, Dan Boneh's Cryptography Part I... we're still waiting on Part II after many years, apparently because he is trying to finish his book). There seems to be a great deal more of content on the MOOCs now, but also maybe some more noise.

In any case, I spent most of my study time in March speeding through the Compilers videos and finishing the first programming assignment on lexing (yacc / flex) to make sure all the mechanics work. Fortunately, there are offline graders that can be run in the class VM, so it will be possible to complete the other programming assignments at a more leisurely pace. Given that there isn't currently an active forum for unaffiliated students like me to ask questions, GitHub has been in an invaluable resource to reference other students' implementations and try to triangulate certain issues.

Speaking of using reference materials... seven years after starting Project Euler and learning about the Sieve of Eratosthenes, I've finally completed more than 100 problems. About 65 problems have been solved in Haskell over the past year (the balance at some point during prior years, in Python).


I had no qualms referencing the internet for many problems, since it's unlikely I would have independently stumbled upon many of the math theorems required to solve some problems efficiently (like the Sierpinski triangle and Lucas' Theorum behind one of the problems on Pascal's triangle). In that regard, I'm glad of folks who have ignored Proect Euler's rules of the road and posted their research online. At least for autodidactic purposes, it seems like more information is (almost?) always better. How to use and internalize the information, how to differentiate research from shortcuts, how to set boundaries of information taken vs internally synthesized -- those are all the responsibility of the student, with decisions ultimately depending on one's goals. My (recent) goals have been in rough descending order to learn functional programming paradigms, to learn Haskell, to enjoy problem solving, and last + least to learn some underlying math. As far as the first two go, continuing with more problems probably offers diminishing returns (especially compared to pursuing orthogonal problem spaces), but I'll likely continue with it on the slow path for the enjoyment. 

Sunday, February 16, 2020

Mini Regex


Hacker News linked to Brian Kenighan's post about a small regex matcher, which seemed like a good bite-sized problem to try implementing -- both in general and for practicing Haskell, particularly with this being a pattern matching problem.

Following Kernighan's specifications, the regex language is limited to:

    c    matches any literal character c
    .    matches any single character
    ^    matches the beginning of the input string
    $    matches the end of the input string
    *    matches zero or more occurrences of the previous character


To make the exercise a bit more stateful, I chose to implement find (return the left-most occurrence of a pattern that matches the regex, if it exists), instead of just indicating the presence of a match. There is minimum lookahead required -- just one character to detect boundaries for the Kleene star.

I first tried to implement find with a lazy map over the tails of the input (returning the head as the left-most match), but ran into the problem of inputs with no matches. Since I couldn't find an out-of-box function, mapUntil implements a map that stops when a non-default value is found, and otherwise returns the default value. This also has the benefit of using an empty string to indicate no match, which is more direct here than, say, using Maybe. Always wrapping and unwrapping the input with the start and end sentinels seemed like a reasonable price to avoid introducing special cases later on.


find :: Regex -> String -> Match
find r s = unwrap $ mapUntil "" (findWord r) (tails $ wrap s)
  where wrap xs = ('^':xs) ++ "$"
        unwrap xs = [c | c <- xs, c `notElem` ['^', '$']]

mapUntil :: Match -> (String -> Match -> Match) -> [String] -> Match
mapUntil d _ [] = d
mapUntil d f (x:xs) = if m /= d then m else mapUntil d f xs
  where m = f x ""


The core matching function covers four patterns: non-star regex, star regex, the regex is consumed (success), or otherwise failure.


findWord :: Regex -> String -> Match -> Match
findWord (x:'*':xs) s@(y:ys) ms = findWord xs ys' (reverse ms'++ ms)
  where (ms', ys') = if not (match x y) then ("", s) else splitWhile (matchStar x xs) s
findWord (x:xs) (y:ys) ms = if match x y then findWord xs ys (y:ms) else ""
findWord "" _ ms = reverse ms
findWord _ _ ms = ""

Match is very simple. MatchStar needs a bit of extra state, since '.*' matching should stop if the next part of the regex matches input.

match :: Char -> Char -> Bool
match x y = x == '.' || x == y

matchStar :: Char -> String -> Char -> Bool
matchStar x [] y = match x y
matchStar x (x':xs) y = (x == '.' && x' /= y) || x == y


SplitWhile seems like it should exist... it calls both takeWhile and dropWhile, so could be rewritten more efficiently to require only single pass over the input.


splitWhile :: (a -> Bool) -> [a] -> ([a], [a])
splitWhile f xs = (takeWhile f xs, dropWhile f xs)


There are a few reverses and concatenations, which weren't immediately obvious to me to refactor into more efficient patterns. Code with some hand-written test cases: https://gist.github.com/tkuriyama/f60210a0e6dfa5bc0872494afaa0030b