Migrating JSON with lens and foci

By Zhouyu Qian

Published February 9, 2018

Updated February 12, 2018

Prelude

Like many other companies, Capital Match uses JSON as its preferred format for serializing data. But what is perhaps slightly surprising is that at Capital Match, JSON is not only used for serializing data in transit (that is, over an HTTPS connection to the client), but also for serializing data intended for permanent storage.

This means that as we change the Haskell data types, old data stored as JSON needs to be migrated. There are quite a few approaches here. The easiest and most straightforward approach is probably just to store a version number together with the data, and then have the deserialization routine read the version number and decide what to do.

Here is a simple example where the migration happens in the FromJSON instance. Suppose we have this data type:

data Widget = Widget
  { widgetId :: Int
  , widgetSize :: Int
  , widgetName :: String
  } deriving Show

Now, we would like to make this data type serializable using this approach. We would write the serialization code like this:

instance ToJSON Widget where
  toJSON Widget {..} =
    object
      [ "widgetId" .= widgetId
      , "widgetSize" .= widgetSize
      , "widgetName" .= widgetName
      , "version" .= Number 1
      ]

instance FromJSON Widget where
  parseJSON =
    withObject "Widget type" $ \o -> do
      version <- o .: "version"
      when
        (version > (1 :: Int))
        (fail "Unexpected object version when parsing Widget")
      Widget <$> o .: "widgetId" <*> o .: "widgetSize" <*> o .: "widgetName"

Suppose now that in the future we decide to add a new field, say, widgetDescription of type String. We also need to decide on a policy for handling descriptions for old Widgets. Suppose that business rules require us to provide a generic description saying “This is widget x” where x is the widget ID (this is certainly a contrived example: in many cases using Maybe could be a better and more natural choice to represent missing data and also get automatic support in TH-generated FromJSON instances). This is no big deal. We bump the version to 2, add a line in the ToJSON instance, and write the deserialization code as follows:

instance FromJSON Widget where
  parseJSON =
    withObject "Widget type" $ \o -> do
      version <- o .: "version"
      widgetId <- o .: "widgetId"
      widgetSize <- o .: "widgetSize"
      widgetName <- o .: "widgetName"
      case version of
        (1 :: Int) ->
          let widgetDescription = "This is widget " ++ show widgetId
          in pure Widget {..}
        2 -> do
          widgetDescription <- o .: "widgetDescription"
          pure Widget {..}
        _ -> fail "Unexpected object version when parsing Widget"

This approach seems promising, and the code we write seems clear, but there are actually a few issues in the long run:

  • First, notice that for each version, we have essentially separate code paths to construct the Widget object in its final and latest form. This means that every time we change the definition of the Widget type, we need to modify the deserialization code for every existing version. The worst part is that instead of thinking about how things should change from one version to the next, we need to think about how every previous version should change to reach the latest version.

  • Second, in practice, after just a handful of changes of the type, the migration from the earliest type to the final Widget would look totally incomprehensible. And the code does not really express the intent of each version either. The change from version 1 to version 2 added a field of a certain type, but when there are more versions, it quickly becomes a tiring game of manually diff’ing the separate code paths to figure out their differences.

  • Third, notice that we can only deserialize JSON of an old version into the eventual type: there is no way at all to inspect the intermediate versions. It may seem that there is no real need to get back to an intermediate version, but practice suggests that this is most helpful when writing new migration code. If there are currently ten versions of Widget and you are writing the eleventh, you will need to look at all ten previous versions to figure out the best way to migrate each.

  • Finally, of course, the issue is that we still need to write the ToJSON and FromJSON instances manually. For data types with many fields, it is more desirable to have the compiler write the code for us. Accidentally writing widgetId <- o .: "widgetSize" is quite easy.

Capital Match used to have some issues with migrations. As I mentioned in a previous post, the approach we used to have was worse: it was similar in spirit to the above, except that we did the transformation not inside the FromJSON instance, but outside, in a separate function Value -> Widget with partial functions liberally sprinkled throughout. Our experience tells us that the above approach is simply unsustainable in the long run, even after we fixed our use of partial functions.

After some incremental improvements, as well as some other refactoring that better separates out the migration aspect of deserialization, we finally finished converting all of our migration code to a much better style in October 2017. The rest of this article explains our design.

How Capital Match does migrations now

The crux of the new approach is to define each migration using types, instead of merely being embedded inside parsing code. We have MigrationPart and VersionRangeMigration types for this.

-- | A 'MigrationPart' specifies how and where to modify a 'Value'.
data MigrationPart = MigrationPart
  { targetDesc :: String
    -- ^ A description of the migration part
  , targetOp :: Value -> Parser Value
    -- ^ An alteration of the 'Value' at the positions given by 'targetFoci'.
  , targetFoci :: Foci' Value
    -- ^ All the positions in the 'Value' at which to apply 'targetOp'
  }

-- | A 'VersionRangeMigration' specifies the range of versions for which a
-- migration applies.
data VersionRangeMigration =
  VersionRangeMigration
  { smFrom :: Int -- ^ the version inclusive from which this migration applies
  , smTo :: Int -- ^ the version inclusive up to which this migration applies
  , smPart :: MigrationPart
  }

A MigrationPart is simply a unit of migration; it has a description (so that error reporting is clearer and easier to understand), an operation to perform, modeled as Value -> Parser Value, and foci, or places at which to perform the migration. A VersionRangeMigration is then a MigrationPart tagged with the range of versions to which it applies.

The Parser type above is exported from Data.Aeson.Types. Perhaps confusingly, the aeson library deals with two distinct kinds of parsers: the first is the one that converts a ByteString into a Value, which is primarily found in Data.Aeson.Parser and built on the attoparsec library by the same author; the second is the one that converts an already-parsed Value into a user-defined type. This second kind is the one relevant in this article.
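
To make the distinction concrete, here is a minimal sketch of the two stages run one after the other. The names rawToValue, valueToInts and demo are made up for this illustration; eitherDecode and parseEither are the usual aeson entry points for each stage.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (Value, eitherDecode, parseJSON)
import Data.Aeson.Types (parseEither)
import qualified Data.ByteString.Lazy.Char8 as BL

-- Stage 1: bytes to a generic 'Value' (the syntax-level parser).
rawToValue :: BL.ByteString -> Either String Value
rawToValue = eitherDecode

-- Stage 2: an already-parsed 'Value' to a concrete type, run through the
-- 'Parser' from Data.Aeson.Types.
valueToInts :: Value -> Either String [Int]
valueToInts = parseEither parseJSON

-- Both stages chained; a migration would sit between them.
demo :: Either String [Int]
demo = rawToValue "[1, 2, 3]" >>= valueToInts

main :: IO ()
main = print demo -- Right [1,2,3]
```

A migration of type Value -> Parser Value fits between the two stages: it runs after the bytes have become a Value and before parseJSON turns that Value into the final type.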

An algebra of foci

The Foci' type mentioned above, on the other hand, is defined by us, together with a constellation of other types.

data Focus a b = Focus
  { focusMatchers :: [Matcher a]
    -- ^ the 'Focus' matches if all the 'Matcher's match
  , focusTraverse :: Traversal' a b
    -- ^ the position at which to modify
  }

data Matcher a = forall m. Matcher (Traversal' a m)

type Focus' a = Focus a a

We first have the idea of a Focus. It is essentially a plain old Traversal' together with a list of preconditions represented by Matcher. The Traversal' inside is only applied when everything in that list matches; otherwise it is simply ignored. The purpose of having a list of matchers is to deal with sum types effectively: if a sum type has three alternatives, and only one of them is affected by a migration, we simply check a tag field in the Value to determine whether or not this Value needs to be migrated. Sometimes more complicated checks are needed, hence the list, but usually a single matcher suffices. So we have a convenient constructor function for this case:

focus :: Traversal' Value m -> Traversal' Value Value -> Focus' Value
focus m = Focus [Matcher m]

We have the ability to determine whether or not a focus matches:

matchFocus :: Focus a b -> a -> Bool
matchFocus Focus{..} val = all (`runMatcher` val) focusMatchers
  where
    runMatcher :: Matcher a -> a -> Bool
    runMatcher (Matcher m) = has m

And we also have the ability to “run” a focus:

traverseFocus :: Applicative f => (b -> f b) -> Focus a b -> a -> f a
traverseFocus f focus a = traverseOf possiblyIgnored f a
  where possiblyIgnored =
          if matchFocus focus a
          then focusTraverse focus
          else ignored

We can also compose them, given two with matching types. Indeed they form a category:

-- import qualified Control.Category as Cat
instance Cat.Category Focus where
  id = Focus [] id
  (.) = zoomFocus

zoomFocus :: Focus b c -> Focus a b -> Focus a c
zoomFocus g f =
  Focus
  { focusMatchers = focusMatchers f ++ map (\(Matcher m) -> Matcher $ focusTraverse f . m) (focusMatchers g)
  , focusTraverse = focusTraverse f . focusTraverse g
  }

This means that we compose two Focus values by applying the first (small) Focus at the zoomed-in position of the second (big) Focus. The semantics of the resulting Focus are such that:

  • The new focus matches if the big focus matches, and the small focus matches when zoomed in at the focusTraverse of the big focus;
  • The new focusTraverse is simply “zooming in”: the big focusTraverse composed with the small focusTraverse.

We also have Semigroup and Monoid instances for Focus' a.

The final piece of the puzzle is the Foci type. It is nothing more than just a list of Focus.

newtype Foci a b = Foci { unFoci :: [ Focus a b ] }

type Foci' a = Foci a a

And here’s the function to “run” foci:

traverseFoci :: Monad f => Foci a b -> (b -> f b) -> a -> f a
traverseFoci (Foci foci) f = foldMapKleisli (traverseFocus f) foci

foldMapKleisli :: (Foldable t, Monad m)  => (b -> a -> m a) -> t b -> a -> m a
foldMapKleisli = foldMapBy (>=>) return
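
For readers who do not have foldMapBy at hand, the following base-only stand-in may help show what foldMapKleisli does: it chains one Kleisli arrow per list element, left to right, and stops at the first failure. The primed name foldMapKleisli' and the toy step addCapped are assumptions made for this sketch; it is intended to agree with the definition above for lawful monads, but it is not the code we use.

```haskell
import Control.Monad ((>=>))

-- Base-only stand-in: build step x1 >=> step x2 >=> ... >=> return.
foldMapKleisli' :: (Foldable t, Monad m) => (b -> a -> m a) -> t b -> a -> m a
foldMapKleisli' step = foldr (\b rest -> step b >=> rest) return

-- Toy step: add n to the accumulator, failing once the total exceeds 10.
addCapped :: Int -> Int -> Either String Int
addCapped n acc
  | acc + n > 10 = Left ("overflow while adding " ++ show n)
  | otherwise = Right (acc + n)

main :: IO ()
main = do
  print (foldMapKleisli' addCapped [1, 2, 3] 0) -- Right 6
  print (foldMapKleisli' addCapped [5, 6, 7] 0) -- Left "overflow while adding 6"
```

In the migration setting, the list elements are Focus values (or MigrationParts) and the accumulator is the Value being migrated, so a failing step aborts the whole chain.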

Composing Foci is just the lifted version of composing their individual Focus:

instance Cat.Category Foci where
  id = Foci [ Cat.id ]
  (.) = zoomFoci

zoomFoci :: Foci b c -> Foci a b -> Foci a c
zoomFoci (Foci gs) (Foci fs) = Foci (liftA2 zoomFocus gs fs)

This may seem abstruse, but the primary motivation is nested data types: if a data type A contains a data type B needing migration, the easiest way is to compose the outer foci, containing locations of where B exists within A, together with the inner foci containing locations of where migrations need to happen within B.

All of the above code is available in a single file here. While at Capital Match we haven’t found any other use beyond our own migration system, I imagine it might potentially be useful for other kinds of term rewriting or syntax tree transformation; after all migrating JSON is just a kind of syntax tree transformation. If that is the case, we would definitely make it a proper library and keep maintaining it.

Finally, we just need a function to tie everything together. Here’s a small function that does just that:

runApplicableMigrations :: FromJSON a => [VersionRangeMigration] -> Int -> Value -> Either String a
runApplicableMigrations allMigrations versionToParse =
  parseEither (parseJSON <=< foldMapKleisli targetToParser applicableMigrationParts)
  where
    applicableMigrationParts :: [MigrationPart]
    applicableMigrationParts =
      smPart <$> filter (\VersionRangeMigration {..} -> smFrom <= versionToParse && versionToParse <= smTo) allMigrations
    targetToParser :: MigrationPart -> Value -> Parser Value
    targetToParser MigrationPart {..} = traverseFoci targetFoci targetOp

The function looks long, but conceptually it isn’t that complicated at all: it takes a list of VersionRangeMigrations, a version (which means that the version need not be stored within the Value at all), a raw Value and then does the migration.
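
The version-range selection can be sketched in isolation, leaving foci out entirely. Everything named below (Step, overFields, addDescription, addColor, migrate, and the widgetColor field) is invented for this illustration; it only assumes aeson, containers and text, and it edits objects through a Map so that it does not depend on how a particular aeson version represents Object.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad ((>=>))
import Data.Aeson (Value (..), object, parseJSON, toJSON, (.=))
import Data.Aeson.Types (Parser, parseEither)
import qualified Data.Map.Strict as Map
import Data.Text (Text)

-- Simplified stand-in for 'VersionRangeMigration': no foci, so every
-- operation is applied at the root of the 'Value'.
data Step = Step
  { stepFrom :: Int -- first version (inclusive) the step applies to
  , stepTo :: Int -- last version (inclusive) the step applies to
  , stepOp :: Value -> Parser Value
  }

-- Hypothetical helper: edit an object's fields through a 'Map'.
overFields :: (Map.Map Text Value -> Parser (Map.Map Text Value)) -> Value -> Parser Value
overFields f v = toJSON <$> (parseJSON v >>= f)

-- Version 1 -> 2: derive a description from the widget ID.
addDescription :: Value -> Parser Value
addDescription = overFields $ \fields -> do
  wid <- maybe (fail "missing widgetId") parseJSON (Map.lookup "widgetId" fields)
  let desc = toJSON ("This is widget " ++ show (wid :: Int))
  pure (Map.insert "widgetDescription" desc fields)

-- Version 2 -> 3: a made-up field, only here to provide a second step.
addColor :: Value -> Parser Value
addColor = overFields (pure . Map.insert "widgetColor" Null)

steps :: [Step]
steps = [Step 1 1 addDescription, Step 1 2 addColor]

-- Keep the steps whose range contains the stored version; run them in order.
migrate :: Int -> Value -> Either String Value
migrate version = parseEither (foldr (>=>) pure applicable)
  where
    applicable = [stepOp s | s <- steps, stepFrom s <= version, version <= stepTo s]

widgetV1, widgetV2 :: Value
widgetV1 = object ["widgetId" .= (7 :: Int)]
widgetV2 = object ["widgetId" .= (7 :: Int), "widgetDescription" .= String "This is widget 7"]

main :: IO ()
main = do
  print (migrate 1 widgetV1 == migrate 2 widgetV2) -- True
  print (migrate 3 widgetV1 == Right widgetV1) -- True
```

Each step only describes the change from one version to the next, and a step written for version 1 keeps working when later steps are appended to the list.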

The function actually used by Capital Match is about twice as long, but only because the extra code is there to provide better error messages when things go wrong. Specifically:

  • When a migration fails, the error message contains the original Value, its version, and all migrations that have been attempted;

  • MigrationParts themselves can fail, in which case the error should be reported;

  • It is also possible that MigrationParts succeed, but the resultant migrated Value still could not be parsed, in which case the post-migration Value should be mentioned as well.

Our version simply adds possible error messages (using for example modifyFailure) at each of the above steps.

Example

A simple example of a migration adding a field might then be expressed as:

migrationAddFooToBar :: MigrationPart
migrationAddFooToBar = MigrationPart "add field foo to type Bar, default Null"
  (spliceKey "foo" Null) $ Foci [ focus _Object id ]

spliceKey :: T.Text -> Value -> Value -> Parser Value
spliceKey k v = withObject "spliceKey expecting Object" $ pure . Object . HM.insert k v

Here, spliceKey is a simple helper function to add a key/value pair to an existing Object. The expression focus _Object id refers to the Focus that expects the current Value to be an object using the prism _Object and that the alteration should be done with the current Value (id).

Updates inside deeply nested object structures are similarly effortless, thanks to lenses. Here’s an example where Bar is a sum type and the migration only applies to the Bar1 alternative.

migrationAddFooToBar :: MigrationPart
migrationAddFooToBar = MigrationPart "add field foo to type Bar, default Null"
  (spliceKey "foo" Null) $ Foci [ focus (tag "Bar1") (key "contents" . nth 0) ]

tag :: (Applicative f, AsValue t) => T.Text -> (() -> f ()) -> t -> f t
tag tagStr = key "tag" . _String . only tagStr

Here, within the matcher of the focus, we check for the tag of an object (simply the field named tag in the object), and within the traversal part, we look at the first element in the array contained in the contents field of the object, which corresponds to the first argument to a data constructor.
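
The tag and contents fields come from aeson's default encoding of sum types. As a small sketch, assuming Generic-derived instances with default options, the made-up types Inner and Bar below show the shape that the focus above is navigating.

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (ToJSON, Value (..), object, toJSON, (.=))
import GHC.Generics (Generic)

-- Hypothetical types, standing in for the article's Bar.
data Inner = Inner {innerId :: Int} deriving (Show, Generic)

data Bar = Bar1 Inner Int | Bar2 Int deriving (Show, Generic)

-- Generic instances using aeson's default options.
instance ToJSON Inner
instance ToJSON Bar

-- The constructor name goes under "tag"; with several constructor arguments,
-- they are collected in an array under "contents".
expectedBar1 :: Value
expectedBar1 =
  object
    [ "tag" .= String "Bar1"
    , "contents" .= [object ["innerId" .= (1 :: Int)], Number 2]
    ]

main :: IO ()
main = print (toJSON (Bar1 (Inner 1) 2) == expectedBar1) -- True
```

With this encoding, key "contents" . nth 0 lands on the object for Inner, which is where spliceKey would add the new field.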

It should be noted that the second argument to MigrationPart, the targetOp, has type Value -> Parser Value. This is mainly to give programmers the choice to extract data either in traditional aeson style (using .: for example) or in lens style. Usually the aeson style is more appropriate because it comes with automatic error reporting, but occasionally the lens style can be useful when the operation is particularly complicated.

Closing Remarks

After migrating the migration code to this new style, migration has not been a pain point in our code base any more. That’s definitely a good thing, but it sometimes also leads us to wonder whether using JSON as the permanent storage format (and thus needing migrations) is a good idea in the first place. Even while still keeping our event-sourcing architecture, alternatives such as safecopy are worth considering before developing an in-house solution. Of course, a traditional SQL database that handles almost everything for you should remain on your palette of choices as well, especially since the state of SQL bindings/DSLs has become much better in recent years.