Untangling semver with regular expressions

Last Sunday I was bored and had absolutely nothing to do, mostly because I recently hurt my knee in an accident and can’t move much while it’s healing. So I started looking for some fun random stuff to code, and remembered that some time ago at work I wrote a simple function to compare Semantic Versions. It only worked for the more basic cases: it didn’t handle invalid leading 0’s, pre-releases, or build metadata. So I thought “hey, gonna write a package that’s 100% compliant with the spec, that should be fun!”. It was!

The Semantic Version (also called semver) specification is laid out at https://semver.org/ and it’s reasonably simple. Versions are defined as strings with the format MAJOR.MINOR.PATCH-PRE-RELEASE+BUILD-METADATA, where the -PRE-RELEASE and +BUILD-METADATA parts are optional. The rules are as follows:

MAJOR.MINOR.PATCH: “A normal version number MUST take the form X.Y.Z where X, Y, and Z are non-negative integers, and MUST NOT contain leading zeroes”. https://semver.org/#spec-item-2.
PRE-RELEASE: “A pre-release version MAY be denoted by appending a hyphen and a series of dot separated identifiers immediately following the patch version. Identifiers MUST comprise only ASCII alphanumerics and hyphen [0-9A-Za-z-]. Identifiers MUST NOT be empty. Numeric identifiers MUST NOT include leading zeroes”. https://semver.org/#spec-item-9.
BUILD-METADATA: “Build metadata MAY be denoted by appending a plus sign and a series of dot separated identifiers immediately following the patch or pre-release version. Identifiers MUST comprise only ASCII alphanumerics and hyphen [0-9A-Za-z-]. Identifiers MUST NOT be empty”. https://semver.org/#spec-item-10.

Since these rules specify a rigid structure without any “balanced parentheses” part, we can say that semvers form a regular language, hence they can be expressed with a regular expression. Tackling each of the three rules separately, I came up with the following solution:

MAJOR.MINOR.PATCH: This one is pretty simple, just integers without leading zeroes. The regex (\d|[1-9]\d*)\.(\d|[1-9]\d*)\.(\d|[1-9]\d*) describes the pattern “INTEGER dot INTEGER dot INTEGER”, where \d|[1-9]\d* reads “a single digit OR a non-zero digit followed by any number of digits”. That ensures we don’t accept anything like “01.” (a leading 0).
PRE-RELEASE: That’s the hardest. Start with the -, then express the first (required) identifier with 0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]* which reads “a single 0 OR a valid character other than 0 followed by any number of valid characters OR any number of digits followed by one non-digit character followed by any number of valid characters”. The third alternative is what lets an alphanumeric identifier such as 01a start with a zero, since the leading-zero rule only applies to purely numeric identifiers. Finally use (\.(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*))* to denote any number of “dot followed by the same pattern used for the first identifier”. Both parts together result in (-(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*)(\.(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*))*)?.
BUILD-METADATA: \+[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)* reads “a plus sign followed by dot-separated identifiers” (the first one is required).
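
Since the pre-release identifier is the trickiest piece, it can help to poke at it in isolation. Here’s a small Node.js sketch that anchors the identifier pattern from above so it has to match a whole identifier. The names PRE_RELEASE_ID and isPreReleaseId are made up for this illustration and aren’t part of any library.

```javascript
// Hypothetical helper: the pre-release identifier pattern from the text,
// wrapped in ^...$ so it must match the entire identifier.
const PRE_RELEASE_ID =
  /^(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*)$/;

function isPreReleaseId(id) {
  return PRE_RELEASE_ID.test(id);
}

// Identifiers that should pass: a lone zero, a plain number, words,
// hyphens, and an alphanumeric identifier that happens to start with 0.
const accepted = ["0", "7", "alpha", "x-y", "01a"];

// Identifiers that should fail: numeric with a leading zero, empty,
// or containing a character outside [0-9A-Za-z-].
const rejected = ["01", "00", "", "a_b"];

for (const id of [...accepted, ...rejected]) {
  console.log(JSON.stringify(id), isPreReleaseId(id));
}
```

Everything in the first list prints true and everything in the second prints false, which matches the “MUST NOT include leading zeroes” rule applying only to numeric identifiers.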

Wrapping everything together, and adding enclosing parentheses followed by ? to the pre-release and build-metadata patterns (since they are optional), I ended up with:

^(\d|[1-9]\d*)\.(\d|[1-9]\d*)\.(\d|[1-9]\d*)(-(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*)(\.(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*))*)?(\+[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?$

Just look at this beauty! That’s an insanely long regex; I don’t think I ever wrote something like that before. It’s interesting to note that I wouldn’t have been able to write it had I tried to do it all at once. With long and complex regular expressions you start to lose track of the patterns and parentheses pretty quickly, so writing each part separately and then composing the parts into the final pattern is the only sane way to do it.
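
To make the “compose it from parts” idea concrete, here’s a minimal Node.js sketch that builds the same pattern out of named pieces. The constant names (INT, PRE_ID, BUILD_ID and so on) and this isValid function are assumptions for the sake of the example, not necessarily how the libraries are written. String.raw is used so the backslashes stay exactly as typed.

```javascript
// Smallest building blocks (hypothetical names).
const INT = String.raw`(\d|[1-9]\d*)`;
const PRE_ID =
  String.raw`(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*)`;
const BUILD_ID = String.raw`[0-9A-Za-z-]+`;

// The three sections of a semver, each built from the blocks above.
const CORE = String.raw`${INT}\.${INT}\.${INT}`;
const PRE_RELEASE = String.raw`(-${PRE_ID}(\.${PRE_ID})*)?`;
const BUILD = String.raw`(\+${BUILD_ID}(\.${BUILD_ID})*)?`;

// Final pattern: anchored, with both optional sections at the end.
const SEMVER = new RegExp(`^${CORE}${PRE_RELEASE}${BUILD}$`);

function isValid(version) {
  return SEMVER.test(version);
}

console.log(isValid("1.0.0-beta+exp.sha.5114f85")); // true
console.log(isValid("01.0.0"));                     // false, leading zero
console.log(isValid("1.0.0-alpha..1"));             // false, empty identifier
```

SEMVER.source comes out character for character identical to the long pattern above, but each piece can now be read (and tested) on its own.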

With the whole pattern encoded in a single regex it becomes easy to manipulate semantic versions, so I used it to create two quasi-identical libraries for that purpose, one for Node.js and the other for PHP ¹.

Apart from the regex itself the libraries are pretty basic. They have only three methods: isValid, which validates the semver using the regex; split, which splits the semver into its parts (also using the regex); and compare, which compares two semvers following the precedence rules described in https://semver.org/#spec-item-11. The test cases were all taken from the specification’s main page; I put them in a JSON structure so that both projects could share them.
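
For a feel of what split and compare involve, here’s a rough Node.js sketch under a few assumptions: the return shape of split, the helper names compareNumeric and compareIdentifiers, and the use of BigInt for numeric comparison are all choices made for this example, not a description of the published libraries. The capture group numbers (4 for the pre-release, 8 for the build metadata) come from counting the opening parentheses in the regex above.

```javascript
const SEMVER = /^(\d|[1-9]\d*)\.(\d|[1-9]\d*)\.(\d|[1-9]\d*)(-(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*)(\.(0|[1-9A-Za-z-][0-9A-Za-z-]*|[0-9]*[A-Za-z-][0-9A-Za-z-]*))*)?(\+[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?$/;

// Split a semver into its parts; throws on invalid input.
function split(version) {
  const m = SEMVER.exec(version);
  if (m === null) throw new Error(`Invalid semver: ${version}`);
  return {
    major: m[1],
    minor: m[2],
    patch: m[3],
    // Groups 4 and 8 include the leading "-" / "+", so drop it first.
    preRelease: m[4] ? m[4].slice(1).split(".") : [],
    build: m[8] ? m[8].slice(1).split(".") : [],
  };
}

// Compare two strings of digits numerically (BigInt avoids precision loss).
function compareNumeric(a, b) {
  const x = BigInt(a);
  const y = BigInt(b);
  return x < y ? -1 : x > y ? 1 : 0;
}

// Compare two pre-release identifiers per spec item 11.
function compareIdentifiers(a, b) {
  const aNum = /^[0-9]+$/.test(a);
  const bNum = /^[0-9]+$/.test(b);
  if (aNum && bNum) return compareNumeric(a, b);
  if (aNum) return -1; // numeric identifiers rank below alphanumeric ones
  if (bNum) return 1;
  return a < b ? -1 : a > b ? 1 : 0; // ASCII order
}

// Returns -1, 0 or 1. Build metadata is ignored, as the spec requires.
function compare(v1, v2) {
  const a = split(v1);
  const b = split(v2);
  for (const key of ["major", "minor", "patch"]) {
    const c = compareNumeric(a[key], b[key]);
    if (c !== 0) return c;
  }
  const pa = a.preRelease;
  const pb = b.preRelease;
  if (pa.length === 0 && pb.length === 0) return 0;
  if (pa.length === 0) return 1;  // a release outranks its pre-releases
  if (pb.length === 0) return -1;
  const shared = Math.min(pa.length, pb.length);
  for (let i = 0; i < shared; i++) {
    const c = compareIdentifiers(pa[i], pb[i]);
    if (c !== 0) return c;
  }
  // All shared identifiers equal: the longer list wins.
  return Math.sign(pa.length - pb.length);
}

const versions = ["1.0.0", "1.0.0-rc.1", "1.0.0-alpha", "1.0.0-beta.11",
                  "1.0.0-beta.2", "1.0.0-alpha.1"];
console.log(versions.sort(compare));
// [ '1.0.0-alpha', '1.0.0-alpha.1', '1.0.0-beta.2',
//   '1.0.0-beta.11', '1.0.0-rc.1', '1.0.0' ]
```

Because compare returns a negative, zero, or positive number, it can be handed straight to Array.prototype.sort, as in the last lines.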

¹ Mostly because these are the languages that I’ve been using the most at work recently, and we actually needed semver parsing there.