Archive for June, 2015

Regular Expression for Time in R with Fractional Seconds

Dealing with data can be full of wonderful surprises, like suddenly needing to convert from one date/time format to another and making sure user-input strings are in the correct format.  One aspect of time parsing that doesn’t come up too often outside of high frequency trading is dealing with fractional seconds.  R has lots of excellent libraries for dealing with date formats, like lubridate.  Lubridate has a parsing function that can recognize fractional seconds, but IMHO it’s not a top tier use case, since you have to set a global option (and set it back) to see them:

## ** fractional seconds **
library(lubridate)
op <- options(digits.secs=3)
dmy_hms("20/2/06 11:16:16.683")
## "2006-02-20 11:16:16.683 UTC"
options(op)

(example from man page reprinted at inside-R)

But a humble regular expression can also do the job of recognizing when a string is in the correct format.  So in the following I’ll go through an example of how to quickly build up a working regular expression in R that recognizes correct time strings in 24-hour format, with optional seconds and optional fractional seconds up to microsecond precision.  The key to using regular expressions is to start out simply and test often.

Step 1: Starting out simply

Starting out with hours and minutes only, we basically want two digits in the range [0-9] after the start (^) of the string, followed by a colon, followed by two more digits in the range [0-9] at the end of the string ($).  Using the {N} occurrence operator to specify exactly how many occurrences of the preceding element we want, we can quickly come up with the following and test it using grep, which returns the indices of the elements that match:

> grep("^[0-9]{2}:[0-9]{2}$", c(good="22:00", bad="111:22"))
[1] 1

So the regexp matches the first, model string and rejects the second malformed string.  But there’s a problem: people write times all the time without two mandatory digits in the hour.  To fix that, we go from the above to something slightly more complicated, using the zero or one occurrence (?) operator:

> grep("^[0-9]?[0-9]:[0-9]{2}$", c(good=c("22:00", "2:00"), bad=c("111:22", ":33")))
[1] 1 2

This worked as expected, matching the first two “good” elements and rejecting the bad ones.  (Note that we added to our good and bad sets as we went along.  This is important for keeping it real: regexps can get very complicated very quickly.)
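
The test-as-you-go habit can be wrapped in a small helper.  What follows is only a minimal sketch: the name check_pattern is made up for illustration and is not part of base R or any package.  It returns TRUE only when every “good” string matches and every “bad” string is rejected, so a regression shows up as a single FALSE.

```r
# Hypothetical helper (not from any package): TRUE only if the pattern
# accepts every string in `good` and rejects every string in `bad`.
check_pattern <- function(pattern, good, bad) {
  all(grepl(pattern, good)) && !any(grepl(pattern, bad))
}

good <- c("22:00", "2:00")
bad  <- c("111:22", ":33")

check_pattern("^[0-9]{2}:[0-9]{2}$", good, bad)      # FALSE: "2:00" is rejected
check_pattern("^[0-9]?[0-9]:[0-9]{2}$", good, bad)   # TRUE
```

Using grepl rather than grep here is a design choice: it gives one logical value per input string, which is easier to combine with all() and any() than a vector of indices.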

Step 2: Add complexity as needed

Now let’s add in the optional seconds.  To do this, we will add a group () at the end to represent the seconds and make the whole group optional with the zero or one occurrence (?) operator:

> grep("^[0-9]?[0-9]:[0-9]{2}(:[0-9]{2})?$", 
              c(good=c("22:00", "2:00", "3:34:33"), 
                 bad=c("111:22", ":33", "3:44:4")))
[1] 1 2 3

What about fractional seconds?  Easy, add an optional group within the just-added optional group.  But to go to microseconds, we want to restrict fractional digits to between 1 and 6 occurrences, which you can do by giving two arguments to the occurrence operator: {1,6}.  (Since “.” is a metacharacter in regular expressions that matches any single character, don’t forget to escape it so that it matches only a literal decimal point.  The backslash is doubled because R string literals use the backslash as their own escape character, so “\\.” in R source reaches the regexp engine as “\.”.)

> grep("^[0-9]?[0-9]:[0-9]{2}(:[0-9]{2}(\\.[0-9]{1,6})?)?$", 
             c(good=c("22:00", "2:00", "3:34:33", "12:02:22.2345"), 
               bad=c("111:22", ":33", "3:44:4", "3:12:00.")))
[1] 1 2 3 4
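
To make the escaping point concrete, here is a small sketch that compares an escaped and an unescaped dot on a stripped-down pattern.  The variable names escaped and unescaped are illustrative only, and the shortened pattern is not the one used in the rest of this post.

```r
# In R source, "\\." is the two-character string \.
nchar("\\.")                                 # 2

escaped   <- "^[0-9]{2}\\.[0-9]{1,6}$"
unescaped <- "^[0-9]{2}.[0-9]{1,6}$"

grepl(escaped,   c("22.2345", "22x2345"))    # TRUE FALSE
grepl(unescaped, c("22.2345", "22x2345"))    # TRUE TRUE: "." matches any character
```

The unescaped version still accepts the good string, which is exactly why a “bad” test case is needed to catch the mistake.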

Step 3: Polish it up

The main use cases are taken care of, but the astute reader will notice that although the regular expression counts the digits correctly, it nonetheless allows for nonsense times like “4:71” or “33:23:11.223”.  Let’s modify the above to allow for minutes and seconds from 00 – 59 only:

> grep("^[0-9]?[0-9]:[0-5][0-9](:[0-5][0-9](\\.[0-9]{1,6})?)?$", 
      c(good=c("22:00", "2:00", "3:34:33", "12:02:22.2345"), 
         bad=c("111:22", ":33", "3:44:4", "3:12:00.", "3:66", "3:23:99")))
[1] 1 2 3 4

Here we’ve simply replaced the [0-9]{2} in minutes and seconds with an explicit pair of digits: a leading [0-5] followed by a [0-9].  The hour is a little trickier, in that when the leading digit is a “2” the second digit must be in [0-3] instead of the full [0-9].  We will break up the hours into subexpressions using the “|” logical OR operator:

> grep("^([0-1]?[0-9]|2[0-3]):[0-5][0-9](:[0-5][0-9](\\.[0-9]{1,6})?)?$", 
     c(good=c("22:00", "2:00", "3:34:33", "12:02:22.2345", "14:55"), 
         bad=c("111:22", ":33", "3:44:4", "3:12:00.", "3:66", "3:23:99", "25:22")))
[1] 1 2 3 4 5

Let’s look a little more closely at the logical OR.  In the first branch, we have an optional [0-1] followed by a mandatory [0-9], because the following are all good 24-hour forms: “15:00”, “04:00”, “6:00”.  The second is a mandatory 2 followed by a mandatory [0-3].  Since the second branch is OR’d with the first, it just adds new strings to the acceptable forms: “23:30” for example.
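
Because the hour part has only a handful of possible inputs, it can be checked exhaustively.  The sketch below isolates the hour subexpression, anchors it on both sides, and lists the candidates it rejects; the names hour_re, plain and padded are just illustrative.

```r
# The hour sub-expression on its own, anchored at both ends.
hour_re <- "^([0-1]?[0-9]|2[0-3])$"

# Candidate hours 0..29, written without and with a leading zero.
plain  <- as.character(0:29)        # "0", "1", ..., "29"
padded <- sprintf("%02d", 0:29)     # "00", "01", ..., "29"

plain[!grepl(hour_re, plain)]       # "24" "25" "26" "27" "28" "29"
padded[!grepl(hour_re, padded)]     # "24" "25" "26" "27" "28" "29"
```

Note that “24” is rejected, so a time such as “24:00” is not considered valid by this pattern.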

Conclusion

This blog post showed how to build up a regular expression in R for the purposes of recognizing valid time strings with fractional seconds.  A process was shown for growing the complex regular expression from simple beginnings using standard regular expression syntax and testing at every step.  The testing at every step is crucial: you can easily dig yourself into a hole with complex regular expressions, so you should test often and detect problems as they occur.
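
For reuse, the finished pattern from Step 3 can be stored once and wrapped in a function.  This is only a sketch: is_valid_time and time_re are hypothetical names chosen for this example.

```r
# Final pattern from Step 3, stored once so it is not retyped.
time_re <- "^([0-1]?[0-9]|2[0-3]):[0-5][0-9](:[0-5][0-9](\\.[0-9]{1,6})?)?$"

# Hypothetical wrapper: one TRUE/FALSE per input string.
is_valid_time <- function(x) grepl(time_re, x)

is_valid_time(c("14:55", "12:02:22.2345", "25:22", "3:12:00."))
# TRUE TRUE FALSE FALSE
```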

Happy regexp-ing!
