Exploring HBase

I posted a small tutorial to help those just starting to explore HBase at Code Project Apache HBase Example Using Java with associated code available on GitHub.

I’m just starting out my exploration of this technology.  There are more than 50 NoSQL database offerings, and they each seem to be targeting specific use cases.  I don’t think there is a single one-size-fits-all “NoSQL” technology; they all seem to deliver performance by relaxing the relational model in slightly different ways.  If you need a general, flexible solution and you don’t have petabytes of data, a traditional RDBMS may be for you, with limited scalability as the trade-off.

But if you really need the scalability, then it is imperative to pick the NoSQL solution that most closely mirrors your expected use case.

I started building R 3.2.2 on Fedora 22 today, and I got the dreaded

configure: WARNING: you cannot build info or HTML versions of the R manuals
configure: WARNING: you cannot build PDF versions of the R manuals
configure: WARNING: you cannot build info or HTML versions of vignettes and help pages

And Google turned up about 99 solutions telling me to go read the manual.

But hey, I figure I’ll read the manual so that you don’t have to.  Here is the installation procedure I followed.  It probably won’t match the exact magical incantations you’d need on another flavor of Linux.  But at least you can use this as a guide to get some additional ideas on what packages you have to have, and then create your own spells.

Downloading the Source

The source can be downloaded as a .tar.gz file from www.r-project.org.

Preparing the System to Build

First, if you don’t already have it, you’ll need to install a compiler.  In fact, you’ll need three: a C compiler, C++ compiler, and a Fortran 90 compiler.

sudo yum install gcc
sudo yum install gcc-c++
sudo yum install gcc-gfortran

Many R packages also depend on Java, so you may want to download the latest JRE from www.java.com

To install kernel headers and development libraries, yum has a nice group install feature.  You’ll need to install the following groups:

sudo yum groupinstall "Development tools"
sudo yum groupinstall "Development Libraries"
sudo yum groupinstall "X Software Development"

Installing Stuff That R Needs

A close reading of the R build and installation manual above reveals that there are a number of packages required on Linux, and while it would have been nice if they had provided them in a list, here is a list of things I needed to install.  (Some of these may come up already installed on your system.)

sudo yum install zlib
sudo yum install lzma
sudo yum install curl
sudo yum install pcre
sudo yum install bzip2

And then there are the TeX libraries

sudo yum install texinfo
sudo yum install texinfo-tex
sudo yum install tex
sudo yum install texlive-scheme-basic    # A.K.A. LaTeX, lol!
sudo yum install texlive-inconsolata        #  A font needed by R 3.2.2 manuals

Configuring and Building

This is usually a short process: configure, then build, then install.  There should be a script called configure in the R-3.2.2 directory.  Most people will want to build R as a shared library, especially if they want to use it with RStudio, so configure should be run with the --enable-R-shlib option.  If the configure step fails, it usually does so with a message that indicates a missing package.  If there is something missing that is not covered above, you may have to google for the exact package name given the message.  Then you can make, make pdf, and make install.

./configure --enable-R-shlib
make
make pdf
sudo make install

That’s it!

Now maybe you don’t need to make pdf, but when I skipped that step and tried to make install, it complained that “NEWS.pdf” was missing.
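Once the install finishes, a quick sanity check from inside R can confirm what was actually built. This is only a sketch: it assumes a Linux layout where the shared library lands in the lib directory under R's home, which may differ on other systems.

```r
# Which R is this?
R.version.string

# Versions of the compression and regex libraries this R was built against.
extSoftVersion()

# Assumption: with --enable-R-shlib on Linux, libR.so sits under R_HOME/lib.
has_shlib <- file.exists(file.path(R.home(), "lib", "libR.so"))
has_shlib
```

If has_shlib comes back FALSE, it is worth re-checking that configure was run with --enable-R-shlib before pointing RStudio at this build.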

Yamslam Odds

Yamslam is a dice game by Blue Orange Games that our family loves to play. The game is exciting, and we have a tradition that if one of us gets a Large Straight or a Yamslam (the highest roll), they slap the 50-point chip on their forehead, raise their arms in the air and shout “Yeaaaahhhh!”

In this blog post, I will calculate the odds of rolling various Yamslam rolls in one roll.

Possible Rolls

There are five dice and three chances to roll during each turn. The player chooses which dice to keep and which to re-roll to improve their lot.

The rolls in Yamslam roughly follow poker hands, listed here in order from highest to lowest: Yamslam (5 of a kind), Large Straight (5 in a row), Four of a Kind, Full House, Flush (all evens or all odds), Small Straight (4 in a row), Three of a Kind, and Two Pair. In addition to the scoring rolls, I will also calculate the odds of getting One Pair and our family’s unofficial “nothing” roll, Bupkiss.

Calculation

Since there are 5 dice and each die can take on one of 6 values, there are 6^5 = 7776 possible rolls. In order to come up with basic probabilities for the above Yamslam rolls, we will basically count the occurrences of each type of roll and divide by 7776.
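The counting can also be done by brute force, which is a handy cross-check on the combinatorics. Here is a minimal sketch in R that enumerates every ordered roll and counts the ones passing a test; the names rolls, count_rolls, and is_yamslam are my own, not part of any package.

```r
# Every ordered roll of five six-sided dice, one roll per row.
rolls <- as.matrix(expand.grid(rep(list(1:6), 5)))
nrow(rolls)   # 6^5 = 7776

# Count the rolls that satisfy a TRUE/FALSE test.
count_rolls <- function(test) sum(apply(rolls, 1, test))

# Example test: all five dice show the same number.
is_yamslam <- function(r) length(unique(r)) == 1

count_rolls(is_yamslam)                 # 6
count_rolls(is_yamslam) / nrow(rolls)   # about 0.00077, i.e. 0.08%
```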

For each roll, I will count both inclusively and exclusively. Inclusive means that the count will include higher scoring rolls that also match the given roll. For example, a Three of a Kind inclusive count would include Yamslam, Four of a Kind, and Full House as well as Three of a Kind because those rolls include a Three of a Kind.

Yamslam

The basic pattern of this roll is AAAAA. Of all 7776 possible rolls of 5 dice, exactly 6 of these have the Yamslam, or 5 of a kind, pattern. Since Yamslam is the highest possible roll, this count is both inclusive and exclusive.

Yamslam Inclusive Exclusive
Count 6 6
Probability 0.08% 0.08%

Large Straight

The pattern is ABCDE with successive numbers. For five dice, there are only two kinds of Large Straights possible: 1-2-3-4-5 and 2-3-4-5-6. But each die is distinct and can be arranged in any position, so the total count is 2 * 5! = 2 * (5 * 4 * 3 * 2 * 1) = 240.

If you don’t see where the 5! comes from, consider that for any sequence of 5 distinct elements, there is a choice of 5 places where the first element can go, times 4 remaining choices of where the second element can go, times 3 remaining choices of where the third element can go, times 2 remaining choices of where the fourth element can go, times 1 remaining spot for the last element.

This count is also both inclusive and exclusive since there is no way a Yamslam can masquerade as a Large Straight.
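As a rough check on the 2 * 5! figure, the sketch below counts Large Straights by enumeration. It relies on the observation that five distinct values whose largest and smallest differ by exactly 4 must be consecutive; the helper name is_large_straight is just for illustration.

```r
factorial(5)       # 120 orderings of five distinct dice
2 * factorial(5)   # 240

# Brute force over all ordered rolls.
rolls <- as.matrix(expand.grid(rep(list(1:6), 5)))
is_large_straight <- function(r) length(unique(r)) == 5 && diff(range(r)) == 4
n_large <- sum(apply(rolls, 1, is_large_straight))
n_large            # 240
```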

Large Straight Inclusive Exclusive
Count 240 240
Probability 3.09% 3.09%

Four of a Kind

This pattern is AAAAB. To count up the possible Four of a Kind rolls, consider that there are 6 choices for A, and 5 remaining choices for die B. Also, the remaining die can be arranged in any one of 5 positions. So the exclusive count is 6 * 5 * 5 = 150. Now every Yamslam can also be considered a Four of a Kind, so the inclusive count should include the 6 possible Yamslams for a total of 156.

Four of a Kind Inclusive Exclusive
Count 156 150
Probability 2.01% 1.93%

Full House

The Full House pattern is AAABB where A and B are distinct numbers. This means that neither a Yamslam nor a Four of a Kind can ever be counted as a Full House, and certainly neither can a Large Straight. So the exclusive and inclusive counts will be the same. There are 6 choices for the A and 5 choices for the B for a total of 30 distinct AAABB number pairings, but how many arrangements are there? There are 5! ways to arrange 5 distinct dice, but the A’s can be rearranged amongst themselves in 3! ways and the two B’s can be rearranged in 2! ways, so the total count is 6 * 5 * 5! / (3! * 2!) = 300.

The quantity 5! / (2! * 3!) is also known as the combination of 5 objects taken 2 at a time and can be written as C(5,2). Another way to think about the foregoing calculation is to think of a bag containing the numbers 1 through 5 representing where the B’s would show up in a list of five slots, and you pull out 2 of those without replacement. How many different combinations do you come out with? 5! / (2! * 3!) = C(5,2).
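R has this built in as choose(), and combn() will list the actual slot pairs if you want to see them. A small sketch of the Full House arithmetic (the variable name full_house is mine):

```r
choose(5, 2)                                   # 10
factorial(5) / (factorial(2) * factorial(3))   # 10, the same quantity spelled out

# Each column is one way to pick the two slots that hold the B's.
slots <- combn(5, 2)
ncol(slots)                                    # 10

full_house <- 6 * 5 * choose(5, 2)
full_house                                     # 300
```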

Full House Inclusive Exclusive
Count 300 300
Probability 3.86% 3.86%

Flush

The Flush is where it starts to get interesting. In Yamslam, the flush is when all of the dice are odd or all are even. There’s no simple pattern for the Flush and it overlaps with many other rolls. But, for each set of evens or of odds there are 3 choices for each of 5 dice in the roll: 3^5 = 3*3*3*3*3 = 243, so the inclusive Flush count is 2 * 243 = 486.

To get the exclusive count of flushes we have to subtract off the cases where a Four of a Kind, a Full House, and a Yamslam are also flushes. (A Large Straight is never a flush.)

Obviously, all Yamslams are also flushes. How many Four of a Kind rolls are also flushes? There are 6 choices for the four matching dice, and for each of those there are 2 choices for the fifth die that make a Flush (the other two numbers of the same parity), and 5 choices for where the fifth die goes: 6 * 2 * 5 = 60 Four of a Kinds that are also Flushes. Similarly, in a Full House, for each of 6 choices for the triple, there are 2 choices for the pair to make a flush, and there are C(5,2) = 5! / (3! * 2!) arrangements just like before for a total of 6 * 2 * 5!/(3! * 2!) = 120 Full Houses that are also Flushes.

Therefore the exclusive number of Flushes is 486 – (6 + 60 + 120) = 300.
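The overlap bookkeeping is easy to get wrong, so here is a hedged brute-force sketch of the same subtraction. It labels each roll by its two largest value counts (for example "3 2" for a Full House); the names is_flush, shape, and flush are illustrative only.

```r
rolls <- as.matrix(expand.grid(rep(list(1:6), 5)))

# A Flush is all even or all odd.
is_flush <- function(r) all(r %% 2 == 0) || all(r %% 2 == 1)
flush <- apply(rolls, 1, is_flush)

# Shape of a roll: its two largest value counts, e.g. "5 0", "4 1", "3 2".
shape <- apply(rolls, 1, function(r) {
  k <- sort(tabulate(r, nbins = 6), decreasing = TRUE)
  paste(k[1], k[2])
})

sum(flush)                                          # inclusive: 486
table(shape[flush])                                 # how the Flushes split by shape
sum(flush & !(shape %in% c("5 0", "4 1", "3 2")))   # exclusive: 300
```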

Flush Inclusive Exclusive
Count 486 300
Probability 6.25% 3.86%

Small Straight

There are three kinds of Small Straights in Yamslam: 1-2-3-4, 2-3-4-5, and 3-4-5-6. Counting the Small Straights is tricky because the combinatorics are different in the case where all of the dice are different versus the case where there is a Small Straight and a Pair. Consider the case where all of the dice are different: ABCDE. The distinct cases are 1-2-3-4-5, 1-2-3-4-6, 2-3-4-5-6, and 1-3-4-5-6. Since the dice are all distinct, there are 5! arrangements of each, for a total of 4 * 5! = 480 rolls.

Now let’s consider the case where there are pairs: AABCD. For each of the three Small Straights 1-2-3-4, 2-3-4-5, and 3-4-5-6, there are 4 choices for the pair. But how to arrange the results? There are C(5,2) ways that the pair can be distributed among the five available slots, but since the remaining 3 dice are distinct, we multiply by 3 choices for the first remaining die, 2 choices for the next and 1 for the last. That gives us 3 * 4 * 5!/(2! * 3!) * 3! = 720 additional rolls, for a total of 480 + 720 = 1200 Small Straights, inclusive.

To get the exclusive figure, we need to count how many of these Small Straights include Large Straights. A careful observer will note that all 240 of the Large Straights were already included in the count. So then the exclusive figure for Small Straights is 1200 – 240 = 960.
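Since this is the trickiest count so far, a brute-force check seems worthwhile. The sketch below tests each roll for any of the three four-in-a-row runs; the helper names has_small and is_large are my own.

```r
rolls <- as.matrix(expand.grid(rep(list(1:6), 5)))

# The three possible runs of four.
runs <- list(1:4, 2:5, 3:6)
has_small <- function(r) any(vapply(runs, function(s) all(s %in% r), logical(1)))

# Five distinct values spanning exactly 4 are five in a row.
is_large <- function(r) length(unique(r)) == 5 && diff(range(r)) == 4

small <- apply(rolls, 1, has_small)
large <- apply(rolls, 1, is_large)

sum(small)            # inclusive: 1200
sum(small & !large)   # exclusive: 960
```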

Small Straight Inclusive Exclusive
Count 1200 960
Probability 15.43% 12.35%

Three of a Kind

To get the figure for Three of a Kind, consider the pattern AAABC exclusive of Yamslams, Four of a Kinds, and Full Houses. There are 6 choices for A with C(5,3) arrangements of the 3 A’s among 5 slots. Since the remaining dice are distinct, there are 5 choices for B and 4 choices for C for a total of 6 * 5 * 4 * 5! / (3! * 2!) = 1200.

But, this is a partially inclusive figure because many of these are also Flushes. The number of Three of a Kind rolls above that are also Flushes is 6 choices for the A with 2 choices for B (once A is fixed) and then C is determined. The arrangements are the same, giving 6 * 2 * 5! / (3! * 2!) = 120 Three of a Kind rolls that are also Flushes. So the exclusive total count for Three of a Kind is 1200 – 120 = 1080.

To get the inclusive figure, we start with the 1200 above and simply add in the Yamslams, Four of a Kind, and Full Houses. So the inclusive figure is 1200 + 6 + 150 + 300 = 1656.
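The Three of a Kind arithmetic can be replayed in a few lines of R. This is just the formulas from the text typed into choose(), with variable names of my own choosing.

```r
partial <- 6 * choose(5, 3) * 5 * 4   # AAABC pattern: 1200
flushes <- 6 * choose(5, 3) * 2       # AAABC rolls that are also Flushes: 120

partial - flushes                     # exclusive: 1080
partial + 6 + 150 + 300               # inclusive: 1656
```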

Three of a Kind Inclusive Exclusive
Count 1656 1080
Probability 21.30% 13.89%

Two Pair

The basic Two Pair pattern is AABBC, exclusive of Yamslam, Four of a Kind, Full House, and Three of a Kind. There are 6 choices for A, 5 remaining choices for B, and 4 remaining choices for C. But we’ve overcounted because there are two pairs: for example the case with A = 1 and B = 2 is double-counted by the case where A = 2 and B = 1. Since the pair AA and BB are indistinguishable as pairs, there are C(6,2) = 6! / (4! * 2!) = 6 * 5 / 2 = 15 distinct assignments to A and B. The first pair can be arranged in C(5,2) and the second pair in C(3,2) remaining slots. This gives a total of C(6,2) * C(5,2) * C(3,2) * 4 = 15 * 10 * 3 * 4 = 1800 cases.

But like the Three of a Kind case, some of these overlap with Flushes. For each type of Flush (even or odd) in the AABBC pattern, there are C(3,2) distinct assignments for A and B, C is determined, and the number of arrangements is the same for a total of 2 * C(3,2) * C(3,2) * C(5,2) = 2 * 3 * 3 * 10 = 180 overlaps with Flushes. So the exclusive count for Two Pair is 1800 – 180 = 1620.

To get the inclusive count, start with the 1800 figure and add in the Full Houses to get 1800 + 300 = 2100.
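The same kind of quick replay works for Two Pair. Again this is only the text's formulas written with choose(); the variable names are illustrative.

```r
partial <- choose(6, 2) * choose(5, 2) * choose(3, 2) * 4   # AABBC pattern: 1800
flushes <- 2 * choose(3, 2) * choose(3, 2) * choose(5, 2)   # also Flushes: 180

partial - flushes   # exclusive: 1620
partial + 300       # inclusive, adding the Full Houses: 2100
```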

Two Pair Inclusive Exclusive
Count 2100 1620
Probability 27.01% 20.83%

One Pair and Bupkiss

The Two Pair roll is the lowest scoring roll in Yamslam. One Pair and Bupkiss (aka “nothing”) are both scored as zero. However, it is convenient to count them separately.

The One Pair pattern is AABCD. There are 6 choices for A, and there are C(5,2) arrangements of the pair in five slots. The remaining dice are distinct, and so there are 5 remaining choices for the next die, times 4 remaining choices for the next die, times 3 for the last die: 6 * C(5,2) * 5 * 4 * 3 = 3600.

Does this include any Flushes? Actually, since the four values A, B, C, and D are all distinct, and there are only 3 values available in a Flush, by the Pigeonhole Principle there are no Flushes with this pattern. But there are overlaps with Small Straights. We already counted these above, and there are 720 Small Straights that are also One Pairs. So the exclusive One Pair count is 3600 – 720 = 2880.

To get the inclusive figure, start with the 3600 figure and add in everything else that contains a pair: Two Pair, Three of a Kind, Full House, Four of a Kind, and Yamslam. Use the partially inclusive 1800 and 1200 figures for Two Pair and Three of a Kind here, because the exclusive 1620 and 1080 figures leave out rolls that are also Flushes, and every Flush contains at least a pair: 3600 + 1800 + 1200 + 300 + 150 + 6 = 7056. As a check, the only rolls without a pair are the 6 * 5! = 720 rolls with five distinct values, and 7776 – 720 = 7056.

Bupkiss is surprisingly easy. To count pure Bupkisses, one only needs to realize that there are only two distinct pure Bupkiss patterns: 1-2-3-5-6 and 1-2-4-5-6. Each of these can be arranged in 5! ways for an inclusive and exclusive total of 2 * 5! = 240.

Conclusion

In the above sections, we have counted all of the occurrences of different Yamslam rolls. These are collected in the table below. To check that all of the cases are accounted for, we sum up the exclusive counts and arrive at the expected 7776.

Roll              Inclusive Count   Inclusive Probability   Exclusive Count   Exclusive Probability
Yamslam           6                 0.08%                   6                 0.08%
Large Straight    240               3.09%                   240               3.09%
Four of a Kind    156               2.01%                   150               1.93%
Full House        300               3.86%                   300               3.86%
Flush             486               6.25%                   300               3.86%
Small Straight    1200              15.43%                  960               12.35%
Three of a Kind   1656              21.30%                  1080              13.89%
Two Pair          2100              27.01%                  1620              20.83%
One Pair          7056              90.74%                  2880              37.04%
Bupkiss           240               3.09%                   240               3.09%
Total                                                       7776              100.00%
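For anyone who wants to double-check the exclusive column, here is a sketch of a brute-force classifier in R. It assigns every one of the 7776 ordered rolls to the highest-ranking category it matches, using the ranking given earlier in the post; the function name classify and its helper has_run are my own.

```r
rolls <- as.matrix(expand.grid(rep(list(1:6), 5)))

classify <- function(r) {
  k <- sort(tabulate(r, nbins = 6), decreasing = TRUE)   # value counts, largest first
  u <- unique(r)
  # TRUE if the roll contains n consecutive values.
  has_run <- function(n) {
    any(vapply(1:(7 - n), function(s) all(s:(s + n - 1) %in% u), logical(1)))
  }
  # Checks run from highest-scoring roll to lowest.
  if (k[1] == 5) return("Yamslam")
  if (has_run(5)) return("Large Straight")
  if (k[1] == 4) return("Four of a Kind")
  if (k[1] == 3 && k[2] == 2) return("Full House")
  if (all(r %% 2 == 0) || all(r %% 2 == 1)) return("Flush")
  if (has_run(4)) return("Small Straight")
  if (k[1] == 3) return("Three of a Kind")
  if (k[1] == 2 && k[2] == 2) return("Two Pair")
  if (k[1] == 2) return("One Pair")
  "Bupkiss"
}

exclusive <- table(apply(rolls, 1, classify))
exclusive
sum(exclusive)                               # 7776
round(100 * exclusive / sum(exclusive), 2)   # exclusive probabilities in percent
```

Because each roll lands in exactly one category, the counts have to sum to 7776; the interesting part is whether the individual counts agree with the hand calculations.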

That’s it!

This has been done to death, but I wrote a brief introduction to Basic Probability Distributions in R on Rpubs.  One thing that this introduction has going for it that I don’t see in many other places is that it brings together plots of each of the basic distribution functions in R along with some examples for how they are used.  The hope is that the reader will get both a feel for the shapes of the probability distributions and will gain an understanding of the three standard kinds of distribution functions offered by R: the probability density, the cumulative distribution, and the quantile function. The following probability distributions are covered.

  • Normal Distribution: dnorm, pnorm, and qnorm
  • Poisson Distribution: dpois, ppois, and qpois
  • Binomial Distribution: dbinom, pbinom, and qbinom
  • Exponential Distribution: dexp, pexp, and qexp
  • Chi Square Distribution: dchisq, pchisq, and qchisq

In general, R has excellent online documentation, but it can be a little dry.  It can be tough to remember the differences between p-this and d-that and q-who, and I find that it helps me to remember these if I visualize the functions and work a couple of examples.
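As a small taste of the d/p/q pattern, here are a few calls on the normal and binomial distributions. These are my own illustrative examples rather than the ones from the Rpubs write-up.

```r
# d = density (or probability mass), p = cumulative probability, q = quantile.
dnorm(0)         # height of the standard normal curve at 0: 1/sqrt(2*pi), about 0.3989
pnorm(1.96)      # P(X <= 1.96) for a standard normal, about 0.975
qnorm(0.975)     # the inverse of pnorm: about 1.96

dbinom(3, size = 10, prob = 0.5)     # P(exactly 3 heads in 10 fair flips) = 120/1024
pbinom(3, size = 10, prob = 0.5)     # P(at most 3 heads) = 176/1024
qbinom(0.5, size = 10, prob = 0.5)   # median number of heads: 5
```

The p and q functions undo each other for continuous distributions, so qnorm(pnorm(1.96)) gets you back to 1.96.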

Anyway, I hope someone finds this useful!

Dealing with data can be full of wonderful surprises, like suddenly needing to convert from one date/time format to another and making sure user-input strings are in the correct format.  One aspect of time parsing that doesn’t come up too often outside of high frequency trading is dealing with fractional seconds.  R has lots of excellent libraries for dealing with date formats, like lubridate. Lubridate has a parsing function that can recognize fractional seconds, but IMHO it’s not a top tier use case since you have to set a global option (and set it back):

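## ** fractional seconds **
library(lubridate)
op <- options(digits.secs=3)   # show 3 fractional digits; op keeps the old setting
dmy_hms("20/2/06 11:16:16.683")
## "2006-02-20 11:16:16.683 UTC"
options(op)                    # restore the previous setting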

(example from man page reprinted at inside-R)

But a humble regular expression can also do the job of recognizing when a string is in correct format.  So in the following I’ll go through an example of how to quickly build up a functional, working regular expression in R to recognize correct time strings in 24-hour format with optional seconds and optional fractional seconds up to microsecond precision.  The key to using regular expressions is to start out simply and test often.

Step 1: Starting out simply

Starting out with hours and minutes only, we basically want two digits after the start (^) of the string, followed by a colon, and followed by two digits in range [0-9] at the end of the string ($).  Using the {N} occurrence operator to specify exactly how many occurrences of the preceding element we want, we can quickly come up with the following and test using grep:

> grep("^[0-9]{2}:[0-9]{2}$", c(good="22:00", bad="111:22"))
[1] 1

So the regexp matches the first, model string and rejects the second malformed string.  But there’s a problem: people write times all the time without two mandatory digits in the hour.  To fix that, we go from the above to something slightly more complicated, using the zero or one occurrence (?) operator:

> grep("^[0-9]?[0-9]:[0-9]{2}$", c(good=c("22:00", "2:00"), bad=c("111:22", ":33")))
[1] 1 2

This worked as expected, matching the first two “good” elements and rejecting the bad ones.  (Note we added to our good and bad sets as we went along.  This is important to keep it real, because regexps can get very complicated very quickly.)
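If eyeballing grep indices gets tedious, the good/bad idea can be wrapped in a tiny helper. This is just a sketch; check_pattern is a made-up name, and it uses grepl, which returns TRUE or FALSE for each string.

```r
# TRUE only if every good string matches and no bad string does.
check_pattern <- function(pattern, good, bad) {
  all(grepl(pattern, good)) && !any(grepl(pattern, bad))
}

check_pattern("^[0-9]?[0-9]:[0-9]{2}$",
              good = c("22:00", "2:00"),
              bad  = c("111:22", ":33"))    # TRUE

# The stricter two-digit-hour pattern fails once "2:00" joins the good set.
check_pattern("^[0-9]{2}:[0-9]{2}$",
              good = c("22:00", "2:00"),
              bad  = c("111:22", ":33"))    # FALSE
```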

Step 2: Add complexity as needed

Now let’s add in the optional seconds.  To do this, we will add a group () at the end to represent the seconds and make the whole group optional with the zero or one occurrence (?) operator:

> grep("^[0-9]?[0-9]:[0-9]{2}(:[0-9]{2})?$", 
              c(good=c("22:00", "2:00", "3:34:33"), 
                 bad=c("111:22", ":33", "3:44:4")))
[1] 1 2 3

What about fractional seconds?  Easy, add an optional group within the just-added optional group.  But to go to microseconds, we want to restrict fractional digits to between 1 and 6 occurrences, which you can do with two arguments to the occurrence operator.  (Since “.” is an operator in regular expressions, don’t forget to escape it with a double backslash to let it know you just want the decimal point and not the operator!)

> grep("^[0-9]?[0-9]:[0-9]{2}(:[0-9]{2}(\\.[0-9]{1,6})?)?$", 
             c(good=c("22:00", "2:00", "3:34:33", "12:02:22.2345"), 
               bad=c("111:22", ":33", "3:44:4", "3:12:00.")))
[1] 1 2 3 4

Step 3: Polish it up

The main use cases are taken care of, but the astute reader will notice that although the regular expression counts the digits correctly, it nonetheless allows for nonsense times like “4:71” or “33:23:11.223”.  Let’s modify the above to allow for minutes and seconds from 00 – 59 only:

> grep("^[0-9]?[0-9]:[0-5][0-9](:[0-5][0-9](\\.[0-9]{1,6})?)?$", 
      c(good=c("22:00", "2:00", "3:34:33", "12:02:22.2345"), 
         bad=c("111:22", ":33", "3:44:4", "3:12:00.", "3:66", "3:23:99")))
[1] 1 2 3 4

Here we’ve simply removed one of the occurrences of [0-9] in minutes and seconds and replaced the leading one with a [0-5].  The leading hour is a little more tricky, in that when the leading digit is a “2” then the second range must be [0-3] instead of the full [0-9].  We will break up the hours into sub expressions using the “|” logical OR operator:

> grep("^([0-1]?[0-9]|2[0-3]):[0-5][0-9](:[0-5][0-9](\\.[0-9]{1,6})?)?$", 
     c(good=c("22:00", "2:00", "3:34:33", "12:02:22.2345", "14:55"), 
         bad=c("111:22", ":33", "3:44:4", "3:12:00.", "3:66", "3:23:99", "25:22")))
[1] 1 2 3 4 5

Let’s look a little more closely at the logical OR.  In the first branch, we have an optional [0-1] followed by a mandatory [0-9], because the following are all good 24-hour forms: “15:00”, “04:00”, “6:00”.  The second is a mandatory 2 followed by a mandatory [0-3].  Since the second branch is OR’d with the first, it just adds new strings to the acceptable forms: “23:30” for example.
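To actually use the finished pattern for input checking, it can be wrapped in a small function. A minimal sketch follows; the names time_re and is_valid_time are mine.

```r
# 24-hour time, optional seconds, optional fraction of up to six digits.
time_re <- "^([0-1]?[0-9]|2[0-3]):[0-5][0-9](:[0-5][0-9](\\.[0-9]{1,6})?)?$"

# Returns TRUE or FALSE for each string in x.
is_valid_time <- function(x) grepl(time_re, x)

is_valid_time(c("23:30", "6:00", "12:02:22.2345"))          # TRUE TRUE TRUE
is_valid_time(c("25:22", "3:12:00.", "1:02:03.1234567"))    # FALSE FALSE FALSE
```

Using grepl rather than grep here gives one logical value per input, which is easier to drop into an if statement or a filter.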

Conclusion

This blog post showed how to build up a regular expression in R for the purposes of recognizing valid time strings with fractional seconds.  A process was shown for growing the complex regular expression from simple beginnings using standard regular expression syntax and testing at every step.  The testing at every step is crucial: you can easily dig yourself into a hole with complex regular expressions, so you should test often and detect problems as they occur.

Happy regexp-ing!

The Alpha Walkers

Standing on the platform at Ogilvie, waiting
as others pass by, to board the Wheaton Rocket!
‘Cause the last ones in are the first ones out
with the best chance to beat the evening riot.

“Ding Dong” the doors are closing, and all aboard
the laggards take their seats. But Alpha Walkers
stand all the way home, watching, pensively peering
at their phones or reading their newspapers.

Some riders get up at Elmhurst, and make their way
towards the door, forming a line. But already in
the vestibule Alphas are planning their exit,
blocking the stragglers now just wandering in.

Rounding the great Glen Ellyn Pause
on the 504 Express, the Alpha Walkers
make ready to disembark. With steely eyes they
coil themselves and glare at any talkers

stealing attention from the grim task ahead.
Now approaching: Wheaton! Steady-on, lads!
The train slows, creeps, shudders to a stop.
With a “Ding Dong” the doors open at last.

Out they pop like synchronized champagne corks
to the delight of the staff at Adelle’s.
“Ho! Look at them go!” as they pause with trays
of food and drink, taking in the day’s spectacle.

Some walk, some run! At the leading edge of
the crowd, the Alpha Walkers seize the day.
Into their cars first, onto the street first,

thus getting away
while escaping the purgatory of delay.

 

Bald Head

Bald head. Bald head.

Wearing glasses with thick black rims
and a warm winter coat of red.
But oddly no hat is perched upon your head!

Jostling my elbow as you hurry past,
where are you going so fast?

Farewell, bald head. Bald head.