Today I wrote a script in Ruby that, among other things, receives a text file as input and calculates the word count. One of the things I love about Ruby is that there’s always more than one way of doing things. But what I learned today is that not all methods are created equal. Let’s say we want to count the words in the following text:
What an excellent example of the power of dress, young Oliver
Twist was! Wrapped in the blanket which had hitherto formed his
only covering, he might have been the child of a nobleman or a
beggar; it would have been hard for the haughtiest stranger to
have assigned him his proper station in society. But now that he
was enveloped in the old calico robes which had grown yellow in
the same service, he was badged and ticketed, and fell into his
place at once–a parish child–the orphan of a workhouse–the
humble, half-starved drudge–to be cuffed and buffeted through
the world–despised by all, and pitied by none.
One way to accomplish this task is by using the scan method and passing the regular expression /\w+/ as an argument.
scan iterates over a string looking for the pattern passed to it as an argument, then returns all of the matches in an array. So let’s say we store the Oliver Twist text above in a variable named text, use the scan method to find every run of word characters with a regular expression, and then ask for the number of words found. Here’s how that would look:
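A minimal sketch of that call, assuming the passage is stored in text with a heredoc (the PASSAGE delimiter and the words variable are illustrative names, not taken from the script mentioned above):

```ruby
# The Oliver Twist passage from above, stored in a variable named text.
text = <<~PASSAGE
  What an excellent example of the power of dress, young Oliver
  Twist was! Wrapped in the blanket which had hitherto formed his
  only covering, he might have been the child of a nobleman or a
  beggar; it would have been hard for the haughtiest stranger to
  have assigned him his proper station in society. But now that he
  was enveloped in the old calico robes which had grown yellow in
  the same service, he was badged and ticketed, and fell into his
  place at once–a parish child–the orphan of a workhouse–the
  humble, half-starved drudge–to be cuffed and buffeted through
  the world–despised by all, and pitied by none.
PASSAGE

# scan returns an array holding every match of /\w+/,
# where each match is one unbroken run of word characters.
words = text.scan(/\w+/)

# length is the number of matches, which serves as the word count.
puts words.length # prints 113
```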
The scan method found every run of word characters (letters, digits, and underscores) and returned the matches in an array. The length method returns the number of elements in that array, which is the number of words found. In this case, 113 words.
Another way to count how many words are stored in the text variable is to use the split method. When no arguments are passed to the split method, it splits the string on whitespace and returns the results in an array. Calling the length method on that result also returns the number of words stored in the text variable. This is what that would look like:
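Again as a minimal sketch, with the same passage stored in text (the PASSAGE delimiter and the words variable are illustrative names):

```ruby
# The Oliver Twist passage from above, stored in a variable named text.
text = <<~PASSAGE
  What an excellent example of the power of dress, young Oliver
  Twist was! Wrapped in the blanket which had hitherto formed his
  only covering, he might have been the child of a nobleman or a
  beggar; it would have been hard for the haughtiest stranger to
  have assigned him his proper station in society. But now that he
  was enveloped in the old calico robes which had grown yellow in
  the same service, he was badged and ticketed, and fell into his
  place at once–a parish child–the orphan of a workhouse–the
  humble, half-starved drudge–to be cuffed and buffeted through
  the world–despised by all, and pitied by none.
PASSAGE

# With no arguments, split breaks the string on runs of whitespace
# (spaces and newlines) and returns the pieces in an array.
words = text.split

# length is the number of whitespace-separated pieces.
puts words.length # prints 107
```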
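The split method returned only 107 words. Do you know why this may be? The reason is that the regular expression /\w+/ matches only runs of word characters, so the scan method starts a new match at every hyphen or dash. It counted the hyphenated “half-starved” as two words when it should have been counted as one, and it also counted each dash-joined pair such as “once–a” as two words, where split treats the pair as a single token. So it seems to me that split can provide a more accurate word count when the text contains hyphenated words.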
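To see exactly where the two counts diverge, here is a small sketch that runs both methods on part of one line of the passage. The sample variable and the hyphen_aware pattern are illustrative names of my own choosing; the pattern is only one possible way to keep hyphenated words together while still breaking on dashes, not a definitive answer.

```ruby
# Part of one line of the passage: it contains a hyphenated word
# ("half-starved") and a dash-joined pair ("drudge–to").
sample = "humble, half-starved drudge–to be cuffed"

# scan with /\w+/ breaks at both the hyphen and the dash.
scanned = sample.scan(/\w+/)
p scanned
# => ["humble", "half", "starved", "drudge", "to", "be", "cuffed"]

# split breaks only on whitespace, so punctuation stays attached
# and the dash-joined pair remains a single token.
pieces = sample.split
p pieces
# => ["humble,", "half-starved", "drudge–to", "be", "cuffed"]

# Hypothetical pattern (an assumption, not from the original script):
# a run of word characters, optionally followed by more runs that are
# joined by an ASCII hyphen. The group is non-capturing so that scan
# still returns whole matches rather than captured groups.
hyphen_aware = /\w+(?:-\w+)*/
hyphenated = sample.scan(hyphen_aware)
p hyphenated
# => ["humble", "half-starved", "drudge", "to", "be", "cuffed"]
```

On this sample, scan with /\w+/ finds 7 words, split finds 5, and the hyphen-aware pattern finds 6.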
What do you think? Do you agree that using the split method can provide a more accurate word count, or can you use regular expressions to achieve the same result? Leave a comment below; I’d love to hear your thoughts.