DEV Community

Cover image for Secret Message
Ryan Palo
Ryan Palo

Posted on • Edited on • Originally published at assertnotmagic.com

Secret Message

Today I was reading through an article on Dev.to by Ben Greenberg about an interview coding challenge, and I got hooked and had to try it for myself. I highly recommend you read his original post before you read this so you have some background. That being said, for the lazy ones, let me give you...

Some Background

The challenge is this (copied from the original post):

Sort the characters in the following string:
abcdefghijklmnopqrstuvwxyz_

by the number of times the character appears in the following text (descending):

… [String omitted for brevity]

Now take the sorted string, and drop all the characters after (and including) the _. The remaining word is the answer.

You can find the super long string in the comments section of his post (where I asked for it 😁) if you want to try it yourself. You should! And share your solution! Anyways, on with the story.

My Solution (Part 1)

If it's not abundantly clear, I'm about to share my solution. If you want to work it out for yourself without spoilers, go ahead and stop reading until you're done. I won't write anymore until you're finished. 😄

Waiting...

Finished? Great! ONWARD.

I decided to use Ruby because Ruby is fun and great. Initially, I just wanted to knock out the first solution I could think of -- regardless of how inefficient or slow it was -- and come back to clean it up after.

# I stored the huge string in a separate file called
# `blob.txt` so as to not clutter my main script

blob = File.read('blob.txt')
letters = 'abcdefghijklmnopqrstuvwxyz_'.chars

def naive_message(letters, blob)
  letters
    .sort_by { |letter| blob.count(letter) }
    .reverse  # to get desc. order
    .join
    .split('_')
    .first
end

puts naive_message(letters, blob)
Enter fullscreen mode Exit fullscreen mode

I won't spoil the answer, but let's just say that I "pray, plead, beseech, urge" you to try it on your own. Possibly in the past tense. I think the markdown format of the initial text blob may have skewed the number of "_" characters, causing my answer to have slightly more characters on the end than should actually be there. UPDATE: That's totally what happened. Accurate text blob here.

My Solution (Part 2)

Anyways, I looked back at my solution and thought to myself, "Ryan, that doesn't look very performant. I have to imagine that running blob.count(letter) for each letter is the worst case performance for this scenario (27 'letters' * n chars in the blob, looping through the whole blob for each letter). It seems like it should be more efficient to do it the way Ben did it, which is by looping through the blob once and counting each letter along the way. So I tried that.

def efficient_message(letters, blob)
  # counter = {a: 0, b: 0, c: 0...}
  counter = letters.product([0]).to_h

  # run through blob once only
  blob.each_char { |c| counter[c] += 1 }

  # sort and trim off everything after _
  counter
    .sort_by(&:last)    # sort by the count
    .map(&:first)       # grab just the letter key into an array
    .reverse
    .join
    .split('_')
    .first
end
Enter fullscreen mode Exit fullscreen mode

Not as pretty, in my opinion, but hopefully faster. Ruby, being interpreted, is slower than most compiled langauges, so this should help. (So I thought...)

Comparing Performance

Was this optimization worth it? I needed to find out. Luckily Ruby comes with an awesome Benchmarking library built-in. (Oh Ruby, what is there that you can't do?)

require 'benchmark'

# ... My code above

Benchmark.bmbm do |bm|
  bm.report("Naive: ") { 1000.times { naive_message(letters, contents) } }
  bm.report("Efficient: ") { 1000.times { efficient_message(letters, contents) } }
end
Enter fullscreen mode Exit fullscreen mode

Benchmark has a method called bmbm that runs one trial run and then a second real run. This helps shake out any overhead performance drains from the garbage collector. And to my horror:

~\desktop> ruby .\secret_word.rb
Rehearsal -----------------------------------------------
Naive:        0.047000   0.000000   0.047000 (  0.039974)
Efficient:    0.484000   0.000000   0.484000 (  0.481631)
-------------------------------------- total: 0.531000sec

                  user     system      total        real
Naive:        0.031000   0.000000   0.031000 (  0.038011)
Efficient:    0.483000   0.000000   0.483000 (  0.478715)
Enter fullscreen mode Exit fullscreen mode

The "Efficient" version is ~10x slower than the "Naive" version! Noooo! "But, why?" you ask. "How can this be?" I had the same questions.

Algorithmic Profiling

Ruby has a built-in profiler, but a short Google search told me that there was a better option: ruby-prof. After a quick gem install ruby-prof, I was back at it again with the white vans. (Check out the Ruby-Prof Documentation to learn more).

require 'ruby-prof'

# ... The previous code

RubyProf.start
1000.times { naive_message(letters, contents) }
result = RubyProf.stop

RubyProf::FlatPrinter.new(result).print(STDOUT)

RubyProf.start
1000.times { efficient_message(letters, contents) }
result = RubyProf.stop

RubyProf::FlatPrinter.new(result).print(STDOUT)
Enter fullscreen mode Exit fullscreen mode

I added titles below for clarity.

Naive:
===================
Measure Mode: wall_time
Thread ID: 3259000
Fiber ID: 20752200
Total: 0.066000
Sort by: self_time

 %self      total      self      wait     child     calls  name
 50.00      0.033     0.033     0.000     0.000    27000   String#count
 18.18      0.059     0.012     0.000     0.047     1000   Enumerable#sort_by
 10.61      0.007     0.007     0.000     0.000   136000   Integer#<=>
 10.61      0.040     0.007     0.000     0.033     1000   Array#each
  3.03      0.002     0.002     0.000     0.000     1000   Array#reverse
  3.03      0.066     0.002     0.000     0.064     1000   Object#naive_message
  1.52      0.001     0.001     0.000     0.000     1000   Array#first
  1.52      0.001     0.001     0.000     0.000     1000   Array#join
  1.52      0.001     0.001     0.000     0.000     1000   String#split
  0.00      0.066     0.000     0.000     0.066        1   Global#[No method]
  0.00      0.066     0.000     0.000     0.066        1   Integer#times

* indicates recursively called methods

Efficient:
==============
Measure Mode: wall_time
Thread ID: 3259000
Fiber ID: 20752200
Total: 0.688000
Sort by: self_time

 %self      total      self      wait     child     calls  name
 93.60      0.644     0.644     0.000     0.000     1000   String#each_char
  2.04      0.025     0.014     0.000     0.011     1000   Enumerable#sort_by
  1.16      0.008     0.008     0.000     0.000   136000   Integer#<=>
  0.58      0.005     0.004     0.000     0.001     1000   Array#map
  0.44      0.688     0.003     0.000     0.685     1000   Object#efficient_message
  0.44      0.003     0.003     0.000     0.000     1000   Array#reverse
  0.44      0.003     0.003     0.000     0.000     1000   Array#product
  0.44      0.003     0.003     0.000     0.000     1000   Array#to_h
  0.29      0.003     0.002     0.000     0.001     1000   Hash#each
  0.29      0.002     0.002     0.000     0.000     1000   String#split
  0.15      0.001     0.001     0.000     0.000    27000   Array#last
  0.15      0.001     0.001     0.000     0.000    28000   Array#first
  0.00      0.688     0.000     0.000     0.688        1   Global#[No method]
  0.00      0.000     0.000     0.000     0.000     1000   Array#join
  0.00      0.688     0.000     0.000     0.688        1   Integer#times

* indicates recursively called methods
Enter fullscreen mode Exit fullscreen mode

From what I can tell, the String#count method is super-optimized, and String#each_char is a relatively expensive operation (it has to create an array the length of the blob!). So, in the long run, looping through the blob string a bunch of times using the faster String#count ends up being more performant. So much for going through the trouble to generate an efficient solution.

Wrap Up

Anyways, I hope you're able to get your heart rate back to normal after such a roller-coaster ride. Be sure to share your solution on Ben's post. Also, since this was originally a kind of interview code puzzle, if you're someone that has interviewed people, I would love any feedback on my solution or the surrounding presentation! Is it similar to what you would expect from an applicant, or am I missing something important?

Thanks for reading!


Originally posted on assert_not magic?

Top comments (21)

Collapse
 
bengreenberg profile image
Ben Greenberg

This is great! I can't even begin to describe my love for Ruby. It is such a clean language.

Collapse
 
imox2 profile image
Talha • Edited
    def give_me_word(text):
        ans=dict()
        the_text=""
        for i in text:
            if i in ans:
                ans[i]+=1
            else:
                ans[i]=1

        comparison=ans['_']
        sorted_list=sorted(ans.values(),reverse=True)
        rev_dict = dict((v,k) for k,v in ans.iteritems())

        for i in sorted_list:
            if(i>=comparison and rev_dict[i]!='_'):
                the_text=the_text+rev_dict[i]
            elif rev_dict[i]=='_':
                break

        return the_text

    text="the_text"

    print give_me_word(text)

Not fully optimized script to do the same in python

Collapse
 
rpalo profile image
Ryan Palo • Edited

Nice! Thanks for sharing. :) Have you seen the built-in Counter class? This seems like a perfect use-case for it.

from collections import Counter

blob = "..."
letters = "abcdefghijklmnopqrstuvwxyz_"

def message(letters, blob):
    hist = Counter(blob)    # Counter takes any iterable and makes
                            # a histogram of its items!

    sorted_letters = ''.join(x[0] for x in hist.most_common())
    return sorted_letters.split('_')[0]

print(message(letters, blob))
Collapse
 
rpalo profile image
Ryan Palo

Bonus! On my machine, this version is even 5 times faster than the best Ruby version!

cat test.py | python -m timeit
100000000 loops, best of 3: 0.00844 usec per loop
Thread Thread
 
imox2 profile image
Talha

This is great. Will use counter effectively. Thanks for showing me counter(). :-)

Collapse
 
edh_developer profile image
edh_developer

I wondered if I could do better performance-wise, based on knowledge of the problem that the more general built-in classes couldn't take advantage of. Ended up with this:

import string

longString="..."

def decode(msg):
    # Testing suggests this is a little faster than using defaultdict
    charCountHash = {}.fromkeys(string.ascii_lowercase,0)
    charCountHash['_'] = 0

    for ch in msg:
        charCountHash[ch] += 1

    c = charCountHash['_']
    resultHash = {charCountHash[k]: k for k in charCountHash.keys() \
        if charCountHash[k] > c }

    return ''.join( [ resultHash[index] for index in \
        sorted(resultHash.keys(), reverse=True) ] )

print(decode(longString))

Testing both implementations a few times suggests this way is faster, but only 1% or so.

Might try this in go, just for the practice...

Thread Thread
 
rpalo profile image
Ryan Palo

Nice! I haven't used fromkeys before. That's a nice alternative to defaultdict (and you don't have to import anything!) if you already know the keys you need.

Collapse
 
bengreenberg profile image
Ben Greenberg

Python is on my list of things to learn soon and can't help but remark how much it "looks" like Ruby without knowing much about it yet.

Collapse
 
imox2 profile image
Talha

"...without knowing much about it yet." This is great about python. Nice challenge this one. :-)

Collapse
 
buntine profile image
Andrew Buntine

Nice post. It's an interesting look at how algorithmic complexity can be trumped by real-world details.

But I think you'll find that the performance is not exactly in the each_char (which is probably lazily evalutated) but is caused by:

  • The setup/teardown (stack frames, etc) for the block that is passed to each_char
  • The repeated hash lookups (although they are supposed to be O(1))
Collapse
 
rpalo profile image
Ryan Palo

Thanks, that’s super interesting! Good to consider for future block usage. I can’t think of a way to speed up the block setup and tear down, but I wonder if I’d pick any performance up by switching the hash keys to symbols. I’ll have to do some ‘sperimenting 👍🏻

Collapse
 
tobias_salzmann profile image
Tobias Salzmann • Edited

Obvious (!!!) question to ask: What if the text does not fit into memory?

Distributed version in Scala with Spark:

def message(
  distributedBlob: RDD[Char], 
  chars: Set[Char] =  ('a' to 'z').toSet + '_'
)(implicit sc: SparkContext): String = {
  val frequencies = distributedBlob
    .filter(chars.contains)
    .countByValue()
  frequencies
    .toSeq
    .sortBy(- _._2)
    .map(_._1)
    .mkString
    .takeWhile('_' !=)
}
Collapse
 
mukeshbhardwaaj profile image
Mukesh Bhardwaj

Thanks for the amazing community. The way you welcome your new users are amazing. Hey Wait did you check this Solarmovies alternatives Thanks

Collapse
 
der_gopher profile image
Alex Pliutau

Interesting challenge. I want to see how it can be gracefully implemented in Go, even it will be more code than Ruby, but it can be elegant. github.com/plutov/practice-go/tree...

Collapse
 
techpout profile image
TechPout

I am deeply attracted by your post. It is really a nice and informative one. I will recommend it to my friends. Also, you can read this: Putlocker

Collapse
 
der_gopher profile image
Alex Pliutau

Here is solution for same problem in Go - github.com/plutov/practice-go/blob...

Collapse
 
nimuvea1 profile image
Nimuvea

Thanks for sharing this article. Also see the thetechbytes.net.

Collapse
 
jhoncarter profile image
Jhon Carter

Thank you for telling us but can you tell me how to add images in footer of this site: youtech.ooo/

Some comments may only be visible to logged-in visitors. Sign in to view all comments.