PWC is back, with forever an asterisk where #232 should be. Best wishes for full recovery to our intrepid host and benevolent overlord.
This week's challenges have something in common: they require us to organize data by a trait that is not directly in the data. All software problems, of course, can be solved by adding a level of indirection, and that applies here. Supplementing data with an invented data structure on the side is a common programming strategy. It also reminds me of the Schwartzian transform, and maybe that will be useful this week.
Task 1: Similar Words
You are given an array of words made up of alphabets only.
Write a script to find the number of pairs of similar words.
Two words are similar if they consist of the same characters.
Example 1
Input:
@words = ("aba", "aabb", "abcd", "bac", "aabc")
Output:2
Pair 1: similar words ("aba", "aabb")
Pair 2: similar words ("bac", "aabc")Example 2
Input:
@words = ("aabb", "ab", "ba")
Output:3
Pair 1: similar words ("aabb", "ab")
Pair 2: similar words ("aabb", "ba")
Pair 3: similar words ("ab", "ba")Example 3
Input:
@words = ("nba", "cba", "dba")
Output:0
Thoughts
First, I notice that the task requires us to print a count of pairs, not the pairs themselves. Counting pairs is a specific case of a general combinatorial problem; how many ways can we choose k things at a time from a set of n? In general, choosing k from n is n!/(n-k)!•k!; when k is 2, it reduces to n•(n-1)/2. Let's tuck that away for a moment.
To see if words are similar by the task definition, we will need to find the signature of a word that identifies which letters it contains. We'll need to decide what "made up of alphabets only" really means. A diligent developer would consider at least Unicode, case folding and things like whether n and ñ are "similar". But we are Perl programmers, and our defining virtue is (say it with me) laziness, so we choose "alphabets only" to mean lower-case English alphabet a to z.
I can think of a couple of ways to make a word signature.
- We can split the word into its characters; then sort that list of characters, remove duplicates, and then join the now-sorted characters into a string. For instance "joinuniqsortsplit" becomes "ijlnopqrstu".
- Given that we have only the 26 lowercase letters of the English alphabet, we have a set that can be efficiently represented by a bit vector. If we let bit 0 correspond to a and bit 25 to z, we can set the bits to 1 for letters that are present, and each such set creates an integer between 1 ("a", "aa", "aaa" and so on) and 2^26-1 ("abcdefghijklmnopqrstuvwxyz" and any other string that contains all the letters). For instance, "joinuniqsortsplit" becomes 0x01feb00, or 2091776.
zy xwvu tsrq ponm lkji hgfe dcba
00 0001 1111 1110 1011 0000 0000
The bit vector solution is kind of fun; it's probably what I would do in a C/C++ solution. But Perl, so the first one.
So what data do we need? How about this: for every unique word signature, a list of words that have that signature. For example 1, that would look like:
my %similar = (
ab => [ "aba", "aabb" ],
abc => [ "bac", "aabc" ],
abcd => [ "abcd" ],
);
Because we only need a count of pairs, we don't really need to preserve the list of words -- we could map the signatures to a count. But let's roll with it for now and optimize later if we must (say it with me -- "laziness").
So, on to our solution. There are two parts. One is a function that reduces a word to its signature, using uniq
from List::Util to reduce a sorted list to single elements:
sub wordHash($word)
{
join "", uniq sort { $a cmp $b } split(//, $word)
}
Why did I name this "wordHash"? Reducing a word to its signature reminded me of reducing a complex data structure to a single value which could be used as a key in a hash table. The same sort of problem arises in serializing and de-serializing data, for instance in Java remote procedure calls. I should probably change the name from "hash" to "signature" or something more directly applicable to this problem. Naming things is one of the two hardest problems in computer science; the others are cache invalidation and off-by-one errors.
The other part is to create a hash of similar words, and then iterate over it to find the pair counts:
sub similarWords(@wordList)
{
# Create a hash mapping signatures to words that have that signature
my %similar;
for my $word ( @wordList )
{
# Build an array of words via reference
push @{$similar{wordHash($word)}}, $word;
}
my $pairCount = 0;
for my $signature ( keys %similar)
{
my $n = scalar $similar{$signature}->@*;
next unless $n > 1;
$pairCount += $n*($n-1) / 2;
}
return $pairCount;
}
Here we retrieve the combinatorial pair-counting from where we tucked it away earlier to add up the sum count. Getting $n
is perhaps a little weird, so let's break it down:
-
$similar{$signature}
retrieves the value from the hash. The values stored are array references. -
$similar{$signature}->@*
is post-fix dereferencing, which is a scary way to say "the actual array" -
scalar ...
when applied to an array yields the size of the array.
One more thing that might be considered is what should happen if there are duplicate words with the same signature. Suppose our list is ("ab", "ab", "ab")
. Is that three unique words and therefore three pairs, or is it just one and therefore no pairs? What to do, what to do? Hint: laziness.
Task 2: Frequency Sort
You are given an array of integers.
Write a script to sort the given array in increasing order
based on the frequency of the values. If multiple values
have the same frequency then sort them in decreasing order.
Example 1
Input:
@ints = (1,1,2,2,2,3)
Ouput:(3,1,1,2,2,2)
'3' has a frequency of 1
'1' has a frequency of 2
'2' has a frequency of 3
Example 2
Input:
@ints = (2,3,1,3,2)
Ouput:(1,3,3,2,2)
'2' and '3' both have a frequency of 2, so they are sorted in decreasing order.
Example 3
Input:
@ints = (-1,1,-6,4,5,-6,1,4,1)
Ouput:(5,-1,4,4,-6,-6,1,1,1)
Thoughts
Hmm, a problem of multiple sort criteria. Let's restate it.
Put things in order. The things that only occur once should come first, then the things that occur twice, and so on. Within the groups, the numbers should be in descending order.
More than thirty years ago, Stephen Covey told us in The 7 Habits of Highly Effective People that we should begin with the end in mind. And since many people have called me a highly effective person ... hold on, I'm being told the word they used was "defective." Anyway.
In the end, we're going to need to determine the frequency of each element in the list. And if we had a table of frequencies, it would be very easy to sort by frequency. To make such a table, it's not hard to iterate over the list and increment a count in a hash, but it's even easier to use a module: the frequency
function from List::MoreUtils
.
Given the frequency table, then implementing the two-part sort is easy, if somewhat idiomatic.
sub frequencySort(@list)
{
use List::MoreUtils qw/frequency/;
my %freq = frequency(@list);
[ sort { $freq{$a} <=> $freq{$b} || $b <=> $a} @list ]
}
The sort criteria are two parts separated by ||
. If the first condition determines the order, that's that, and the second condition isn't even considered (shortcut evaluation). But if they compare equal on frequency, then the <=>
operator returns 0
, which is false in Perl. Therefore, the second condition does get evaluated. The second condition is a simple numeric comparison, but $b
and $a
have been switched so that it becomes a descending sort.
Finally, putting the whole thing in []
brackets creates a reference to an anonymous array. Because that's the last statement in the subroutine, it becomes the return value. Shame on me for not being explicit with a return
keyword. Or ending the statement with a semicolon. Did I mention "laziness"?
Top comments (0)