This was originally posted on my blog a while ago.
I was given a fairly mundane task at work, and when this happens I try to find ways to make it interesting. I will usually find a way to learn something new while doing the task, so I can always get something out of the work I do, even if it's dull.
I would quite like to get better at using the legendary tools available to me on the command line. I have never felt super confident doing text processing there, often giving up and just going to my editor instead.
I am not going to exhaustively document every command I use; if you're curious about the details, consult man $command just like I did! This post just illustrates how, by doing a little research, you can quickly do some rad text processing.
The task
I am trying to analyse how much memory we are provisioning compared to actual usage for some of our services to see if our AWS bill can be cut down a bit.
The data
I thought it would be fun (??) to use CSV, as it feels like CSV-shaped data. Here is a sample of the data I captured.
name, usage (mb), allocated (mb), CPU %, containers
dispatcher, 150, 512, 40, 10
assembler, 175, 512, 75, 10
matcher, 85, 512, 15, 10
user-profile, 128, 512, 40, 5
profile-search, 220, 512, 80, 10
reporter, 90, 512, 40, 10
mailgun-listener, 90, 512, 10, 5
unsubscribe, 64, 512, 3, 5
bounce, 8, 128, 0.5, 3
legacy-reporting, 30, 512, 15, 3
content-store, 80, 256, 30, 10
legacy-alert-poller, 64, 256, 1, 1
migrator, 80, 256, 10, 5
entitlements-update, 150, 256, 70, 3
Display it nice
This is nice and easy:

column -s, -t < data.csv

-t determines the number of columns the input contains and lays it out as a table, and -s specifies a set of characters to delimit by. If you don't specify -s, it defaults to using space.
name                 usage (mb)  allocated (mb)  CPU %  containers
dispatcher           150         512             40     10
assembler            175         512             75     10
matcher              85          512             15     10
user-profile         128         512             40     5
profile-search       220         512             80     10
reporter             90          512             40     10
mailgun-listener     90          512             10     5
unsubscribe          64          512             3      5
bounce               8           128             0.5    3
legacy-reporting     30          512             15     3
content-store        80          256             30     10
legacy-alert-poller  64          256             1      1
migrator             80          256             10     5
entitlements-update  150         256             70     3
Sorting by usage
cat data.csv | sort -n --field-separator=',' --key=2 | column -s, -t
name                 usage (mb)  allocated (mb)  CPU %  containers
bounce               8           128             0.5    3
legacy-reporting     30          512             15     3
legacy-alert-poller  64          256             1      1
unsubscribe          64          512             3      5
content-store        80          256             30     10
migrator             80          256             10     5
matcher              85          512             15     10
mailgun-listener     90          512             10     5
reporter             90          512             40     10
user-profile         128         512             40     5
dispatcher           150         512             40     10
entitlements-update  150         256             70     3
assembler            175         512             75     10
profile-search       220         512             80     10
--key=2 means sort by the second column.
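If you prefer short flags, this should be equivalent:

# -t sets the field separator, -k picks the sort key, -n compares numerically
sort -t ',' -k 2 -n data.csv | column -s, -t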
Using awk to figure out the memory differences
What we're really interested in is the difference between the amount of memory provisioned and the amount actually used.
awk -F , '{print $1, $3-$2}' data.csv
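Breaking that awk invocation down:

# -F ,       use a comma as the field separator
# $1         the first field (the service name)
# $3 - $2    allocated memory minus actual usage
awk -F , '{print $1, $3-$2}' data.csv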
Let's pipe that into column again:
awk -F , '{print $1, $3-$2}' data.csv | column -t
name                 0
dispatcher           362
assembler            337
matcher              427
user-profile         384
profile-search       292
reporter             422
mailgun-listener     422
unsubscribe          448
bounce               120
legacy-reporting     482
content-store        176
legacy-alert-poller  192
migrator             176
entitlements-update  106
This is nice but it would be good to ignore the first line.
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2
tail -n X prints the last X lines; the plus inverts it so it's the first X lines.
Sort mk 2
Now that we have some memory differences, it would be handy to sort them so we can address the most inefficient configurations first.
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2
And of course use column again to make it look pretty:
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2 | column -t
legacy-reporting     482
unsubscribe          448
matcher              427
reporter             422
mailgun-listener     422
user-profile         384
dispatcher           362
assembler            337
profile-search       292
legacy-alert-poller  192
migrator             176
content-store        176
bounce               120
entitlements-update  106
WTF
There it is! The utterly indecipherable bash command that someone reads 6 months later and scratches their head at. In fact it has only been 2 weeks since I wrote the first draft of this, and I already look at the final command and weep.
It is very easy to throw up your hands when you see a shell script that doesn't make sense, but there are things you can do.

Remember that the process usually starts small, like it did here: one command, piped into another, into another. This gives a lazy dev like me the perception that it's one complicated command, but all it really is is a set of steps to process some data. So if you're struggling, you can wind the pipeline back by deleting some of the steps and seeing what happens.
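For example, the final pipeline can be wound back a stage at a time; running each line on its own shows exactly what that stage contributes:

awk -F , '{print $1, $3-$2}' data.csv                                    # raw differences per service
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2                       # drop the header row
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2  # biggest gap first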
If it is an important business process that needs to be understood for a long time, you're probably better off writing it in a language where you can write some automated tests around it.
But a lot of the work we do is ad-hoc and doesn't reside in "the codebase", where you can easily get into a TDD rhythm to accomplish something shiny. Often you have to do something a little boring, and sometimes the tools available on your computer can really help you out. They're so old and well established that you can find tons of documentation and tips, so dive in!
If you need more reasons to get to know the shell better, read how command-line tools can be up to 235x faster than your Hadoop cluster.
Top comments (5)
Cool command! You could remove the header bit with awk itself, though.
An awk program is a list of condition { command } pairs. By default the condition is empty, which matches every line (so the command is run on every line). You have access to different variables (in both the condition and the command, I think), one of them being NR for the line number. So if you set the condition to NR!=1, it'll run on every line other than line 1 (which is the first one, i.e. the header), and your command can be shortened by moving the header filtering into awk.
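Presumably something like this:

# NR!=1 skips the header line, so the tail -n +2 step is no longer needed
awk -F , 'NR!=1 {print $1, $3-$2}' data.csv | sort -r -n --key=2 | column -t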
:smile:
I learned a bunch otherwise, so thanks!
Thanks for the improvements!
column -s, -t < data.csv

Not applicable to general CSV.
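That's because column just splits on every comma and knows nothing about CSV quoting; a made-up row with a quoted field shows the problem:

# the comma inside the quotes is treated as a delimiter too
printf 'name,note\nreporter,"uses 90mb, sometimes more"\n' | column -s, -t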
I've never used column before. That's pretty cool.

As far as the WTFness of it all goes, if you split it up and assign variables with nice names it'll be easy enough to understand. There's something about shell scripts, though, that makes people write things as tersely as possible. I'd end up with something like that and refer back to it in my history if I needed it again soon after, but if I made it into a script I'd either comment the hell out of it or split it into parts, I think.
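A sketch of what that might look like, with made-up function names for each stage:

# each stage of the one-liner gets a descriptive name
memory_diff()   { awk -F , '{print $1, $3-$2}' "$1"; }  # service name and unused memory
drop_header()   { tail -n +2; }
sort_by_waste() { sort -r -n --key=2; }

memory_diff data.csv | drop_header | sort_by_waste | column -t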
Bugfix: if it did that, it'd be head :) The plus means take the offset from the start of the file instead of the end.

May I suggest csvkit as a means to work with CSV data at the command line? It has several nifty tools for handling CSV files, including getting rid of the column headers, splitting and merging tables by columns, and even converting Excel xlsx tables to CSV. I would recommend giving it a try if you have to deal with these types of files at the command line.
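For example (assuming csvkit is installed and its tools behave as documented), sorting by the usage column and pretty-printing might look roughly like:

# csvsort understands CSV quoting; -c 2 sorts by the second column, csvlook renders a table
csvsort -c 2 data.csv | csvlook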