A new mission. We need to watch all of the Stephen King adaptations in order of release. It's all on https://stephenking.com/works/movie/index.html but I want to pull each title into a spreadsheet. So I'm going to knock up a quick script to extract all of the movies in order.
The markup on the page is quite nice.
<div class="works-inner">
<a href="/works/movie/carrie.html" class="row work" data-date="1976-0-03, " data-sort="Carrie">
<div class="col-12 col-sm-6 works-title">Carrie</div>
<div class="col-6 col-sm-3 works-type">Movie</div>
<div class="col-6 col-sm-3 works-date">November 03rd, 1976</div>
</a>
<a href="/works/movie/shining.html" class="row work" data-date="1980-0-23, " data-sort="Shining, The">
<div class="col-12 col-sm-6 works-title">The Shining</div>
<div class="col-6 col-sm-3 works-type">Movie</div>
<div class="col-6 col-sm-3 works-date">May 23rd, 1980</div>
</a>
</div>
I can grab the markup quite easily.
require 'open-uri'
# Kernel#open no longer fetches URLs on Ruby 3.0+, so call URI.open explicitly
html = URI.open('https://stephenking.com/works/movie/index.html').read
Ok, but I want to parse it. I can use Nokogiri to extract the data I need.
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://stephenking.com/works/movie/index.html'))
We can use the .css method to extract every element matching a CSS selector.
doc.css('.work')
Each link has the class work, and inside it each piece of data we want sits in a div with its own convenient class.
doc.css('.work').map do |w|
[
w.css('.works-title')[0].content,
w.css('.works-date')[0].content
]
end
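That's great, but I want to sort by release date. Ruby copes with dates and times on its own: Date.parse comes from the standard library (require 'date'). Rails' active_support layers some handy convenience extensions on top, such as the friendlier console output below.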
> Date.parse('November 03rd, 1976')
=> Wed, 03 Nov 1976
Not every value is actually a date.
> Date.parse('TBD')
Traceback (most recent call last):
(irb):10:in `parse': invalid date (Date::Error)
We can use an inline rescue. This is horrid, because it swallows every error and not just the bad date, but this is a quick script.
> Date.parse('TBD') rescue nil
=> nil
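If the blanket rescue feels too horrid, a slightly narrower version might look like the sketch below. It only rescues the errors Date.parse is expected to raise for bad input. The parse_release_date helper is a hypothetical name, not part of the script.

```ruby
require 'date'

# Hypothetical helper: parse a release date, returning nil for non-dates
def parse_release_date(text)
  Date.parse(text)
rescue ArgumentError, TypeError
  # Date::Error is a subclass of ArgumentError; TypeError covers a nil input
  nil
end

p parse_release_date('November 03rd, 1976')
p parse_release_date('TBD')
```

Any other exception, such as a typo in a method name, still surfaces instead of being silently turned into nil.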
We can then sort our records by the date. Here's the full script. It works a treat.
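require 'active_support'
require 'nokogiri'
require 'open-uri'
require 'time' # provides Time.parse

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open('https://stephenking.com/works/movie/index.html'))
# Build [title, release time] pairs; non-dates such as "TBD" become nil.
# Assigning first matters: `puts x.map do ... end` would hand the block to puts.
works = doc.css('.work').map do |w|
  [
    w.css('.works-title')[0].content,
    (Time.parse(w.css('.works-date')[0].content) rescue nil)
  ]
end
# Sort by release, treating undated entries as "now", then print the titles
puts works.sort_by { |a| a[1] || Time.now }.map { |a| a[0] }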
Carrie
The Shining
Creepshow
Cujo
The Dead Zone
Christine
Children of the Corn
Cat's Eye
Silver Bullet
Maximum Overdrive
Stand By Me
Creepshow 2
The Running Man
Pet Sematary (1989)
Tales from the Darkside: The Movie
Graveyard Shift
Misery
Sleepwalkers
The Dark Half
Needful Things
The Shawshank Redemption
The Mangler
Dolores Claiborne
Thinner
The Night Flier
Apt Pupil
The Green Mile
Hearts in Atlantis
Dreamcatcher
Secret Window
Riding the Bullet
1408
The Mist
Dolan's Cadillac
Mercy
A Good Marriage
Cell
My Pretty Pony
The Dark Tower
IT - Part 1: The Losers' Club
Gerald's Game
1922
Pet Sematary (2019)
IT: Chapter Two
In the Tall Grass
Doctor Sleep
Firestarter
Mr. Harrigan's Phone
The Girl Who Loved Tom Gordon
Hearts
Suffer the Little Children
Salem's Lot
I guess we're starting with Brian De Palma's Carrie.
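Since the aim was to get the titles into a spreadsheet, the sorted pairs could also be written out as CSV. This is a minimal sketch using the standard csv library and a few hypothetical sample rows in the same [title, date] shape the script builds. The column names and the choice to put undated entries last are my assumptions.

```ruby
require 'csv'
require 'date'

# Hypothetical sample rows in the [title, date-or-nil] shape the script builds
rows = [
  ['The Shining', Date.new(1980, 5, 23)],
  ["Salem's Lot", nil],
  ['Carrie', Date.new(1976, 11, 3)]
]

# Split off undated rows so nil never takes part in the date comparison
dated, undated = rows.partition { |_title, date| date }
sorted = dated.sort_by { |_title, date| date } + undated

# Build the CSV text: a header row, then one line per film
csv_text = CSV.generate do |csv|
  csv << %w[title released]
  sorted.each { |title, date| csv << [title, date&.iso8601] }
end

puts csv_text
```

Writing csv_text to a file with File.write (under whatever filename suits) gives something any spreadsheet application can open.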