WebDriver has been a persistently alluring technology since I discovered it a couple of years ago. However, regular HTTP clients have always been sufficient for my needs.
Recently, I wanted to pull some data off the local church website, but I could not log in with any HTTP client. So I threw Etaoin at the problem, and it worked marvelously.
You will need a username and password for a congregate site to follow along. I suspect the routes will be identical.
;; Getting started
(require '[etaoin.api :as api])
(def base-url "https://mysite.com")
(def user "...")
(def pass "...")
(def ff (api/firefox))
Logging in is easy.
(api/go ff (str base-url "/members/login/"))
(api/fill-multi ff {:username user :password pass})
(api/submit ff {:id "password"})
After submitting the login form, I am unsure how to verify that the member landing page has loaded. So, for now, I advise just waiting a few seconds. If you're following along, you will have the browser in front of you and can "eyeball" it. I would appreciate any suggestions for improvement here.
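One possible improvement, offered as a minimal sketch rather than a tested recipe: poll the browser's current URL until it is no longer the login route. This assumes a successful login redirects away from /members/login/, which I have not confirmed for every site. The names left-login-page? and wait-for-login are hypothetical helpers; api/wait-predicate and api/get-url are Etaoin functions.

```clojure
(require '[etaoin.api :as api]
         '[clojure.string :as str])

;; Hypothetical helper: true once the URL is no longer the login route.
;; Assumption: a successful login redirects away from /members/login/.
(defn left-login-page? [url]
  (not (str/includes? url "/members/login")))

;; Hypothetical helper: poll every half second, for up to 15 seconds,
;; until the driver reports a URL outside the login route.
(defn wait-for-login [driver]
  (api/wait-predicate #(left-login-page? (api/get-url driver))
                      {:timeout 15
                       :interval 0.5
                       :message "Still on the login page"}))

(comment
  ;; Usage, right after submitting the login form:
  (wait-for-login ff))
```

If the login fails and the browser stays on the login page, wait-predicate throws once the timeout elapses, which at least makes the failure visible instead of letting later steps run against the wrong page.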
We're in, so what now?
Well, I have trouble remembering names and faces. What if I had a flashcard system to help me memorize them? We can build that from the directory.
Let's navigate to the directory page and inspect it before proceeding.
(api/go ff (str base-url "/members/directory"))
It looks like each directory element is identifiable by the album class.
Let's dig into an album tag.
<div class="album">
  <a href="/members/directory/family/XXX">
    <span class="album-img">
      <img src="image-url" alt="...">
    </span>
    <span class="album-title">Doe, John</span>
  </a>
</div>
We'll need a couple of functions. One takes an album and grabs the text of its child tag with class album-title, and the other grabs the image source.
(require '[clojure.string :as s])
(defn get-album-title [album-entry]
  (->> {:class "album-title"}
       (api/child ff album-entry)
       (api/get-element-text-el ff)))
(defn get-album-image [album-entry]
  (as-> album-entry $
    (api/child ff $ {:tag "img"})
    (api/get-element-attr-el ff $ :src)
    (s/replace $ #"\?.*" "")
    (str base-url $)))
This code may look familiar because it is similar to the kind of web-scraping you would do with a regular HTTP client. If not, I've got you covered.
album-entry represents a DOM element like the <div class="album"> tag we inspected earlier, children and all. Call the child function to get the sub-element we want, and then finally, a get-element-<thing> function returns the string we need.
Let's put it together.
(->> {:class "album"}
     (api/query-all ff)
     (mapv (juxt get-album-title get-album-image)))
;=>
[["Doe, John" "path/to/image.jpg"]
 ["Doe, Jane" "path/to/image2.jpg"] ...]
At this point, I am beginning to lose interest. But I like having options. So, let's convert this to JSON and print it to the console. You can see the project here.
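The JSON step could look something like the sketch below. It uses Cheshire, on the assumption that it is available on the classpath (I believe Etaoin already depends on it, but check your own project). The ->card helper and the :name and :image keys are hypothetical names chosen for illustration, and entries is sample data in the same shape as the scraped result, not real output.

```clojure
(require '[cheshire.core :as json])

;; Hypothetical helper: turn one [title image-url] pair into a flashcard map.
(defn ->card [[title image]]
  {:name title
   :image image})

;; Sample data shaped like the scraped pairs; swap in the real result.
(def entries
  [["Doe, John" "path/to/image.jpg"]
   ["Doe, Jane" "path/to/image2.jpg"]])

;; Serialize the cards as a JSON array and print it to the console.
(println (json/generate-string (mapv ->card entries) {:pretty true}))
```

Maps with named keys are a little easier for a later flashcard front end to consume than bare pairs, though plain vectors would serialize just as well.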
Perhaps I will revisit and finish the project in another post.