Intro
A website needs to be able to retrieve data, such as information on a set of products or users. To do this, many websites make requests to APIs, which in turn pull from a database or some other backend service. Today I'm going to talk about how I scraped JSON data from an undocumented API in order to integrate that data into my own application.
The website
Currently I am working on an application that allows users to create a grocery list (here), as well as compare items. For this I need several pieces of data: the name and price of each product, its nutrition information and ingredients, and images of the product and/or its packaging. Conveniently, I found all of this information in the API at https://www.bakersplus.com/atlas/v1.
Retrieving the data
[Screenshot: preliminary request to get product ids]
[Screenshot: secondary request that returns product data]
Looking at the web traffic, there are two requests that matter. The first one accesses the endpoint /search/v1/products-search, which takes the parameters filter.query and page.size. This endpoint returns product id numbers in a JSON object.
h = {"User-Agent": user_agent, "Accept":"*/*", "Host":"www.bakersplus.com", "x-laf-object":json.dumps(x_laf_obj)}
def getProductIds(query, num):
#query = "rice"
ids = []
req = requests.get(base_url+"/search/v1/products-search?filter.query="+query+"&page.size="+str(num), headers=h)
The second request is made to the /product/v2/products endpoint. An array of the product numbers (filter.gtin13s) is passed as a URL parameter. The response contains all the information we need, along with some extra info we don't want, which gets filtered out via a helper function (sketched below).
def getProductInfo(ids):
    p = {"filter.gtin13s": []}
    for i in ids:
        p["filter.gtin13s"].append(i)
    req = requests.get(base_url + "/product/v2/products", headers=h, params=p)
    # req.json() holds the full product data; a helper then strips what we don't need
    return req.json()
Rate limiting/Request restrictions
APIs, especially when publicly exposed, will limit the number of consecutive requests. In my experience with the example above, requests are more likely to time out if the same search is made repeatedly. In addition, this API required certain HTTP headers that others did not. One of these was called x-laf-object, which appeared to be some kind of location-tracking object.
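Nothing elaborate is needed to stay on the right side of the limit; a pause-and-retry wrapper along these lines (the retry count and delay here are arbitrary, tune them to taste) gets around the timeouts:

import time
import requests

def getWithRetry(url, headers, params=None, tries=3, delay=2.0):
    # GET with a small linear backoff so repeated searches don't time out
    for attempt in range(tries):
        try:
            resp = requests.get(url, headers=headers, params=params, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.exceptions.Timeout:
            pass  # swallow the timeout and back off before retrying
        time.sleep(delay * (attempt + 1))
    return None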
Conclusion
Many websites don't make it this easy to scrape their data and effectively bypass their frontend. However, if the data these services provide is public anyway, and the API is effectively rate-limited and protected from abuse, then there isn't really a reason not to build a solution that is modular and easier to fix/debug.
Livestreams/VODs of dev: YouTube
Code: GitHub