Intro
You can use the official Google Play Developer API, which has a default limit of 200,000 requests per day and 60 requests per hour for retrieving the list of reviews and individual reviews, which is roughly 1 request per minute.
You can also use a complete third-party Google Play Store App scraping solution, such as google-play-scraper for Python (no external dependencies) or google-play-scraper for JavaScript. Third-party solutions are usually used to get around the quota limit.
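For instance, a minimal sketch using the Python google-play-scraper package (assuming its current app() and reviews() interface; not required for the rest of this post):

from google_play_scraper import app, reviews, Sort

# basic app info as a Python dictionary
result = app("com.nintendo.zara", lang="en", country="us")
print(result["title"], result["score"])

# first batch of reviews, newest first; continuation_token can be passed back in to paginate
result, continuation_token = reviews("com.nintendo.zara", lang="en", country="us", sort=Sort.NEWEST, count=100)
print(len(result))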
You don't really need to read this post unless you want a step-by-step explanation that works without browser automation such as playwright or selenium, since you can look at the Python google-play-scraper regex solution to see how it scrapes app results and how it scrapes review results.
This ongoing blog post is meant to give you an idea, with actual step-by-step examples, of how to scrape a Google Play Store app page using beautifulsoup and regular expressions so you can create something on your own.
What will be scraped
Prerequisites
Separate virtual environment
In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist on the same system, thus preventing library or Python version conflicts.
If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar.
📌Note: this is not a strict requirement for this blog post.
Install libraries:
pip install requests lxml beautifulsoup4
Reduce the chance of being blocked
There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping; it covers eleven methods to bypass blocks from most websites. Only the user-agent header, which is the easiest method, is covered in this blog post.
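As a small, optional extension of the user-agent method, you could rotate between a few user-agent strings instead of hard-coding one (an illustrative sketch; the strings below are just example values):

import random
import requests

# a small hand-picked pool of user-agent strings (example values)
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36",
]

# pick a different user-agent on every run/request
headers = {"user-agent": random.choice(user_agents)}
response = requests.get("https://play.google.com/store/apps/details", params={"id": "com.nintendo.zara"}, headers=headers, timeout=30)
print(response.status_code)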
Full Code
from bs4 import BeautifulSoup
import requests, lxml, re, json
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
    "id": "com.nintendo.zara",  # app name
    "gl": "US",                 # country of the search
    "hl": "en_GB"               # language of the search
}
def google_store_app_data():
    html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    # where all app data will be stored
    app_data = {
        "basic_info": {
            "developer": {},
            "downloads_info": {}
        },
        "user_comments": []
    }

    # index [11] holds the basic app information
    # https://regex101.com/r/zOMOfo/1
    basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>",
                                           str(soup.select("script")[11]), re.DOTALL)[0])

    # https://regex101.com/r/6Reb0M/1
    additional_basic_info = re.search(fr"<script nonce=\"\w+\">AF_initDataCallback\(.*?(\"{basic_app_info.get('name')}\".*?)\);<\/script>",
                                      str(soup.select("script")), re.M | re.DOTALL).group(1)

    app_data["basic_info"]["name"] = basic_app_info.get("name")
    app_data["basic_info"]["type"] = basic_app_info.get("@type")
    app_data["basic_info"]["url"] = basic_app_info.get("url")
    app_data["basic_info"]["description"] = basic_app_info.get("description").replace("\n", "")  # replace new line characters with nothing
    app_data["basic_info"]["application_category"] = basic_app_info.get("applicationCategory")
    app_data["basic_info"]["operating_system"] = basic_app_info.get("operatingSystem")
    app_data["basic_info"]["thumbnail"] = basic_app_info.get("image")
    app_data["basic_info"]["content_rating"] = basic_app_info.get("contentRating")
    app_data["basic_info"]["rating"] = round(float(basic_app_info.get("aggregateRating").get("ratingValue")), 1)  # 4.287856 -> 4.3
    app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
    app_data["basic_info"]["price"] = basic_app_info["offers"][0]["price"]
    app_data["basic_info"]["developer"]["name"] = basic_app_info.get("author").get("name")
    app_data["basic_info"]["developer"]["url"] = basic_app_info.get("author").get("url")

    # https://regex101.com/r/C1WnuO/1
    app_data["basic_info"]["developer"]["email"] = re.search(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", additional_basic_info).group(0)

    # https://regex101.com/r/Y2mWEX/1 (a few matches, but re.search always returns the first occurrence)
    app_data["basic_info"]["release_date"] = re.search(r"\d{1,2}\s[A-Z-a-z]{3}\s\d{4}", additional_basic_info).group(0)

    # https://regex101.com/r/7yxDJM/1
    app_data["basic_info"]["downloads_info"]["long_form_not_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(1)
    app_data["basic_info"]["downloads_info"]["long_form_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(2)
    app_data["basic_info"]["downloads_info"]["as_displayed_short_form"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(4)
    app_data["basic_info"]["downloads_info"]["actual_downloads"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(3)

    # https://regex101.com/r/jjsdUP/1
    # [2:] skips 2 PEGI logo thumbnails and extracts only app images
    app_data["basic_info"]["images"] = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", additional_basic_info)[2:]

    try:
        # https://regex101.com/r/C1WnuO/1
        app_data["basic_info"]["video_trailer"] = "".join(re.findall(r"\"(https:\/\/play-games\.\w+\.com\/vp\/mp4\/\d+x\d+\/\S+\.mp4)\"", additional_basic_info)[0])
    except IndexError:
        app_data["basic_info"]["video_trailer"] = None

    # User reviews
    # https://regex101.com/r/xDVZq7/1
    user_reviews = re.findall(r'Write a short review.*?<script nonce="\w+">AF_initDataCallback\({key:.*data:\[\[\[\"\w.*?\",(.*?)sideChannel: {}}\);<\/script>',
                              str(soup.select("script")), re.DOTALL)

    # https://regex101.com/r/D6BIBP/1
    # [::3] grabs every 3rd match to avoid duplicate avatars
    avatars = re.findall(r",\"(https:.*?)\"\].*?\d{1}", str(user_reviews))[::3]

    # https://regex101.com/r/18EziQ/1
    ratings = re.findall(r"https:.*?\],(\d{1})", str(user_reviews))

    # https://regex101.com/r/mSku7n/1
    comments = re.findall(r"https:.*?\],\d{1}.*?\"(.*?)\",\[\d+,\d+\]", str(user_reviews))

    for comment, rating, avatar in zip(comments, ratings, avatars):
        app_data["user_comments"].append({
            "user_avatar": avatar,
            "user_rating": rating,
            "user_comment": comment
        })

    print(json.dumps(app_data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    # https://stackoverflow.com/a/17533149/15164646
    # reruns the script if `basic_app_info` or `additional_basic_info` throws an exception due to <script> position changes
    while True:
        try:
            google_store_app_data()
        except (IndexError, AttributeError):
            pass
        else:
            break
Code explanation
Import libraries:
from bs4 import BeautifulSoup
import requests, lxml, re, json
- BeautifulSoup and lxml to parse HTML.
- requests to make a request to the website.
- re to match the parts of the HTML where the needed data is located via regular expressions.
- json to convert parsed data from JSON to a Python dictionary, and for pretty printing.
Create global request headers and search query params:
# user-agent headers to act as a "real" user visit
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
# search query parameters
params = {
    "id": "com.nintendo.zara",  # app name
    "gl": "US",                 # country of the search
    "hl": "en_GB"               # language of the search
}
- user-agent is used to pretend that the visit comes from a real user in an actual browser, so the website assumes it's not a bot sending the request. Make sure your user-agent is up to date.
Pass params and headers to the request:
html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
- The timeout argument tells requests to stop waiting for a response after 30 seconds.
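If you want the script to fail gracefully instead of crashing on a slow response, here's a possible sketch (reusing the headers and params defined above; Timeout and raise_for_status() are standard requests features):

try:
    html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
    html.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
except requests.exceptions.Timeout:
    print("The request timed out after 30 seconds")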
Create a BeautifulSoup object from the returned HTML and pass the HTML parser, which in this case is lxml:
soup = BeautifulSoup(html.text, "lxml")
Create a dict() to store the extracted app data. Here I'm defining the overall structure of the data and how it could be organized:
app_data = {
    "basic_info": {
        "developer": {},
        "downloads_info": {}
    },
    "user_comments": []
}
App basic info
Match basic and additional app information via regular expressions:
# index [11] holds the basic app information
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>",
str(soup.select("script")[11]), re.DOTALL)[0])
# https://regex101.com/r/6Reb0M/1
additional_basic_info = re.search(fr"<script nonce=\"\w+\">AF_initDataCallback\(.*?(\"{basic_app_info.get('name')}\".*?)\);<\/script>",
str(soup.select("script")), re.M|re.DOTALL).group(1)
- re.findall() finds all matched patterns in the HTML. Follow the commented link to better understand what the regular expression matches.
- \w+ matches one or more word characters.
- (.*?) is a regex capture group (...), and .*? is a non-greedy pattern that captures everything.
- str(soup.select("script")[11]) is the second re.findall() argument, which:
  - tells soup to grab all found script tags,
  - grabs only index [11] from the returned <script> tags,
  - and converts it to a string so the re module can process it.
- re.DOTALL tells re to match everything, including newlines.
- re.M is an alias for re.MULTILINE. It makes ^ and $ match at the start and end of each line, not only of the whole string.
- re.findall()[0] accesses the first element of the returned list of matches, which is the only match in this case, and converts the type from list to str.
- json.loads() converts (deserializes) the parsed JSON into a Python dictionary.
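To make this concrete, here's a minimal, self-contained sketch on a made-up HTML fragment (not a real Play Store response) showing how the ld+json <script> contents end up as a Python dictionary:

from bs4 import BeautifulSoup
import json, re

# made-up fragment for illustration only
html = '<script nonce="abc123" type="application/ld+json">{"name": "Super Mario Run", "@type": "SoftwareApplication"}</script>'
soup = BeautifulSoup(html, "lxml")

# same pattern as above, applied to the only <script> tag in the fragment
match = re.findall(r"<script nonce=\"\w+\" type=\"application/ld\+json\">({.*?)</script>",
                   str(soup.select("script")[0]), re.DOTALL)[0]

data = json.loads(match)
print(data["name"])  # Super Mario Run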
Access the parsed JSON (now a dictionary) from the basic_app_info variable:
app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["type"] = basic_app_info.get("@type")
app_data["basic_info"]["url"] = basic_app_info.get("url")
app_data["basic_info"]["description"] = basic_app_info.get("description").replace("\n", "") # replace new line character to nothing
app_data["basic_info"]["application_category"] = basic_app_info.get("applicationCategory")
app_data["basic_info"]["operating_system"] = basic_app_info.get("operatingSystem")
app_data["basic_info"]["thumbnail"] = basic_app_info.get("image")
app_data["basic_info"]["content_rating"] = basic_app_info.get("contentRating")
app_data["basic_info"]["rating"] = round(float(basic_app_info.get("aggregateRating").get("ratingValue")), 1) # 4.287856 -> 4.3
app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
app_data["basic_info"]["reviews"] = basic_app_info.get("aggregateRating").get("ratingCount")
app_data["basic_info"]["price"] = basic_app_info["offers"][0]["price"]
app_data["basic_info"]["developer"]["name"] = basic_app_info.get("author").get("name")
app_data["basic_info"]["developer"]["url"] = basic_app_info.get("author").get("url")
The next step is extracting additional data, some of which isn't shown on the page, such as the developer email:
# https://regex101.com/r/C1WnuO/1
app_data["basic_info"]["developer"]["email"] = re.search(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", additional_basic_info).group(0)
# https://regex101.com/r/Y2mWEX/1 (a few matches occur, but re.search always returns the first occurrence)
app_data["basic_info"]["release_date"] = re.search(r"\d{1,2}\s[A-Z-a-z]{3}\s\d{4}", additional_basic_info).group(0)
# https://regex101.com/r/7yxDJM/1
# using different groups to extract different data
app_data["basic_info"]["downloads_info"]["long_form_not_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(1)
app_data["basic_info"]["downloads_info"]["long_form_formatted"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(2)
app_data["basic_info"]["downloads_info"]["as_displayed_short_form"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(4)
app_data["basic_info"]["downloads_info"]["actual_downloads"] = re.search(r"\"(\d+,?\d+,?\d+\+)\"\,(\d+),(\d+),\"(\d+M\+)\"", additional_basic_info).group(3)
# ...
try:
    # https://regex101.com/r/C1WnuO/1
    app_data["basic_info"]["video_trailer"] = "".join(re.findall(r"\"(https:\/\/play-games\.\w+\.com\/vp\/mp4\/\d+x\d+\/\S+\.mp4)\"", additional_basic_info)[0])
except IndexError:
    app_data["basic_info"]["video_trailer"] = None
App images
App images are located in the inline JSON, from which we can extract them using a regular expression. Here's an example of where they're located, as well as the app description, which is not currently being extracted:
# https://regex101.com/r/jjsdUP/1
# [2:] skips 2 PEGI logo thumbnails and extracts only app images
app_data["basic_info"]["images"] = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", additional_basic_info)[2:]
App comments
Match user comments data using a regular expression:
# User reviews
# https://regex101.com/r/xDVZq7/1
user_reviews = re.findall(r'Write a short review.*?<script nonce="\w+">AF_initDataCallback\({key:.*data:\[\[\[\"\w.*?\",(.*?)sideChannel: {}}\);<\/script>',
str(soup.select("script")), re.DOTALL)
- re.findall() finds all matched patterns in the HTML. Follow the commented link to better understand what the regular expression matches.
- (.*?) is a regex capture group (...), and .*? is a non-greedy pattern that captures everything.
- re.DOTALL tells re to match everything, including newlines.
The next step is to extract all avatars, ratings, and the comments themselves using re.findall():
# https://regex101.com/r/D6BIBP/1
# [::3] grabs every 3rd match to avoid duplicate avatars
avatars = re.findall(r",\"(https:.*?)\"\].*?\d{1}", str(user_reviews))[::3]
# https://regex101.com/r/18EziQ/1
ratings = re.findall(r"https:.*?\],(\d{1})", str(user_reviews))
# https://regex101.com/r/mSku7n/1
comments = re.findall(r"https:.*?\],\d{1}.*?\"(.*?)\",\[\d+,\d+\]", str(user_reviews))
- \d{1} matches exactly one digit.
Finally, we need to iterate over the multiple iterables (the extracted comments data) and append them to the dictionary:
for comment, rating, avatar in zip(comments, ratings, avatars):
    app_data["user_comments"].append({
        "user_avatar": avatar,
        "user_rating": rating,
        "user_comment": comment
    })
- zip() takes multiple iterables, aggregates them into tuples, and returns an iterator. In this case the number of values is identical for all iterables, e.g. 40 avatars, ratings, and comments.
- append() adds an element to the end of the list.
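As a tiny illustration with made-up values:

comments = ["Great game!", "Too many ads."]
ratings = ["5", "2"]
avatars = ["avatar_url_1", "avatar_url_2"]

# zip() pairs the items position by position and stops at the shortest iterable
for comment, rating, avatar in zip(comments, ratings, avatars):
    print({"user_avatar": avatar, "user_rating": rating, "user_comment": comment})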
Print the data:
print(json.dumps(app_data, indent=2, ensure_ascii=False))
The final step is to add boilerplate code that protects users from accidentally invoking the script when they don't intend to:
if __name__ == "__main__":
# https://stackoverflow.com/a/17533149/15164646
# reruns script if `basic_app_info` or `additional_basic_info` throws an exception due to <script> position change
while True:
try:
google_store_app_data()
except:
pass
else:
break
The while loop is used to rerun the script if an exception occurs. In this case the exception will be an IndexError (or an AttributeError from a failed re.search) raised while building the basic_app_info or additional_basic_info variables.
This error occurs because Google Play changes the position of the <script> elements on each page load: most often the data sits at index [11], but sometimes at a different index. Rerunning the script works around the problem for now.
An obviously better approach would be a more robust way of locating the data, which will be added in the next update of this blog post.
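For example, one possible (untested against every page variant) way to avoid relying on index [11] is to select the ld+json <script> tag by its type attribute instead of its position, reusing the soup object and json import from the code above:

# possible alternative: locate the ld+json <script> by attribute instead of position
ld_json_script = soup.select_one('script[type="application/ld+json"]')
if ld_json_script and ld_json_script.string:
    basic_app_info = json.loads(ld_json_script.string)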
Output
{
"basic_info": {
"developer": {
"name": "Nintendo Co., Ltd.",
"url": "https://supermariorun.com/",
"email": "supermariorun-support@nintendo.co.jp"
},
"downloads_info": {
"long_form_not_formatted": "100,000,000+",
"long_form_formatted": "100000000",
"as_displayed_short_form": "100M+",
"actual_downloads": "211560819"
},
"name": "Super Mario Run",
"type": "SoftwareApplication",
"url": "https://play.google.com/store/apps/details/Super_Mario_Run?id=com.nintendo.zara&hl=en_GB&gl=US",
"description": "Control Mario with just a tap!",
"application_category": "GAME_ACTION",
"operating_system": "ANDROID",
"thumbnail": "https://play-lh.googleusercontent.com/3ZKfMRp_QrdN-LzsZTbXdXBH-LS1iykSg9ikNq_8T2ppc92ltNbFxS-tORxw2-6kGA",
"content_rating": "Everyone",
"rating": 4.0,
"reviews": "1643139",
"price": "0",
"release_date": "22 Mar 2017",
"images": [
"https://play-lh.googleusercontent.com/yT8ZCQHNB_MGT9Oc6mC5_mQS5vZ-5A4fvKQHHOl9NBy8yWGbM5-EFG_uISOXmypBYQ6G",
"https://play-lh.googleusercontent.com/AvRrlEpV8TCryInAnA__FcXqDu5d3i-XrUp8acW2LNmzkU-rFXkAKgmJPA_4AHbNjyY",
"https://play-lh.googleusercontent.com/AESbAa4QFa9-lVJY0vmAWyq2GXysv5VYtpPuDizOQn40jS9Z_ji8HXHA5hnOIzaf_w",
"https://play-lh.googleusercontent.com/KOCWy63UI2p7Fc65_X5gnIHsErEt7gpuKoD-KcvpGfRSHp-4k8YBGyPPopnrNQpdiQ",
"https://play-lh.googleusercontent.com/iDJagD2rKMJ92hNUi5WS2S_mQ6IrKkz6-G8c_zHNU9Ck8XMrZZP-1S_KkDsA6KDJ9No",
"https://play-lh.googleusercontent.com/QsdO8Pn6qxvfAi4es7uicI-xB21dPN3s8SBfmnuXPjFftdXCuugxis7CDJbAkQ_pzA",
"https://play-lh.googleusercontent.com/oEIUG3KTnijbe5TH3HO3NMAF5Ai8LkIAtKOO__TduDq4wOzGQA2PzZlBJg2C4mURDR8",
"https://play-lh.googleusercontent.com/BErkwcIVa4ldoVL56EvGWTQJ2nPu-Y6EFeAS4dfK7l0CufebWdrRC9CduHqNwysPYf8",
"https://play-lh.googleusercontent.com/cw86ny78mbNHVRDLlhw1fxVbZxiYFC7yYDRY3Nt2dnRGihRhxo1eOy4IjrSVVzKW9Is",
"https://play-lh.googleusercontent.com/Kx0gmRSH582Te-BeTo-C87f3hl-2sf7DRaWso3qZ46p9PZ97socE6FuK09vzebVF8AA",
"https://play-lh.googleusercontent.com/OJhOUUZjTUw4e3EEbPlZnuKdmUIGdLSSwUgb5ygPfiO0h1SeHIl3s_L7R8xBDLVnjPU",
"https://play-lh.googleusercontent.com/Z0Ggjrocxk7SRTAhFCL6ZEc04eCAdI09Xf08Th7dfn_ViIBrK7E8Bd1p3Lfi-pjiLLWz",
"https://play-lh.googleusercontent.com/pn58u5DpcUNOgE4NOQc4jFJaFyR3EaiO0YWlekYdQmBV3Q6jrF_ioX78gbtH2eZTTA",
"https://play-lh.googleusercontent.com/EItdRRArK4yI7LPArgKOhwTrcALMSFS41F49dOuX6c8a7XPw20WNfSiDrE7ZnIbTRME",
"https://play-lh.googleusercontent.com/xDFJgEfAPeGcfk72Nfe9jE-7oDyMDYtucW4W0mYh3vV8YgMb2T91BQ1do1r_8fU-Sw",
"https://play-lh.googleusercontent.com/Bn6SFuIjgL8CLHTB6C7t_Dv7MCGwAxh8OIV7z-gKhNpJtxss2Vqwl_50HdHFUyoet7s",
"https://play-lh.googleusercontent.com/eEKSdZPf7yo-WWcb9tGLQ-O17XVbd02rGREHwWC79JDOgVZFTaWmi0s1vg2H4Mn51hI",
"https://play-lh.googleusercontent.com/vlOYHPoi3AwQuAEAuWi1pu37cnxObDelQ5xQQP3ojAmptiJbBereG8Ugvlp_vihDS9c",
"https://play-lh.googleusercontent.com/2PuQ1L2sE0opnEG9AywzAzNBIV0sZo1y1ftrJ518oPwgjtUJ6iUrKskgn8DWRClFQnM",
"https://play-lh.googleusercontent.com/TvcAspZw7Tc1CQV3DJrzPL_I4sACQhvNhDqB90r9yiYfAnPOUk8gi1fFcT1NdAsKG_l-",
"https://play-lh.googleusercontent.com/vpt0r-PxWy2ea8xvuPSg0cn3iNXrS1v6pCFzWSPOane0lkDcfIGoSTvhiFz_en4CePI",
"https://play-lh.googleusercontent.com/3ZKfMRp_QrdN-LzsZTbXdXBH-LS1iykSg9ikNq_8T2ppc92ltNbFxS-tORxw2-6kGA",
"https://play-lh.googleusercontent.com/iTZtyWYr4T-slu1nifgRqEhtMLmxcNagc2rDAyiWntDQWCVLlGR7rDvx0uK6z-zLujwv",
"https://play-lh.googleusercontent.com/iTZtyWYr4T-slu1nifgRqEhtMLmxcNagc2rDAyiWntDQWCVLlGR7rDvx0uK6z-zLujwv"
],
"video_trailer": "https://play-games.googleusercontent.com/vp/mp4/1280x720/qjHSn4GwQWY.mp4"
},
"user_comments": [
{
"user_avatar": "https://play-lh.googleusercontent.com/EGemoI2NTXmTsBVtJqk8jxF9rh8ApRWfsIMQSt2uE4OcpQqbFu7f7NbTK05lx80nuSijCz7sc3a277R67g",
"user_rating": "3",
"user_comment": "Now, while I love the Mario Series, I will say that I am not the biggest fan of this game. When playing Remix 10, I found that the screen lagged for seemingly no reason, which threw me off plenty of times. The level design also seems pretty bland and just the same old settings you see over and over again. Overall I feel like this was just another cash grab from Nintendo, not to mention you actually need to PAY to unlock the rest of the game. But other than that, it looks decent graphic-wise."
}, ... other comments
{
"user_avatar": "https://play-lh.googleusercontent.com/EGemoI2NTXmTsBVtJqk8jxF9rh8ApRWfsIMQSt2uE4OcpQqbFu7f7NbTK05lx80nuSijCz7sc3a277R67g",
"user_rating": "2",
"user_comment": "Too many tutorials that dont even let you play until 5 minutes of tapping the screen. Then after only a few levels you have to pay for the rest of them. Nintendo makes so much money you\\'d think they could make a game that allowed you to pay to remove ads, not pay to play the game you installed in the first place. But when you aren\\'t being forcefed tutorials for a game you won\\'t play that long anyway, the gameplay is actually pretty fun and challenging. Those are the only pros."
}
]
}
Using Google Play Product API from SerpApi
The following section is a comparison between the DIY solution and an API solution. SerpApi also extracts data without browser automation, including extraction of all reviews.
The biggest difference is that SerpApi handles blocks from Google. It removes the need to figure out how to use proxies, solve CAPTCHAs, or which providers are good, and there's no need to maintain the parser if Google Play updates again. Have a look at the links section below to test this code in the online IDE.
Two examples of extracting certain app info and all reviews using SerpApi pagination:
from serpapi import GoogleSearch
from urllib.parse import (parse_qsl, urlsplit)
import os, json
params = {
    "api_key": os.getenv("API_KEY"),          # your serpapi api key
    "engine": "google_play_product",          # parsing engine
    "store": "apps",                          # app page
    "gl": "us",                               # country of the search
    "product_id": "com.MapstarAG.Mapstar3D",  # low review count example to show it exits the while loop
    "all_reviews": "true"                     # shows all reviews
}

search = GoogleSearch(params)  # where data extraction happens


def serpapi_scrape_google_play_app_data():
    results = search.get_dict()

    print(json.dumps(results["product_info"], indent=2, ensure_ascii=False))
    print(json.dumps(results["media"], indent=2, ensure_ascii=False))
    # other data


def serpapi_scrape_google_play_app_reviews():
    # to show the page number
    page_num = 0

    # iterate over all pages
    while True:
        results = search.get_dict()  # JSON -> Python dict

        if "error" in results:
            print(results["error"])
            break

        page_num += 1
        print(f"Current page: {page_num}")

        # iterate over reviews and extract the data
        for result in results.get("reviews", []):
            print(result.get("title"), result.get("date"), sep="\n")

        # check if the next page key is present in the JSON
        # if present -> split URL in parts and update to the next page
        if "next" in results.get("serpapi_pagination", {}):
            search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
        else:
            break
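Both functions reuse the module-level search object created above; to try one of them, just call it (minimal usage example):

if __name__ == "__main__":
    serpapi_scrape_google_play_app_reviews()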