Today we received beta access to OpenAI's Codex and GPT-3 models and started experimenting with automated web scraping.
OpenAI has a good example of prompt design in the documentation for Completion. Following that example, we wrote a prompt containing samples of HTML and the expected results, covering English, French, Chinese, and German.
This is an extractor of the number of search results from HTML
HTML: "<div id="result-stats">About 3,200,000,000 results<nobr> (0.97 seconds) </nobr></div>"
Number of search results: 3200000000
HTML: "<div id="result-stats">About 2,200,000,000 results<nobr> (0.297 seconds) </nobr></div>"
Number of search results: 2200000000
HTML: "<div id="result-stats">Environ 1 400 000 000 résultats<nobr> (1,05 secondes) </nobr></div>"
Number of search results: 1 400 000 000
HTML: "<div id="result-stats">About 1,790,000 results<nobr> (0.55 seconds) </nobr></div>"
Number of search results: 5000000
HTML: "<span class="nums_text">百度为您找到相关结果约100,000,000个</span>"
Number of search results: 100000000
HTML text
1. "<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds) </nobr></div>"
2. "<div id="result-stats">About 953,626,112 results<nobr> (1.29 seconds) </nobr></div>"
3. "<div id="result-stats">Ungefähr 1.240.000.000 Ergebnisse<nobr> (0,72 Sekunden) </nobr></div>"
Extracted number of search results
1. 18000020000
2. 953626112
3. 1240000000
Then we appended the HTML snippets we wanted data extracted from and ended the prompt with an open "1." for the model to complete.
HTML text
1. "<div id="result-stats">About 1,800,0020,000 results<nobr> (0.589 seconds) </nobr></div>"
2. "<div id="result-stats">約 1,510,000,000 件<nobr> (0.82 秒) </nobr></div>"
3. "<div id="result-stats">Aproximadamente 2.180.000.000 resultados<nobr> (0,73 segundos) </nobr></div>"
4. "<div id="result-stats">Sekitar 2.480.000.000 hasil<nobr> (0,72 detik) </nobr></div>"
5. "<div id="result-stats">حوالى ١٧٬٤٤٠٬٠٠٠٬٠٠٠ نتيجة<nobr> (٠٫٩٠ ثانية) </nobr></div>"
6. "<div id="result-stats">Yaklaşık 2.680.000.000 sonuç bulundu<nobr> (0,62 saniye) </nobr></div>"
7. "<div id="result-stats">Приблизна кількість результатів: 2 630 000 000<nobr> (1,38 с) </nobr></div>"
8. "<div id="result-stats">Aproximadamente 19.250.000.000 resultados<nobr> (0,73 segundos) </nobr></div>"
9. "<div id="result-stats">Ungefär 1 960 000 000 resultat<nobr> (0,80 sekunder) </nobr></div>"
10. "<div id="result-stats">Περίπου 2.480.000.000 αποτελέσματα<nobr> (0,76 δευτερόλεπτα) </nobr></div>"
Extracted number of search results
1.
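The prompt above follows a simple pattern: a task description, a few worked examples, the numbered inputs, and an open "1." as the cue for the model to continue. Here is a minimal Python sketch of assembling such a prompt. We built ours by hand in the Playground, so `build_prompt` and its parameters are hypothetical names, and the sketch leaves out the second, numbered block of examples for brevity.

```python
def build_prompt(task, examples, inputs):
    """Assemble a few-shot prompt as plain text.

    task     -- one-line description of the extraction task
    examples -- list of (html, expected_number) pairs shown to the model
    inputs   -- list of HTML snippets the model should process
    """
    lines = [task]
    # Worked examples in the "HTML: ... / Number of search results: ..." format.
    for html, number in examples:
        lines.append(f'HTML: "{html}"')
        lines.append(f"Number of search results: {number}")
    # Numbered inputs the model has not seen before.
    lines.append("HTML text")
    for index, html in enumerate(inputs, start=1):
        lines.append(f'{index}. "{html}"')
    # Open cue: the model is expected to continue the numbered list from here.
    lines.append("Extracted number of search results")
    lines.append("1.")
    return "\n".join(lines)
```

Ending on the open "1." matters: it nudges the model to answer as a numbered list that lines up with the numbered inputs.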
The model was able to extract data from other languages too: Japanese, Ukrainian, Greek, Turkish, Spanish. Absolutely amazing!
Extracted number of search results
1. 18000020000
2. 15100000
3. 21800000
4. 24800000000
5. 196000000
6. 268000000
7. 26300000000
8. 19000000
9. 19600000000
10. 2480000000
It extracted the number written in Arabic digits (#5) incorrectly, probably because the prompt contained no examples with such digits.
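One way to spot mistakes like this without reading every line is to compare the completion against a simple deterministic baseline. The sketch below was not part of our experiment. `extract_count` is a hypothetical helper, and the set of thousands separators it accepts is an assumption based on the samples above.

```python
import re


def extract_count(html):
    """Return the first number found in the snippet as an int, or None.

    Assumes the result count is the first number in the text and that
    digit groups are separated by ",", ".", whitespace or U+066C.
    """
    # Drop the "(0.97 seconds)" timing part so it cannot be picked up.
    text = re.sub(r"<nobr>.*?</nobr>", " ", html, flags=re.S)
    # Strip the remaining tags, keeping only the visible text.
    text = re.sub(r"<[^>]+>", " ", text)
    # \d matches any Unicode decimal digit, including Arabic-Indic digits.
    match = re.search(r"\d(?:[\s.,\u066c]?\d)*", text)
    if match is None:
        return None
    # Remove separators; int() accepts Unicode decimal digits as well.
    return int(re.sub(r"\D", "", match.group()))
```

Any line where the model's answer differs from `extract_count` is worth a second look. The baseline is only as good as its assumptions, so a mismatch does not tell you which side is wrong.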
We used the Davinci model and default parameters in the OpenAI Playground.
Here is a video of data extraction on OpenAI Playground:
Next time we will use the Codex model to generate a Ruby or Python program that extracts the number of search results. The end goal is to replace some of our hand-crafted parsers with automated data extraction.
Links
OpenAI Playground • Request beta access for OpenAI • Try SerpApi for free
Outro
If you have any questions, or an idea for how to extract data from SERPs automatically and reliably, feel free to drop us a comment on Twitter at @serp_api.