Posted on Oct 16, 2017

Crawling Websites in React-Native

#crawling #coding #tutorial

Coming from years of web development, React-Native feels like a fresh start to me. You get better access to native functionality, and fewer rules are imposed on your app. For example, you can use fetch() to get any website you want. What this enables is client-side web crawling.

Why

Maybe you need data from a service, but they don't expose an API or the API doesn't give you all the data you need or the API is simply bad. Normally you would have to set up a server that crawls the target website and turns it into an API that you can use, but when you can access all data from all websites inside your client, you can save time.

Let's take the Amazon website as an example. You want to show all products of a page and offer a way to load the next page, but you want the data in your own data structure, so you can build your own UI around it.

How

Get the HTML from the server
Extract the needed data from the HTML
Reshape the data for our use

1 Get the HTML from the Server

That's the easy part.

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl);   // fetch page

  const htmlString = await response.text();  // get response text
  ...
}

Fetching a URL with a search pattern returns an HTML page with some items.
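
The fetch call above assumes the request succeeds. A minimal sketch of a slightly more defensive variant follows; `fetchHtml` and its `fetchImpl` parameter are hypothetical names that are not part of the original code, and `fetchImpl` only exists so the function can be tried without a network.

```javascript
// Hypothetical helper, not part of the original post.
// fetchImpl defaults to the global fetch and can be swapped for a stub.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    // response.ok is false for any status outside the 200-299 range
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text(); // resolves to the HTML as a string
}
```

Inside loadGraphicCards this could stand in for the separate fetch() and response.text() calls, so a blocked or failed request surfaces as an error instead of an empty result list.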

2 Extract the Needed Data from the HTML

This is a bit trickier. The data is inside the HTML, but it's a string.

The naive approach would be to use a regular expression to parse the string and get the data, but HTML doesn't have a regular grammar, so that wouldn't work reliably.
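
A small illustration of how the regular expression approach breaks down. The HTML snippet and the pattern are made up for this sketch and are not taken from Amazon's markup.

```javascript
// Made-up markup: the first list item contains a nested list.
const html = "<li>GPU A <ul><li>8 GB</li></ul></li><li>GPU B</li>";

// Naive pattern: everything between <li> and the next </li>.
const itemPattern = /<li>(.*?)<\/li>/g;

const captured = [...html.matchAll(itemPattern)].map(match => match[1]);

// captured[0] is "GPU A <ul><li>8 GB": the match stops at the inner </li>,
// so the first item is cut off in the middle of its nested list.
console.log(captured);
```

The pattern has no way of knowing which closing tag belongs to which opening tag, which is exactly the bookkeeping a parser does.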

The better way is to use an HTML parser and CSS selectors.

Cheerio is this solution. It comes with an HTML parser and a re-implementation of jQuery's core functionality, so you can use it on Node.js.

The problem is that React-Native is missing most Node.js packages, so Cheerio doesn't work there.

I searched for quite some time to find a re-implementation of Cheerio that works on React-Native: cheerio-without-node-native. The naming of the package was a bit strange, haha.

But with this, extracting the data is now child's play too.

import cheerio from "cheerio-without-node-native"; // Cheerio port for React-Native

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl);      // fetch page 

  const htmlString = await response.text();     // get response text
  const $ = cheerio.load(htmlString);           // parse HTML string

  const liList = $("#s-results-list-atf > li"); // select result <li>s
  ...
}

3 Reshape the Data for further Use

After the data has been extracted from the HTML, we can start to reshape it for our use cases. The line between extraction and reshaping is a bit blurry here: the <li>s we selected are full of markup, and getting the right data out of them is extraction too. In practice, these two steps often go hand in hand.

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl);  // fetch page 

  const htmlString = await response.text(); // get response text
  const $ = cheerio.load(htmlString);       // parse HTML string

  return $("#s-results-list-atf > li")      // select result <li>s
    .map((_, li) => ({                      // map to a list of objects
      asin: $(li).data("asin"),
      title: $("h2", li).text(),
      price: $("span.a-color-price", li).text(),
      rating: $("span.a-icon-alt", li).text(),
      imageUrl: $("img.s-access-image", li).attr("src") // scoped to this <li>
    }))
    .get();                                 // unwrap into a plain array
}

This is not a robust example, but I think you get the idea. We can now use the new list of objects in our app to make our own UI for the Amazon results.
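
One way to push the reshaping a bit further is to turn the extracted strings into numbers. The sketch below assumes the price and rating texts look like "EUR 249,90" and "4,5 von 5 Sternen"; these formats are assumptions about the amazon.de markup, and `parseGermanNumber` and `normalizeItem` are hypothetical helper names.

```javascript
// Hypothetical helpers; the input formats are assumptions, not verified markup.

// Finds the first number written in German notation ("1.249,90")
// and converts it to a JavaScript number. Returns null if none is found.
function parseGermanNumber(text) {
  const match = /\d+(?:\.\d{3})*(?:,\d+)?/.exec(text || "");
  if (!match) return null;
  return Number(match[0].replace(/\./g, "").replace(",", "."));
}

// Takes one object as produced by loadGraphicCards and cleans it up.
function normalizeItem(raw) {
  return {
    ...raw,
    title: (raw.title || "").trim(),
    price: parseGermanNumber(raw.price),
    rating: parseGermanNumber(raw.rating),
  };
}
```

Applied to the crawled list, this would be a single `items.map(normalizeItem)` before the data reaches the UI.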


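import React from "react";
import {Image, ScrollView, Text, TouchableOpacity} from "react-native";

class App extends React.Component {
  state = {
    page: 0,
    items: [],
  };

  componentDidMount = () => this.loadNextPage();

  // setState updaters must be synchronous, so await the data first
  loadNextPage = async () => {
    const page = this.state.page + 1;
    const items = await loadGraphicCards(page);
    this.setState({items, page});
  };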

  render = () => (
    <ScrollView>
      {this.state.items.map(item => <Item {...item} key={item.asin}/>)}
    </ScrollView>
  );
}

const Item = props => (
  <TouchableOpacity onPress={() => alert("ASIN:" + props.asin)}>
    <Text>{props.title}</Text>
    {/* remote images need explicit dimensions to be visible */}
    <Image source={{uri: props.imageUrl}} style={{width: 100, height: 100}}/>
    <Text>{props.price}</Text>
    <Text>{props.rating}</Text>
  </TouchableOpacity>
);
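
The component above replaces the list every time a page is loaded. If you would rather keep earlier results and append the next page, a merge step like the following could help. This is a sketch; `appendPage` is a hypothetical helper, and it assumes every item carries an `asin`.

```javascript
// Hypothetical helper: merge a newly crawled page into the existing list.
// Items whose asin is already present are skipped, which keeps the
// React keys (key={item.asin}) unique.
function appendPage(existingItems, newItems) {
  const seen = new Set(existingItems.map(item => item.asin));
  const merged = [...existingItems];
  for (const item of newItems) {
    if (!seen.has(item.asin)) {
      seen.add(item.asin);
      merged.push(item);
    }
  }
  return merged;
}
```

Once the new items have been loaded, the state update could then be written as `this.setState(state => ({items: appendPage(state.items, items), page}))`.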

Conclusion

As with most problems, if you have the right tools, solutions can become simple. Often the problem is more about finding these tools :D

This client-side crawling approach can be used to build quick prototypes without the need for an API. Amazon is nice enough to deliver okay-ish static HTML, so it works rather well on their sites.

Top comments (17)

K • Jun 16 '22

Glad this article is still helpful after all that time :D

Acaraccioli • May 13 '20

Hello K, great post, I learned a lot. I didn't know this was possible using fetch. Just one quick question: how would you manage it if you wanted to fetch some quick data in the front end (React) but had to enter information in an input tag and maybe even click a button? I hope you can help me out a bit.

K • May 14 '20

Glad you liked it.

I'd use React hooks.

function MyComponent(props) {
  const [info, setInfo] = React.useState("");
  const [remoteData, setRemoteData] = React.useState(
    "No data fetched yet!"
  );

  async function load() {
    const response = await fetch(
      "http://example.com?info=" + encodeURIComponent(info)
    );
    const text = await response.text();
    setRemoteData(text);
  }

  return (
    <div>
      <input
        value={info}
        onChange={(e) => setInfo(e.target.value)}
      />
      <button onClick={load}>Fetch</button>
      <textarea value={remoteData} readOnly />
    </div>
  );
}

Acaraccioli • May 14 '20

That is great, I think that might work, thanks a lot! I'm just trying to figure out this error:
Access to fetch at 'MyUrl' from origin 'localhost:8100' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's. I've found that adding {mode:"no-cors"} to the fetch call makes the error go away, but the object returns null. Do you know anything about this kind of error?

Hardiansa • Mar 1 '19

Hi K, this is an awesome post.
I already tried it and it's working fine.

But I have a problem:
I'm trying to scrape a streaming service website (anime).

But there is no video tag inside the website.

So I inspected the elements and saw that the website needs a click on the "play" button; after that I get an iframe with an embedded HTML document that contains the video tag.

So how can I do a click in Cheerio and then get the embedded video?

Thanks

K • Mar 1 '19

Sorry, I don't know if Cheerio works with JavaScript sites.

One way to solve this would be to check if you could calculate the video URL from the data that is already in the HTML.

Otherwise I don't know.

Bagustyo • Nov 14 '18

Why async? What if I just use fetch?

K • Nov 14 '18

You can use fetch without async/await. React-Native supports async/await, which is why I used it, but it isn't needed; you can use promises directly :)
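
For illustration, a sketch of the same request written as a plain promise chain. `loadHtml` and its `fetchImpl` parameter are made-up names; `fetchImpl` is only there so the function can be tried with a stub.

```javascript
// Same idea as the async/await version, written with .then().
// fetchImpl defaults to the global fetch.
function loadHtml(url, fetchImpl = fetch) {
  return fetchImpl(url)
    .then(response => response.text()); // resolves to the HTML string
}
```

The caller then continues the chain, for example `loadHtml(searchUrl).then(htmlString => cheerio.load(htmlString))`.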

Bart Karalus • Oct 16 '17

Nice one! I had no clue there was a jquery-like tool for RN. Very useful.

Martin Sone • Aug 18 '18

Hi K, could the above crawling be applied to React.js, or only to React-Native?

K • Aug 18 '18

Only React-Native, because you can't access other websites from within a browser, only sites from the same domain or ones that are CORS-enabled.

Prakort Lean • Aug 12 '19

Please go in depth, i couldn't get cheerio-without-node-native to work

Richard Joseph • Jan 27 '21

There's also react-native-cheerio, I've not yet used it myself but, obviously, I'm doing the research, also.

ynstl • May 31 '19

No, it's not working. ERROR.
ESLint Parsing error: Unexpected token

lilraja-x • Dec 25 '22

I'm new to this React-Native thing... I've tried your exact method, yet nothing is displayed in my React-Native mobile app.
I'm new to this, so your help will matter a lot.

lilraja-x • Dec 25 '22

There's no error but also nothing's displayed.
