When scraping e-commerce sites, you often end up with the same product listed under different names, for example:
Msi RTX 2070 Ti
Nvidia rtx 2070ti MSI 8gb
There are more complicated examples too. I wonder if there is a neural net that can classify these kinds of listings as the same product.
Top comments (4)
Rather than a neural network, I think you should look for more differentiating features, say from the description, and use them to remove the duplicate products. For instance, the model number will be the same for both of the products above. If you are using a custom scraper, it should be doable: find the model number on the details page and make it the primary key in your database.
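To make the primary-key idea concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table names, the `save_listing` helper, and the placeholder model number are all hypothetical, and it assumes your scraper has already extracted the model number from each details page.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) schema: one row per product, many listings."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (model_number TEXT PRIMARY KEY)"
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               site TEXT NOT NULL,
               title TEXT NOT NULL,
               model_number TEXT NOT NULL REFERENCES products(model_number),
               PRIMARY KEY (site, title)
           )"""
    )
    return conn


def save_listing(conn, site, title, model_number):
    """Store a scraped listing under its model number.

    The model number is trimmed and upper-cased so that small formatting
    differences between sites do not create duplicate products.
    """
    key = model_number.strip().upper()
    # The primary key makes a second insert of the same product a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO products (model_number) VALUES (?)", (key,)
    )
    conn.execute(
        "INSERT OR REPLACE INTO listings (site, title, model_number) "
        "VALUES (?, ?, ?)",
        (site, title, key),
    )
    conn.commit()
    return key


if __name__ == "__main__":
    conn = open_db()
    # "EXAMPLE-MODEL-001" is a placeholder, not a real model number.
    save_listing(conn, "site-a", "Msi RTX 2070 Ti", "example-model-001")
    save_listing(conn, "site-b", "Nvidia rtx 2070ti MSI 8gb", " EXAMPLE-MODEL-001 ")
    print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])
```

Both listings end up attached to a single product row, so the differing titles no longer matter once the model number is known.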
That's doable for two or three websites, but I am scraping around eight.
Yes, I think it's still doable, but I will still try to find a network in case one is available. Also, you will be adding a lot of overhead if you use a neural net, not to mention the resources you'd need to run it on a server.
A good tokenizer and pairwise Levenshtein distance should be a good starting point. en.wikipedia.org/wiki/Levenshtein_...
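As a rough starting point, the tokenizer plus Levenshtein idea could look like the sketch below. The `tokenize`, `normalize`, and `similarity` helpers are my own illustrative names, and the tokenizer rule (lowercase, split letters from digits, sort the tokens) is just one assumption about what "good" means here. Any similarity threshold you pick would need tuning on your own data.

```python
import re


def tokenize(name):
    """Lowercase and split into runs of letters or digits ("2070ti" -> "2070", "ti")."""
    return re.findall(r"[a-z]+|\d+", name.lower())


def normalize(name):
    """Sort tokens so that word order does not affect the comparison."""
    return " ".join(sorted(tokenize(name)))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def similarity(name_a, name_b):
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    na, nb = normalize(name_a), normalize(name_b)
    if not na and not nb:
        return 1.0
    return 1.0 - levenshtein(na, nb) / max(len(na), len(nb))


if __name__ == "__main__":
    print(similarity("Msi RTX 2070 Ti", "Nvidia rtx 2070ti MSI 8gb"))
```

Extra tokens such as the brand or memory size still pull the score down, so in practice you may want to compare only a subset of tokens or combine this with the model-number approach.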