10.02.2019 - Jay M. Patel - Reading time ~3 Minutes
Web scraping, also called web harvesting, web data extraction, or even web data mining, is the use of a software program or code to automate the downloading and parsing of data from the web.
Nowadays many websites, such as Twitter and Facebook, provide REST-based application programming interfaces (APIs) for programmatically consuming the structured data available on their websites. Data obtained that way is usually not only “cleaner” but also easier and more hassle-free to get than data obtained by web scraping. I always try to use REST APIs if I can and only resort to web scraping if that’s not an option.
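To make the "downloading and parsing" part concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, the `LinkParser` class and the `scrape` function are made up for illustration; a real scraper would first download the page (for example with `urllib.request`) instead of using a hard-coded string.

```python
from html.parser import HTMLParser

# Hypothetical page content used in place of a real download, so the
# example runs without network access.
SAMPLE_HTML = """
<html>
  <head><title>Example store</title></head>
  <body>
    <a href="/products/1">Widget</a>
    <a href="/products/2">Gadget</a>
  </body>
</html>
"""


class LinkParser(HTMLParser):
    """Collects the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; everything else is ignored.
        if self._in_title:
            self.title += data


def scrape(html):
    """Parse an HTML string into a small structured record."""
    parser = LinkParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "links": parser.links}


result = scrape(SAMPLE_HTML)
print(result)
```

The point is only that unstructured HTML goes in and structured data comes out; an API would hand you something like the final dictionary directly, which is why it is usually the easier route.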
Why is web scraping essential?
The top reasons why web scraping is essential for data mining are:
The website you want to extract data from does not provide a public API.
If there is an API, the free tier is rate limited, meaning you can call it only a certain number of times, and the paid tier is cost-prohibitive for your intended use case, whereas accessing the website itself is free.
The API does not expose all the data you wish to obtain, even in its paid tier, whereas the website contains that information.
Who uses web scraping?
Marketing and lead generation: Businesses use web scraping to build marketing databases and generate leads by extracting email addresses and phone numbers from directories such as Yellowpages and Yelp.
Search engines: General-purpose search engines like Google and Bing run large-scale web scrapers called web crawlers, which fetch billions of webpages and then index and rank them according to various natural language processing and web graph algorithms. These power not only their core search functionality but also products like Google advertising and Google Translate.
Vertical search engines for recruitment, real estate and travel: Websites such as indeed.com, Expedia and Kayak all run web scrapers/crawlers to gather data from a specific segment of online content, which they process further to extract more relevant information. In the case of indeed.com, that means fields such as the company name, city, state and job title, which users can use to filter the search results. The same is true of all search engines: web scraping is at the core of their product, and the only differences between them are the segment they operate in and the algorithms they use to process the HTML content and extract the fields that power the search filters.
Brand, competitor and price monitoring: Companies use web scraping to monitor prices of various products on ecommerce sites, as well as customer reviews, social media posts and news articles, not just for their own brands but also for their competitors. This data helps companies understand how effective their current marketing funnel has been and also lets them get ahead of any negative reviews before they cause a noticeable impact on sales.
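As a rough illustration of the vertical search case above, the sketch below pulls a few fields out of job listing HTML and uses them as search filters. The markup, the class names (`job`, `title`, `company`, `city`) and the helper names are all invented for this example; real sites use their own structure, and this is not how indeed.com or any other site actually does it.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, hard-coded so the example is self-contained.
SAMPLE_HTML = """
<div class="job"><span class="title">Data Engineer</span><span class="company">Acme</span><span class="city">Austin</span></div>
<div class="job"><span class="title">Web Developer</span><span class="company">Globex</span><span class="city">Boston</span></div>
"""

# Fields we want to extract, matched against each <span>'s class attribute.
FIELDS = {"title", "company", "city"}


class JobParser(HTMLParser):
    """Builds one dict per <div class="job"> from its labelled <span> tags."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "job":
            self.jobs.append({})
        elif tag == "span" and cls in FIELDS and self.jobs:
            self._field = cls

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            job = self.jobs[-1]
            job[self._field] = job.get(self._field, "") + data


def extract_jobs(html):
    """Turn listing HTML into a list of structured records."""
    parser = JobParser()
    parser.feed(html)
    return parser.jobs


def filter_jobs(jobs, **criteria):
    """Keep only the jobs whose fields match every given criterion."""
    return [
        job for job in jobs
        if all(job.get(key) == value for key, value in criteria.items())
    ]


jobs = extract_jobs(SAMPLE_HTML)
austin_jobs = filter_jobs(jobs, city="Austin")
print(austin_jobs)
```

Once the fields are extracted, the filtering step is trivial; most of the effort in a real vertical search engine goes into extracting those fields reliably from many differently structured sites.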