Scraping JavaScript pages with Scrapy and Splash




This video is part of the “Learn Scrapy” series. In it, you’ll learn how to use Splash to render JavaScript-based pages for your Scrapy spiders.

Have a look at the companion website: https://learn.scrapinghub.com/scrapy/

– Splash docs: https://splash.readthedocs.io/en/stable/
– Scrapy-Splash plugin: https://github.com/scrapy-plugins/scrapy-splash

Settings for scrapy-splash (add these to your project's settings.py):

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
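
Before wiring these settings into a spider, it can help to confirm that Splash is actually reachable at SPLASH_URL. The following is a minimal sketch using only the Python standard library. It assumes Splash is running locally on port 8050 and calls Splash's render.html HTTP endpoint with its url and wait parameters. The helper names build_render_url and fetch_rendered_html are made up for illustration, and the UTF-8 decoding is an assumption about the response.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Same value as SPLASH_URL in the settings above.
SPLASH_URL = "http://localhost:8050"


def build_render_url(page_url, wait=0.5, splash_url=SPLASH_URL):
    """Build the render.html URL for page_url (hypothetical helper).

    `wait` is the number of seconds Splash waits after the page loads,
    which gives the page's JavaScript time to run.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5, timeout=30):
    """Ask Splash to render page_url and return the resulting HTML.

    Requires a running Splash instance; the response is assumed to be UTF-8.
    """
    with urlopen(build_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With Splash running, calling fetch_rendered_html with the address of a JavaScript-driven page should return the HTML as it looks after the scripts have executed. Inside a spider, the scrapy-splash plugin does the equivalent for you: yielding SplashRequest(url, self.parse, args={'wait': 0.5}) instead of a plain scrapy.Request routes the request through Splash.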



17 responses to “Scraping JavaScript pages with Scrapy and Splash”

  1. The login form works with Selenium but not with Scrapy. However, I'm facing an issue: I can submit the form in a regular Chrome browser, but I cannot submit it through Selenium. Can Scrapy solve this issue?

  2. This needs more documentation and examples. I tried applying Splash to my project but abandoned it because I did not know how to set up a navigation flow with Splash. Could you please provide more real-life examples?

  3. Hi guys, three questions. First, why do we run Scrapy in a Docker container? Second, pip install doesn't work in my case and gives an expectation error. Lastly, I usually run spiders without creating projects. Do I have to create a project to use Splash?

  4. Hey, please add a note to the video description about how to stop the Splash container and free up the port. I had never used Docker before, so I had no idea how to do it. For those like me: first press Ctrl+C to exit the "docker run 8050:8050 …" command. Then type "docker ps" (without quotes) and copy the Container ID of scrapinghub/splash, which looks something like 31bbfd572c09 (yours will be different). Then type "docker stop [container_id]", which in my case is "docker stop 31bbfd572c09". Finally, confirm that it has stopped by running "docker ps" again; this time it won't show any container.
