Dynamic Javascript Scraping – Web scraping with Beautiful Soup 4 p.4

Welcome to part 4 of the web scraping with Beautiful Soup 4 tutorial mini-series. Here, we're going to discuss how to parse data that is dynamically updated via JavaScript.

Many websites supply data that is loaded dynamically via JavaScript. In Python, you can use Jinja templating to fill in data on the server without JavaScript, but many websites use JavaScript to populate data in the browser. To simulate this, I have added some JavaScript to the sample page: https://pythonprogramming.net/parsememcparseface/
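
To see why this matters, here is a minimal sketch of what a plain Beautiful Soup parse returns for markup of this kind. The HTML string below is a hypothetical stand-in for the sample page (the placeholder text and the `yesnojs` id are assumptions, and nothing is fetched from the network), and `static_text` is a helper name made up for this illustration. Only the `jstest` class and the "Look at you shinin!" string come from the discussion on this page.

```python
import bs4 as bs

# Hypothetical stand-in for the sample page's markup (assumption, not fetched).
SAMPLE_HTML = """
<html><body>
<p class="jstest" id="yesnojs">placeholder before JavaScript runs</p>
<script>
document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
</script>
</body></html>
"""


def static_text(html, css_class):
    """Return the text of the first <p> with css_class, or None if absent."""
    soup = bs.BeautifulSoup(html, "html.parser")
    tag = soup.find("p", class_=css_class)
    # soup.find returns None when nothing matches, so guard before using .text
    return tag.text if tag is not None else None


js_text = static_text(SAMPLE_HTML, "jstest")
missing = static_text(SAMPLE_HTML, "no-such-class")

print(js_text)   # the placeholder, because no JavaScript was executed
print(missing)   # None, because no such element exists
```

Beautiful Soup only parses the markup it is given and never runs the script, so the paragraph keeps its placeholder text. The `None` guard also shows where an `AttributeError: 'NoneType' object has no attribute 'text'` comes from when the element is not in the parsed source.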


August 21, 2017

d4mer

Javascript

beautiful soup, beautiful soup 4, Dynamic, javascript, javascript tutorial, parse, parsing, python, scrape, spider, tutorial, web scraping

40 responses to “Dynamic Javascript Scraping – Web scraping with Beautiful Soup 4 p.4”

Simon Chan says:

August 21, 2017 at 07:29

Hello Harrison. Did you eventually make that tutorial on multiprocessing / multithreading with PyQt?

Eric Choi says:

August 21, 2017 at 07:29

Hi, thanks for making this tutorial. Can you also provide the code for PyQt5? I've tried installing PyQt4 but I just couldn't get it to install. I have no choice but to work with the PyQt5 that comes with Python 3.6.

segun oyebode says:

August 21, 2017 at 07:29

I am scraping a page that requires a login. I can log in with my code, but I can't scrape the data from the table because it is dynamic. How can I do that with PyQt, including the login?

Product Dutt says:

August 21, 2017 at 07:29

"Cannot connect to X server" what is the issue?

Harsha’s Python Guide says:

August 21, 2017 at 07:29

Thanks for the playlist..

Samir Saci says:

August 21, 2017 at 07:29

If you got the error for js_test.text : be sure to have urllib.request.urlopen(link) and not urllib.request.urlopen(link).read()

Huan Wang says:

August 21, 2017 at 07:29

Hi sentdex, thank you very much for sharing your Python programming experience. May I ask a question? Is it possible to extract the information "Look at you shinin!" between the <script> tag without mimicking the browser?
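
On the question above, one possible approach, sketched under assumptions: if the string is written literally inside the page's `<script>` element, it can be pulled out of the raw source with a regular expression, with no browser involved. The HTML below is a hypothetical stand-in for the sample page, and `strings_assigned_in_scripts` is a name invented for this sketch. This only works when the value is hard-coded in the script. It will not help when the script computes the value or fetches it from a server.

```python
import re

# Hypothetical stand-in for the sample page's source (assumption, not fetched).
SAMPLE_HTML = """
<p class="jstest" id="yesnojs">placeholder before JavaScript runs</p>
<script>
document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
</script>
"""

# Body of each <script> element.
SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
# A quoted string literal assigned to innerHTML; \1 matches the opening quote.
ASSIGN_RE = re.compile(r"innerHTML\s*=\s*(['\"])(.*?)\1")


def strings_assigned_in_scripts(html):
    """Return every string literal assigned to innerHTML inside <script> tags."""
    found = []
    for script_body in SCRIPT_RE.findall(html):
        for _quote, value in ASSIGN_RE.findall(script_body):
            found.append(value)
    return found


result = strings_assigned_in_scripts(SAMPLE_HTML)
print(result)
```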

Md Sarwar says:

August 21, 2017 at 07:29

2:20
When I run the code, it shows this error:
Traceback (most recent call last):
  File "C:\Users\username\Desktop\a.py", line 9, in <module>
    print(js_test.text)
AttributeError: 'NoneType' object has no attribute 'text'

Naimur Rahman says:

August 21, 2017 at 07:29

is there any way to use it in a py 'Qt designer' Gui app?
as QApplication(sys.argv) is called twice then and so new event loop is created and function fails to execute..

any solution? :/

Еркін Абдукаримов says:

August 21, 2017 at 07:29

Hello, I want to parse a website where new information is added to a table as you scroll down. How can I parse all of the 'td.text' values?

Yawgmoth1806 says:

August 21, 2017 at 07:29

Hi,
I've just seen your video and it helped me understand the principle behind scraping dynamic pages. I tried the code on your page and it worked fine, but I ran into a problem: I tried it on another website, and after about 15 minutes the line "client_response = Client(url)" was still executing. Does scraping like this take an eternity for bigger sites? Or is something wrong with the code?
I am using Python 3.6 and PyQt 4.11.
Regards

Sohil Luhar says:

August 21, 2017 at 07:29

I get this error when I try to run it. Please help.

File "D:/Python/test.py", line 20
url = 'https://pythonprogramming.net/parsememcparseface/'
^
IndentationError: unindent does not match any outer indentation level

ekbastu says:

August 21, 2017 at 07:29

write a book dude…..

Qian Li says:

August 21, 2017 at 07:29

Can you make a tutorial explaining how to import data from a web page that contains a list of links, where each link points to a different dataset? I wonder how to import those datasets from the links on the same web page and combine them in a dataframe. Thaaaaaanksssss……

PRASHANT GOYAL 4-Yr B.Tech. Chemical Engg. says:

August 21, 2017 at 07:29

I wanted to know how I can use this to scrape the titles of all the videos in a YouTube playlist of more than 100 videos. Can anyone help?

subhrajit mohanty says:

August 21, 2017 at 07:29

I want to scrape a website where review comments load when you click "read more". Could you please suggest what I have to do? I am new to web scraping.

CariagaXIII says:

August 21, 2017 at 07:29

Working code:
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
import bs4 as bs
import urllib.request

class Client(QWebPage):
    def __init__(self, url):
        # A QApplication must exist before any QWebPage is created
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        # Leave the event loop once the page has finished loading
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()

url = 'https://pythonprogramming.net/parsememcparseface/'
client_response = Client(url)
# HTML of the page as rendered by QtWebKit
source = client_response.mainFrame().toHtml()

soup = bs.BeautifulSoup(source, 'lxml')
js_test = soup.find('p', class_='jstest')
print(js_test.text)

Mark Jay says:

August 21, 2017 at 07:29

dang I wanted to see a Qt browser

Efthimis Ath says:

August 21, 2017 at 07:29

Nice tutorial. I am trying to scrape some data from the website http://www.airlinequality.com but I don't know why the code below is not working. Can you help me?

from bs4 import BeautifulSoup
import os
import urllib.request
import re

thepage = urllib.request.urlopen("http://www.airlinequality.com/airline-reviews/aegean-airlines")
soup = BeautifulSoup(thepage, "lxml")
#print(soup)
for profile in soup.findAll('article', {"itemprop": "review"}):
    image = profile.text
    print(image)

aeroplaneman747 says:

August 21, 2017 at 07:29

Thanks a lot for this great tutorial! It works really nicely for scraping a single page, but when looping through multiple pages it retrieves all the HTML and then throws this error at the end:
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
Any idea on how to fix this?

Mitchell Woodin says:

August 21, 2017 at 07:29

Is there any way to scrape comments from html to be able to manipulate that text?

I can't seem to use soup.find_all('<!--') to pull it out.
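
One way this is commonly handled, as a minimal sketch: Beautiful Soup represents HTML comments as `Comment` objects (a kind of string node), not as tags, so searching for a tag named `<!--` finds nothing. Filtering strings by type does find them. The HTML snippet here is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup for illustration only.
html = "<div><!-- hidden note --><p>visible</p><!--second--></div>"
soup = BeautifulSoup(html, "html.parser")

# Comments are string nodes of type Comment, so filter strings by type.
comment_nodes = soup.find_all(string=lambda s: isinstance(s, Comment))

# Each node behaves like a str, so normal string methods apply.
comments = [c.strip() for c in comment_nodes]
print(comments)
```

Because each `Comment` is a string subclass, the results can be stripped, split, or searched like any other text once they are collected.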

West Jr says:

August 21, 2017 at 07:29

would this work with data generated from react.js??

Eric Roque says:

August 21, 2017 at 07:29

Hey. Is there any way I can use Beautiful Soup to fill out forms, click a button, then scrape information off of a page?

I want to create a web scraper/crawler that will scrape textbook information off of an online textbook store. To search for the textbook, I need to fill out a form and pick several options (department, term, course, section, etc), click a submit button, and wait for the page to load.

Any ideas?

Thanks.

chari Muvilla says:

August 21, 2017 at 07:29

It's amazing how every time I have a problem in Python I run into one of your tutorials and solve it XD. Just thank you. But I still have a question:
To make the program lighter when a page has several scripts, can you somehow run only one of them?
Thanks again for the tutorials :p

Mahmoud Talebi says:

August 21, 2017 at 07:29

Hi,
Can we install PyQt4 on CentOS 6?
Alternatively, I want to develop a web app and upload it to a VPS host for extracting data. PhantomJS causes so many problems in cgi-bin, so I thought QtWebKit could be better.

Chris Grippo says:

August 21, 2017 at 07:29

I'm getting an error "AttributeError: 'Client' object has no attribute 'mainFrame'" any thoughts on how to fix this? I'm using Python 3 and PyQt5.

For PyQt5 I used:
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebEnginePage

I can't figure out what's causing that.

Sora Amm Keyheart says:

August 21, 2017 at 07:29

How do I link that code to an HTML <input> tag,
so that when the user pastes a link it scrapes the page and displays the data in HTML?

Anastasia Lee says:

August 21, 2017 at 07:29

Hello! Thank you for these lessons! What did I do wrong?
Traceback (most recent call last):
  File "C:\Python\parse1.py", line 2, in <module>
    from PyQt4.QtGui import QApplication
ImportError: DLL load failed: The specified module could not be found.

Ratul Shams says:

August 21, 2017 at 07:29

@sentdex Bro, I've been watching your tutes for a long time and they've helped me loads! <3 Love it! You make the hardest stuff easy! And also show implementations! Can you please give some more tutorials on A.I. for beginners? Would love that mate! Best wishes!

Zhuchang Zhan says:

August 21, 2017 at 07:29

Oh sentdex thank you so much again for making me level up in programming grind. What makes you keep going with all the programming? Too much coding often drives me nuts.

G-FORCE GAMING says:

August 21, 2017 at 07:29

What about XHR? I'm a beginner, by the way. I'm getting None for some sites.

panzach says:

August 21, 2017 at 07:29

for PyQt5:
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage

chemhong says:

August 21, 2017 at 07:29

How can I add a header, just User-Agent, to a URL request, sentdex?
thanks
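
For the plain `urllib.request` case, a minimal sketch: headers can be attached by building a `urllib.request.Request` and passing that to `urlopen` in place of the bare URL. The User-Agent string and the `build_request` helper are hypothetical values chosen for this example, and the sketch stops short of making the network call.

```python
import urllib.request

URL = "https://pythonprogramming.net/parsememcparseface/"
# Hypothetical User-Agent value, chosen only for this example.
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper/0.1)"


def build_request(url, user_agent):
    """Build a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request(URL, USER_AGENT)

# Request stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))

# To actually fetch the page, pass the Request instead of the URL:
# source = urllib.request.urlopen(req).read()
```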

გიორგი კაკულაშვილი says:

August 21, 2017 at 07:29

Can we get the 'inspect element' HTML instead of the 'source code' HTML with Python?

Nevil Dsouza says:

August 21, 2017 at 07:29

Hey guys. New to scraping. I used Scrapy for web scraping. It worked well until an issue arose, maybe because of a Google tag or AJAX. Need help. Here's the project and the issue: https://github.com/ZNClub-PA-ML-AI/Scrapy-Spiders/issues/2

Wojciech Orzechowski says:

August 21, 2017 at 07:29

QtWebKit does not work anymore 🙁 can you update the video for Python 3? I would really appreciate that. You are great!

宏杰李 says:

August 21, 2017 at 07:29

You should try Selenium. It takes less typing, it's user-friendly, and it's more approachable for beginners.

Logan Lee says:

August 21, 2017 at 07:29

I have a question~! How can I make a new window in matplotlib? When I run plt.show(), it just shows the graph in the IPython console instead of opening a new window. I use the Anaconda Spyder Python IDE. Please… tell me how to open a new window~!

Hugo Peralta says:

August 21, 2017 at 07:29

How does QWebPage work behind a proxy?

Miguel Serrano says:

August 21, 2017 at 07:29

Thank you Harrison.
I'm a fan of your python tutorials, I love python.
Could you please make some tutorials about web scraping using Selenium to log in through forms and scrape dynamic data?


D4mations.com
