Beautiful Soup的使用

实例HTML页面

此处使用之前我们抓取到的百度的HTML文本进行抓取，内容如图：

当然，为了格式好看，我们可以将这段代码复制出来，在test文件夹下新建一个名为bs4_test的Python软件包，并右键新建一个名为test.html的文件，将这段代码粘贴。然后在代码 - 重新格式化代码即可。

使用实例

我们在bs4_test文件夹内，新建一个名为test的Python文件，并输入以下代码：

  
from bs4 import BeautifulSoup

with open("./test.html", encoding='utf-8') as file:
    html_doc = file.read()  # read the file

soup = BeautifulSoup(html_doc, 'html.parser')  # parse the file

links = soup.find_all("a")  # find all the links
for link in links:
    print(link.name, link['href'], link.get_text())
    
img = soup.find("img")
print(img["src"])

这里简单解释一下各个部分的作用。

首先是第一行，显然这是个将程序包导入到我们这个python脚本的作用

第二段

  
with open("./test.html", encoding='utf-8') as file:
    html_doc = file.read()  # read the file

使用的是with open语句打开我们的html文档，并且指定编码为utf8，打开的文档名为file，然后将文本内容完全读入到html_doc变量中。用with open语句的好处是当with open代码块运行结束后，他会自动将打开的文件关闭，不会一直处于占用状态，而且当文件读写发生了IO异常时，其能确保文件能够被关闭。而单纯使用open()方法则有可能在产生IO异常时，文件并未关闭。

第三段

  
soup = BeautifulSoup(html_doc, 'html.parser')  # parse the file

将HTML文本内容使用BeautifulSoup进行解析。

第四段

  
links = soup.find_all("a")  # find all the links
for link in links:
    print(link.name, link['href'], link.get_text())

查找所有名为a的文本，并放入以列表形式赋值给links。然后我们使用for-each语句遍历所有links，并显示每条link的变量名，herf属性和文本。

第五段

  
img = soup.find("img")
print(img["src"])

找到图片并输出它的链接。

运行这个脚本，我们得到了以下信息：

a http://news.baidu.com 新闻
a https://www.hao123.com hao123
a http://map.baidu.com 地图
a http://v.baidu.com 视频
a http://tieba.baidu.com 贴吧
a http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 登录
a //www.baidu.com/more/ 更多产品
a http://home.baidu.com 关于百度
a http://ir.baidu.com About Baidu
a http://www.baidu.com/duty/ 使用百度前必读
a http://jianyi.baidu.com/ 意见反馈
//www.baidu.com/img/bd_logo1.png

一般正常开发中，抓取的HTML会有很多div区块，如果我们直接使用find()方法，则会在整个页面中找到符合要求的内容，这有时候会多出很多冗余信息，因此我们可以通过先定位到对应的div区块，再在这个区块内找我们需要的内容。

定位到对应的div区块，我们可以使用定位div的ID来实现。一般来说，每个HTML文本，div区块都有唯一一个对应的ID，一个class可以对应多个div区块。

修改我们刚才的代码如下：

  
from bs4 import BeautifulSoup

with open("./test.html", encoding='utf-8') as file:
    html_doc = file.read()  # read the file

soup = BeautifulSoup(html_doc, 'html.parser')  # parse the file

div_node = soup.find("div", id="u1")

links = div_node.find_all("a")  # replace soup by div_node
for link in links:
    print(link.name, link['href'], link.get_text())

img = div_node.find("img")
print(img["src"])

运行结果如下：

Traceback (most recent call last):
  File "D:\Coder\PythonProject\Web Crawler\test\bs4_test\test.py", line 15, in <module>
    print(img["src"])
TypeError: 'NoneType' object is not subscriptable
a http://news.baidu.com 新闻
a https://www.hao123.com hao123
a http://map.baidu.com 地图
a http://v.baidu.com 视频
a http://tieba.baidu.com 贴吧
a http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 登录
a //www.baidu.com/more/ 更多产品

可以看到这里抛出了一个类型异常，原因就在于我们的HTML文本中，在ID为u1的div区块内，并不存在类型为img的元素，因此img这个变量是一个空值null，当我们尝试对一个空值通过下标访问其属性，肯定是有问题的。

Python爬虫1.3 - Beautiful Soup的使用

Beautiful Soup的使用

实例HTML页面

使用实例

相关文章

Python爬虫0.5 - 各个模块的介绍

Python爬虫1.1 - Requests的使用

Python爬虫1.2 - URL管理器