Python Web Scraping Basics: Parsing HTML with BeautifulSoup

After requesting a URL with Requests and getting the response back, the next step is to extract the target data. Different sites return content in different formats. One is JSON, which json.loads converts directly into Python objects (dicts and lists). Another is XML. The most common is an HTML document, and this article covers how to extract the data you care about from HTML.
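As a rough illustration of the formats mentioned above, the sketch below picks a parsing route from a content type string: JSON goes through json.loads, HTML through BeautifulSoup. The helper name `extract_title` and the sample payloads are assumptions made for this example and are not part of any library. In a real crawler the body and content type would come from a Requests response.

```python
import json

from bs4 import BeautifulSoup


def extract_title(body: str, content_type: str) -> str:
    """Return a title from a response body (hypothetical helper for illustration)."""
    if "json" in content_type:
        # JSON: json.loads turns the text into Python dicts and lists.
        return json.loads(body)["title"]
    # HTML: parse the markup and read the text of the <title> tag.
    soup = BeautifulSoup(body, "html.parser")
    return soup.title.string


# Made-up payloads standing in for two different server responses.
json_body = '{"title": "Alice"}'
html_body = "<html><head><title>Alice</title></head><body></body></html>"

print(extract_title(json_body, "application/json"))
print(extract_title(html_body, "text/html"))
```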

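BeautifulSoup is a Python library for parsing HTML documents. With only a few lines of code you can extract almost anything of interest from an HTML page. It is also tolerant of bad markup: given an incomplete HTML document, for example one with missing closing tags, it can still build a usable parse tree.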

pip install beautifulsoup4

When creating the object, you can pass either a string or a file handle.

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")  # from a file handle
soup = BeautifulSoup("<html>data</html>", "html.parser")  # from a string
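A minimal sketch of both forms, assuming a throwaway file written to a temporary directory (the file name `index.html` and its contents are made up for the example). Naming the parser explicitly keeps the result the same across machines.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# Write the sample markup to a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(markup)

    # Form 1: pass an open file handle; the with-block closes it afterwards.
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

# Form 2: pass the markup as a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

print(soup_from_file.p.string)
print(soup_from_string.p.string)
```

Using a `with` block for the file handle, rather than a bare `open(...)` call, makes sure the file is closed once parsing is done.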

Several parsers are supported:

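# Python's built-in HTML parser
BeautifulSoup(markup, "html.parser")
# lxml's HTML parser (requires the lxml package)
BeautifulSoup(markup, "lxml")
# lxml's XML parser (requires the lxml package)
BeautifulSoup(markup, "xml")
# html5lib, an HTML5 parser (requires the html5lib package)
BeautifulSoup(markup, "html5lib")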
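Because lxml and html5lib are separate packages, a script may want to fall back to the built-in parser when they are missing. The sketch below assumes such a fallback is acceptable for your data; `make_soup` is a hypothetical helper name. bs4 raises `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str) -> BeautifulSoup:
    """Parse with lxml if it is installed, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not available; html.parser ships with Python itself.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Keep in mind that parsers can repair broken markup differently, so the resulting tree may vary slightly depending on which branch runs.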

Below is a piece of malformed HTML that is missing its closing tags:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

prettify() prints the document with standard indentation. The output looks like this:

 <html>
  <head>
   <title>
    The Dormouse's story
   </title>
  </head>
  <body>
   <p class="title">
    <b>
     The Dormouse's story
    </b>
   </p>
   <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>
   <p class="story">
    ...
   </p>
  </body>
 </html>
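The same repair behaviour can be checked on a smaller fragment. This sketch uses the built-in html.parser; other parsers may repair the same input differently (lxml, for example, also wraps the fragment in html and body tags).

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in this fragment.
broken = "<p>unclosed <b>bold"
soup = BeautifulSoup(broken, "html.parser")

# The tree is still navigable: both tags exist and hold the expected text.
print(soup.b.string)
print(soup.p.get_text())

# Serializing the tree writes out the closing tags the input lacked.
repaired = str(soup)
print(repaired)
```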


# the title tag
soup.title
# <title>The Dormouse's story</title>

# name of the title tag
soup.title.name
# 'title'

# text content of the title tag
soup.title.string
# "The Dormouse's story"

# name of the title tag's parent
soup.title.parent.name
# 'head'

# the first p tag in document order
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# class attribute of that p tag (class can hold several values, so a list is returned)
soup.p['class']
# ['title']

# the first a tag in document order
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# href attribute of that a tag
soup.a["href"]
# 'http://example.com/elsie'

# all a tags in the document
soup.find_all('a')
# same result for this document: all a tags inside the second p tag
soup.find_all("p")[1].find_all("a")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# the tag whose id is "link3"
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
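Putting these accessors together, a common task is to collect the text and target of every link. A minimal sketch, using a shortened copy of the sample document so the block runs on its own:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Build one (text, href) pair per <a> tag, in document order.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
for text, href in links:
    print(text, href)

# find() returns a single tag, or None when nothing matches.
tillie = soup.find(id="link3")
missing = soup.find(id="missing")
print(tillie.get_text())
print(missing)
```

Since find() can return None, it is worth checking the result before reading attributes from it when the page structure is not guaranteed.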


type(soup)
# <class 'bs4.BeautifulSoup'>
type(soup.p)
# <class 'bs4.element.Tag'>
type(soup.p.string)
# <class 'bs4.element.NavigableString'>

soup.p.name
# 'p'

soup.p['class']
# ['title']

soup.a['href']
# 'http://example.com/elsie'

soup.p.string
# "The Dormouse's story"
type(soup.p.string)
# <class 'bs4.element.NavigableString'>
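Two details about `.string` are easy to miss: the NavigableString it returns is a subclass of str, and it is None when a tag has more than one child. The sketch below shows both on a made-up two-paragraph fragment, with get_text() as the alternative for mixed content.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = '<p class="title"><b>Story</b></p><p class="story">One <a href="#">two</a> three</p>'
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

# A tag whose only content is a single nested string exposes it through .string.
title_text = first_p.string
print(type(title_text).__name__, isinstance(title_text, str))

# str() gives a plain string that is detached from the parse tree.
plain = str(title_text)

# A tag with several children has no single string, so .string is None.
print(second_p.string)

# get_text() joins all nested text instead.
print(second_p.get_text())
```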

find_all(name, attrs, recursive, string, limit, **kwargs)

# all p tags
soup.find_all("p")

# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were three little sisters; and their names were
#     <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#     <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#     <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>,
# <p class="story">...</p>]

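# p tags with class "title" (a string in second position is matched against the CSS class)
soup.find_all("p", "title")
# equivalent, using the class_ keyword argument
soup.find_all("p", class_="title")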

# [<p class="title"><b>The Dormouse's story</b></p>]

import re
# filter by attribute value
soup.find_all(href="http://example.com/lacie")
soup.find_all(id="link2")

# filter with a regular expression
soup.find_all(href=re.compile("lacie"))

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# True matches any tag that has the attribute, whatever its value
soup.find_all('a', id=True)

# the same search, limited to the descendants of body
soup.body.find_all('a', id=True)

# find returns only the first match
soup.body.find("a")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.body.find("a").get_text()
# 'Elsie'
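These filters can be combined in a single call. The sketch below uses a shortened variant of the sample document (the third link deliberately has no id); the regular expression and the limit value are arbitrary choices for illustration.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name, CSS class, and a regular expression on href, all in one call.
matched = soup.find_all("a", class_="sister", href=re.compile(r"/(elsie|tillie)$"))
matched_names = [a.get_text() for a in matched]
print(matched_names)

# id=True keeps only the tags that carry an id attribute at all.
ids = [a["id"] for a in soup.find_all("a", id=True)]
print(ids)

# limit stops the search after the given number of matches.
first_only = soup.find_all("a", limit=1)
print(len(first_only))
```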
