Can we use XPath with BeautifulSoup?

python web-scraping xpath beautifulsoup urllib

I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tags whose class is 'empformbody':

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)

soup.findAll('td',attrs={'class':'empformbody'})

In the code above we can use findAll to get the tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If so, please provide example code.

Martijn Pieters

No, BeautifulSoup by itself does not support XPath expressions.

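An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode in which it tries to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.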

Once you have parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
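
As a minimal offline sketch of the same idea, the snippet below parses an inline HTML string instead of a live URL and selects the td cells from the question with an XPath expression. The markup and the names sample_html and cell_texts are made up for illustration; only the class name 'empformbody' comes from the question.

```python
from io import StringIO

from lxml import etree

# Hypothetical sample markup standing in for the page in the question.
sample_html = """
<table>
  <tr><td class="empformbody">Alice</td><td class="other">x</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
"""

# Parse with the forgiving HTML parser, exactly as in the answer above.
tree = etree.parse(StringIO(sample_html), etree.HTMLParser())

# Select every td whose class attribute is exactly "empformbody".
cells = tree.xpath('//td[@class="empformbody"]')
cell_texts = [cell.text for cell in cells]
print(cell_texts)
```

Here the XPath string plays the role of the xpathselector placeholder in the answer's code.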

There is also a dedicated lxml.html() module with additional functionality.

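Note that in the example above I passed the response object directly to lxml, because having the parser read directly from the stream is more efficient than first reading the response into a large string. To do the same with the requests library, set stream=True and pass in the response.raw object after enabling transparent transport decompression: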

import lxml.html
import requests

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, which makes your search for td.empformbody much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass
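
One reason the CSS form is convenient: a CSS class selector matches any element whose class list contains the token, whereas the XPath test @class="..." compares the whole attribute string. The sketch below illustrates the difference with plain lxml XPath on made-up markup; the hand-written token-matching expression is one common way to express this in XPath 1.0, not necessarily the exact expression CSSSelector generates.

```python
from lxml import etree

# Hypothetical markup: the first cell carries two classes.
sample_html = (
    '<table><tr>'
    '<td class="empformbody wide">A</td>'
    '<td class="empformbody">B</td>'
    '</tr></table>'
)
root = etree.HTML(sample_html)

# Exact string comparison: misses the cell with class="empformbody wide".
exact = root.xpath('//td[@class="empformbody"]')

# Token match: pad the class list with spaces and look for " empformbody ".
token = root.xpath(
    '//td[contains(concat(" ", normalize-space(@class), " "), " empformbody ")]'
)

exact_texts = [td.text for td in exact]
token_texts = [td.text for td in token]
print(exact_texts, token_texts)
```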

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass
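
A small self-contained sketch of that select() call, assuming bs4 (Beautiful Soup 4) is installed. The markup is hypothetical; the table id foobar and the class empformbody are taken from the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first table has id="foobar".
sample_html = """
<table id="foobar">
  <tr><td class="empformbody">Alice</td><td>skip</td></tr>
  <tr><td class="empformbody">Bob</td></tr>
</table>
<table id="other"><tr><td class="empformbody">Carol</td></tr></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The selector restricts matches to empformbody cells inside table#foobar.
names = [cell.get_text() for cell in soup.select("table#foobar td.empformbody")]
print(names)
```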

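Thanks a lot, Pieters. I got two things from your code: 1. a clarification that we can't use XPath with BS; 2. a nice example of how to use lxml. Can we find the statement "we can't use XPath with BS" in writing in some documentation? We should be able to show some proof to anyone who asks for clarification, right?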

It's hard to prove a negative; the BeautifulSoup 4 documentation has a search function, and there are no hits for "xpath".

I tried running your code above but got the error "name 'xpathselector' is not defined".

@Zvi The code does not define an XPath selector; I meant it to be read as "use your own XPath expression here".

Leonard Richardson

I can confirm that there is no XPath support within Beautiful Soup.

Note: Leonard Richardson is the author of Beautiful Soup, as you will see if you click through to his user profile.

It would be very nice to be able to use XPath within BeautifulSoup.

So what is the alternative?

@leonard-richardson It is 2021 now; are you still confirming that BeautifulSoup has no XPath support?

wordsforthewise

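As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to get something from an XPath, including using Selenium. However, here is a solution that works in either Python 2 or 3: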

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

I used this as a reference.

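One warning: I've noticed that if there is something outside the root (such as a \n outside the outer <html> tags), then referencing XPaths from the root will not work and you have to use relative XPaths. lxml.de/xpathxslt.html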

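Martijn's code no longer works properly (it is 4+ years old by now...); the etree.parse() line prints to the console and doesn't assign the value to the tree variable.

That's quite a claim. I certainly cannot reproduce that, and it would not make any sense. Are you sure you are using Python 2 to test my code, or have you translated the urllib2 library use to Python 3's urllib.request?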

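Yes, that may be the case: I was using Python 3 when I wrote that and it didn't work as expected. I just tested it, and yours works with Python 2, but Python 3 is much preferred since Python 2 is being sunset (no longer officially supported) in 2020.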

Absolutely agree, but the question here uses Python 2.

657784512

BeautifulSoup has a function named findNext that searches forward from the current element, starting with its children, so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')

The code above can imitate the following XPath:

div[@class="class_value"]/div[@id="id_value"]
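
The imitation is only approximate: find_next (the bs4 spelling of findNext) walks forward through everything that follows the current element in document order, so it can also match an element that lies outside the starting tag, which an XPath child step never does. A minimal sketch of that difference on hypothetical markup, assuming bs4 is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the div with class_value is a sibling of #father,
# not one of its children.
sample_html = (
    '<div id="father"><p>no nested div here</p></div>'
    '<div class="class_value"><div id="id_value"><a href="/x">x</a></div></div>'
)
soup = BeautifulSoup(sample_html, "html.parser")
father = soup.find("div", {"id": "father"})

# find_next keeps going past the end of `father` and matches the sibling div.
following = father.find_next("div", {"class": "class_value"})

# recursive=False limits the search to direct children, like one XPath child step.
child_only = father.find("div", {"class": "class_value"}, recursive=False)

print(following is not None, child_only)
```

If strict parent/child semantics matter, find(..., recursive=False) or a CSS child combinator in select() is the closer match.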

Deepak rayathurai

from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))

The above combines the Soup object with lxml, so values can be extracted using XPath.
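
The same round trip can be tried without a local file. This sketch uses an inline string with made-up content; only the id BGINP01_S1 and the XPath come from the snippet above. It assumes both bs4 and lxml are installed.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup shaped to match the XPath used in the answer.
sample_html = (
    '<div id="BGINP01_S1"><section><div>'
    '<font>hello</font>'
    '</div></section></div>'
)

# Parse with BeautifulSoup, then hand the serialized markup to lxml.
soup = BeautifulSoup(sample_html, "html.parser")
dom = etree.HTML(str(soup))

texts = dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()')
print(texts)
```

Note that the document is parsed twice here, once by each library, so for large pages parsing directly with lxml is likely cheaper.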

Oleksandr Panchenko

When you use lxml, it is all simple:

import lxml.html

# html is assumed to hold the page source as a string
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

But with BeautifulSoup (BS4) it is simple too:

First, remove the "//" and "@".

Second, add an asterisk before the "=".

Try this magic:

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')

As you can see, this does not support sub-tags, so I removed the "/@href" part.

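select() is for CSS selectors; it is not XPath at all. As for "As you can see, this does not support sub-tags": I'm not sure whether that was true at the time, but it certainly isn't now.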
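
To illustrate the point, here is a rough sketch of how the XPath //a[@class="shared-components"]/@href can be approximated with select(): CSS selects the elements (including through child combinators), and the attribute is then read from each Tag. The markup is hypothetical and bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one matching link and one non-matching link.
sample_html = (
    '<div id="nav">'
    '<a class="shared-components" href="/a">A</a>'
    '<a class="other" href="/b">B</a>'
    '</div>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# CSS selectors pick elements; the attribute value is read from each Tag.
hrefs = [a["href"] for a in soup.select("a.shared-components")]

# Child combinators work as well: only <a> tags directly under div#nav.
child_hrefs = [a["href"] for a in soup.select("div#nav > a.shared-components")]

print(hrefs, child_hrefs)
```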

Nikola

I searched through their docs and there seems to be no XPath option.

Also, as you can see here on a similar question on SO, the OP asks for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.

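Yes, actually until now I have used Scrapy, which uses XPath to fetch the data inside tags. It is very handy and makes fetching data easy, but I need to do the same with BeautifulSoup, so I am looking into it.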

dabingsou

Maybe you can try the following without XPath:

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''
<html>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))

David A

This is a pretty old thread, but there is now a workaround that may not have been in BeautifulSoup at the time.

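Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the xpath /rss/channel/title, and retrieve its contents. It is not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.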

from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()

I believe this only finds child elements. Is XPath something different?
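
On that question, a hedged sketch: in bs4, attribute access such as rss_obj.rss.channel.title is shorthand for find(), which returns the first matching descendant rather than a strict child, so it behaves more like following XPath's descendant axis and taking the first hit at each step. The feed text below is made up, and the 'xml' parser assumes lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical feed with a channel title and one item title.
rss_text = (
    '<rss version="2.0"><channel>'
    '<title>Feed title</title>'
    '<item><title>First item</title></item>'
    '</channel></rss>'
)
rss_obj = BeautifulSoup(rss_text, "xml")

# The dotted path from the answer above.
feed_title = rss_obj.rss.channel.title.get_text()

# Skipping levels also works, because each step searches all descendants.
first_title_anywhere = rss_obj.title.get_text()

# find_all returns every match instead of only the first one.
all_titles = [t.get_text() for t in rss_obj.find_all("title")]

print(feed_title, first_title_anywhere, all_titles)
```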

Γιωργος Αλεξανδρου

Use soup.find(class_='myclass')
