Extracting text from an HTML file using Python

python html text html-content-extraction

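I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I had pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.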
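The entity behaviour asked for here can be checked with the standard library alone. A minimal sketch, assuming Python 3.4 or later, where `html.unescape` is available; the sample string is made up for illustration:

```python
import html

# Hypothetical sample: numeric and named character references as they
# might appear in HTML source.
source = "It&#39;s a &quot;test&quot; &amp; nothing more"

# unescape() turns character references into the characters they stand for.
decoded = html.unescape(source)
print(decoded)
```

This only decodes references; it does not remove tags, so it would be one step in a larger pipeline.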

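Related questions:

Filter out HTML tags and resolve entities in Python

Convert XML/HTML entities into a Unicode string in Python

It seems people have found my NLTK answer (fairly recent) extremely useful for quite some time, so you might want to consider changing the accepted answer. Thanks!

I never thought I'd come across a question asked by the author of my favorite blog! The Endeavour!

@Shatu Now that your solution is no longer valid, you might want to delete your comment. Thanks! ;)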

MattDMo

The best piece of code I have found for extracting text without getting JavaScript or other unwanted content:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4

What if we want to select a particular line, say line 3?
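One possible way to do that, sketched on a hypothetical `text` value standing in for the cleaned output of the code above:

```python
# Stand-in for the cleaned text produced by the answer's code.
text = "Headline\nByline\nFirst paragraph\nSecond paragraph"

lines = text.splitlines()
# Lists are zero-indexed, so the third line lives at index 2.
third_line = lines[2] if len(lines) >= 3 else None
print(third_line)
```

The length check is there because short pages may not have a third line at all.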

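The bit that kills the scripts is a lifesaver!!

After going through a lot of Stack Overflow answers, I feel this is the best option for me. One problem I ran into was that lines were joined together in some cases. I was able to overcome it by adding a separator in the get_text call: text = soup.get_text(separator=' ')

Instead of soup.get_text() I used soup.body.get_text(), so that I don't get any text from the <head> element, such as the title.

For Python 3, use from urllib.request import urlopen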

Alireza Savand

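html2text is a Python program that does a pretty good job at this.

But it's GPL 3.0, which means it may be incompatible.

Amazing! Its author is the late Aaron Swartz (RIP).

Did anybody find an alternative to html2text because of the GPL 3.0?

I tried both html2text and nltk, but they didn't work for me. I ended up going with Beautiful Soup 4, which works beautifully (no pun intended).

I know this is not (at all) the place, but I followed the links to Aaron's blog, GitHub profile and projects, and found myself very disturbed by the fact that there is no mention of his death, and that everything is of course frozen in 2012, as if time had stopped or he had taken a very long vacation. Very disturbing.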

Shatu

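Note: NLTK no longer supports the clean_html function

Original answer below, with an alternative in the comments section.

Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily I came across NLTK. It works magically.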

import nltk   
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

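Sometimes that is enough :)

I want to upvote this a thousand times. I was stuck in regex hell, but now I see the wisdom of NLTK.

Apparently, clean_html is not supported anymore: github.com/nltk/nltk/commit/…

Importing a heavy library like nltk for such a simple task would be too much

@alexanderlukanin13 From the source: raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")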

xperroni

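Found myself facing the same problem today. I wrote a very simple HTML parser that strips all markup from the incoming content and returns the remaining text with only a minimum of formatting.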

from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

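This seems to be the most straightforward way of doing this in Python (2.7) using only the default modules. Which is really silly, as this is such a commonly needed thing and there is no good reason why the default HTMLParser module has no parser for it.

I don't think this will convert HTML character references to Unicode, right? For example, &amp; won't be converted to &, right?

For Python 3, use from html.parser import HTMLParser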
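On the entity question, Python 3's `html.parser.HTMLParser` can decode character references itself. A minimal sketch, assuming Python 3.5 or later (where `convert_charrefs` defaults to `True`); the class name `TextOnly` and the helper `strip_tags` are hypothetical names for illustration, not part of the answer above:

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes; character references are decoded by the parser."""

    def __init__(self):
        # convert_charrefs=True is already the default since Python 3.5;
        # it is spelled out here to make the behaviour explicit.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # With convert_charrefs=True, data arrives with &amp; etc. decoded.
        self.parts.append(data)


def strip_tags(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


result = strip_tags("<p>Fish &amp; chips, it&#39;s <b>good</b></p>")
print(result)
```

This sketch keeps script and style content and adds no line breaks, so it only illustrates the entity handling.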

Floyd

I know there are a lot of answers already, but the most elegant and pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ' '.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

To avoid a warning, specify the parser for BeautifulSoup to use: text = ''.join(BeautifulSoup(some_html_string, "lxml").findAll(text=True))

You can use the stripped_strings generator to avoid excessive whitespace, i.e. clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

I would recommend ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings) with at least one space, otherwise a string such as Please click <a href="link">text</a> to continue is rendered as Please clicktextto continue

bit4

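Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates character references (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-HTML inverse converter.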

"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)

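Python 3 version: gist.github.com/Crazometer/af441bc7dc7353d41390a59f20f07b51

In get_text, ''.join should be ' '.join. There should be a space there, otherwise some of the text will run together.

Also, this won't catch all of the text unless you include other text container tags such as H1, H2 ..., span, etc. I had to tweak it for better coverage.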
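One reading of the comment above is that headings and other block-level tags also need line breaks around them. A minimal Python 3 sketch under that assumption; the tag sets and the names `BlockAwareText` and `html_to_text` are illustrative choices, not the gist's actual code:

```python
import re
from html.parser import HTMLParser

# Assumed set of block-level tags; extend it to suit the markup at hand.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class BlockAwareText(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.buf = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.buf.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.buf.append("\n")

    def handle_data(self, data):
        # Inline tags such as <span> need no special case: their text
        # arrives here like any other text node.
        if not self.skip_depth:
            self.buf.append(data)


def html_to_text(markup):
    parser = BlockAwareText()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.buf)
    # Collapse whitespace within each line and drop empty lines.
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


sample = (
    "<h1>Title</h1><script>var x = 1;</script>"
    "<p>Fish &amp; <span>chips</span></p><div>Done</div>"
)
print(html_to_text(sample))
```

Tracking a depth counter rather than a boolean is a small design choice so that an unmatched closing tag cannot leave the parser stuck in the hidden state.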

GeekTantra

You can also use the html2text method from the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

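To install stripogram, run sudo easy_install stripogram

According to its PyPI page, this module is deprecated: "Unless you have some historical reason for using this package, I'd advise against it!"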

spatel4140

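I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.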

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

Nuncjo

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

Hodza

If you need more speed and less accuracy, you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

PyNEwbie

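PyParsing does a great job. The PyParsing wiki was shut down, so here is another location with examples of PyParsing in use (example link). One reason to invest a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and the entity issues are not that hard to deal with: you can convert them before you run BeautifulSoup.

Good luck

The link is dead or has gone bad.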

Andrew

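This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

import subprocess
import tempfile

# html_source is assumed to be a str holding the page markup
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
    fname = f.name
proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
text = proc.stdout.read()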

Ponkadoodle

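Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console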

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)

Note: HTMLError and HTMLParserError should both read HTMLParseError. This works, but does a bad job of maintaining line breaks.

Li Yingjun

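I recommend a Python package called goose-extractor. Goose will try to extract the following information:

Main text of an article
Main image of the article
Any YouTube/Vimeo movies embedded in the article
Meta description
Meta tags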

Pravitha V

Install html2text using

pip install html2text

then,

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

speedplane

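Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text: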

import copy
import re

import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

YakovK

Another non-Python solution: LibreOffice:

soffice --headless --invisible --convert-to txt input1.html

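The reason I prefer this one over the other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats...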

rox

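Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

Seems to work for me too, but they don't recommend using it for this purpose: "This function is a security-focused function whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page." -> bleach.readthedocs.io/en/latest/clean.html#bleach.clean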

Vim

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good

kodlan

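I had a similar question and actually used one of the answers with BeautifulSoup. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to remove unnecessary whitespace manually. But it seems to work much faster than the BeautifulSoup solution.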

from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='')
    text = " ".join(text.split())  # collapse every run of whitespace into a single space
    return text

John Lucas

Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

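This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.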

racitup

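@PeYoTIL's answer, which uses BeautifulSoup and eliminates style and script content, did not work for me. I tried it with decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. Available at this gist with a test doc embedded.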

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

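Thanks, this answer is underrated. For those of us who want a clean text representation that behaves more like a browser (ignoring newlines, and only taking paragraphs and line breaks into consideration), BeautifulSoup's get_text simply doesn't cut it.

@jrial glad you found it useful, and thanks for the contribution. For anyone else, the linked gist has been enhanced quite a bit. What the OP seems to allude to is a tool that renders HTML to text, much like a text-based browser such as lynx. That's what this solution attempts. What most people are contributing are just text extractors.

Indeed, completely underrated, wow, thank you! Will check the gist as well.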

u-phoria

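I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, so the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run or deploy in a Docker container, and from there can be accessed via Python bindings.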

David Fraga

In a simple way

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds every part of html_text that starts with '<' and ends with '>', and replaces everything it finds with an empty string
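A small illustration of what that substitution does and does not handle, run on a made-up input string rather than a real page. It is only a sketch of the regex's behaviour:

```python
import re

# Hypothetical input with a script block and a character reference.
html_text = "<p>Tom &amp; Jerry</p><script>var x = 1;</script>"

# Same pattern as above: remove anything between '<' and the next '>'.
text_filtered = re.sub(r"<(.*?)>", "", html_text)
print(text_filtered)
```

In this sample the tags disappear, but the script body stays in the output and `&amp;` is left undecoded, which matches the concerns raised elsewhere in this thread about regex-only approaches.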

Community

In Python 3.x you can do it in a very easy way by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers to it.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

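Now you can print the body variable and it will be in plain-text format :) If it is good enough for you, it would be nice to select it as the accepted answer.

This doesn't convert anything.

This shows you how to extract a text/plain part from an email, if someone else put one there. It doesn't do anything to convert HTML into plain text, and does nothing remotely useful if you are trying to convert HTML from, say, a website.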
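The point of the last comment can be seen in a self-contained sketch that uses only the standard `email` package. The message here is built locally for illustration (no IMAP server involved), and it assumes the sender included a text/plain alternative:

```python
from email.message import EmailMessage

# Build a hypothetical two-part message: a plain-text body plus an HTML
# alternative, as a mail client would typically send.
msg = EmailMessage()
msg["Subject"] = "Demo"
msg.set_content("Hello in plain text.")
msg.add_alternative("<p>Hello in <b>HTML</b>.</p>", subtype="html")

# get_body() returns the best-matching part, or None if none matches.
plain_part = msg.get_body(preferencelist=("plain",))
body = plain_part.get_content() if plain_part is not None else None
print(body)
```

Nothing is converted here: the plain text comes back only because it was already in the message. For an HTML-only message, `plain_part` would be `None`.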

troymyname00

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

saigopi.me

You can extract just the text from HTML with BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

Uri Goren

While a lot of people mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

hello world
I love you

Here's a snippet I came up with; you can customize it to your specific needs, and it works like a charm

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

Mike Q

Another example using BeautifulSoup4 in Python 2.7.9+

Includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract() 
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

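Explanation:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break into lines and remove leading and trailing space on each, then break multi-headlines into a line each: chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join, drop blank lines, and finally return the result encoded as UTF-8.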
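The whitespace clean-up steps can be tried on their own, without fetching a page. A minimal sketch on a made-up string; the helper name `tidy` is hypothetical:

```python
def tidy(text):
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split on two consecutive spaces, so "A  B" becomes two chunks.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only non-empty chunks, one per output line.
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Health  \n\n\n  Vitamin D  may help   \n"
print(tidy(raw))
```

Note that the split string is two spaces, not one; splitting on a single space would put every word on its own line.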

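Notes:

On some systems this will fail for https:// connections because of an SSL issue; you can turn off verification to work around that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/

Python < 2.7.9 may have some issues running this

text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.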
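The answer targets Python 2.7, but the last note matters even more on Python 3, where wrapping encoded bytes in `str()` yields the bytes literal's representation rather than the text. A short sketch, assuming Python 3 and a made-up sample string:

```python
text = "café"

encoded = text.encode("utf-8")        # a bytes object
as_str = str(encoded)                 # the repr of the bytes, not the text
round_trip = encoded.decode("utf-8")  # decoding recovers the original text

print(as_str)
print(round_trip)
```

On Python 3, returning `text` directly (it is already a `str`) avoids the problem.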

Waqar Detho

I am doing it something like this.

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text

I am using Python 3.4 and this code works fine for me.

The text will contain HTML tags
