
Remove HTML tags from a String

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "") 

will work, but entities such as &amp; won't be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
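
To make those two failure modes concrete, here is a minimal sketch that uses only the JDK. The class name RegexStripDemo, the helper strip, and the sample strings are made up for illustration; only the regex itself comes from the question.

```java
// Hypothetical demo class; the names and sample inputs are illustrative only.
class RegexStripDemo {

    // The naive approach from the question: drop everything between '<' and '>'.
    static String strip(String html) {
        return html.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        // Tags are removed, but the entity is left encoded instead of becoming '&'.
        System.out.println(strip("<p>Tom &amp; Jerry</p>")); // prints: Tom &amp; Jerry

        // Plain text that merely sits between two angle brackets is deleted too.
        System.out.println(strip("1 < 2 and 3 > 2"));        // prints: 1  2
    }
}
```

The second call shows why the regex is unsafe on text that is not actually HTML: "< 2 and 3 >" looks like a tag to the pattern, so it vanishes.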

To use Jsoup in a Gradle build, add this dependency: compile 'org.jsoup:jsoup:1.9.2'

Community

Use an HTML parser instead of regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.

See also:

RegEx match open tags except XHTML self-contained tags

What are the pros and cons of the leading Java HTML parsers?

XSS prevention in JSP/Servlet web application


Jsoup is nice, but I encountered some drawbacks with it. I use it to get rid of XSS, so basically I expect plain text input, but some evil person could try to send me some HTML. Using Jsoup, I can remove all HTML, but unfortunately it also collapses runs of spaces into one and removes line breaks (\n characters).
@Ridcully: for that you'd like to use Jsoup#clean() instead.
using clean() will still cause extra spaces and \n chars to be removed. ex: Jsoup.clean("a \n b", Whitelist.none()) returns "a b"
@Zeroows: this fails miserably on <p>Lorem ipsum 1 < 3 dolor sit amet</p>. Again, HTML is not a regular language. It's completely beyond me why everyone keeps trying to throw regex on it to parse parts of interest instead of using a real parser.
Use Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false)); to preserve line breaks.
Amr263

If you're writing for Android you can do this...

androidx.core.text.HtmlCompat.fromHtml(instruction,HtmlCompat.FROM_HTML_MODE_LEGACY).toString()


Awesome tip. :) If you're displaying the text in a TextView, you can drop the .toString() to preserve some formatting, too.
@Branky It doesn't; I have tried. The accepted answer works like a charm.
This is good, but tags are replaced with some bizarre things. I got small squares where there was an image
@BibaswannBandyopadhyay another answer helps getting rid of these characters
Use the androidx.core.text package instead of the legacy android.text.
Chris Marasti-Georg

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>","")

but you will run into issues if the user enters something malformed, like <bhey!</b>.

You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find will be, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure to encode any remaining HTML special characters to keep your output safe.
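
Both the first option (showing the markup literally) and the final encoding step come down to escaping a handful of characters. The following is a minimal sketch under that assumption; escapeHtml and HtmlEscapeDemo are hypothetical names written for this example, not JDK or library methods, and a maintained encoder may be a better fit for production use.

```java
// Hypothetical helper class; not part of the JDK or of any library named here.
class HtmlEscapeDemo {

    // Replace the characters that are special in HTML text and attribute values.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break; // optional, needed inside attributes
                case '\'': out.append("&#39;");  break; // optional, needed inside attributes
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: &lt;b&gt;hey!&lt;/b&gt; &amp; bye
        System.out.println(escapeHtml("<b>hey!</b> & bye"));
    }
}
```

Because every character is handled in a single pass, an ampersand that is emitted as part of an entity is never escaped a second time.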


You also run into issues if there is an unescaped < or > sign inside the HTML node content, e.g. "My age is < a lot's of text > then your age". I think the only 100% reliable way to do this is via some XML DOM interface (like SAX or similar), using node.getText().
CSchulz

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        // try-with-resources closes the reader even if parsing fails
        try (FileReader in = new FileReader("java-new.html")) {
            // the HTML to convert
            Html2Text parser = new Html2Text();
            parser.parse(in);
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

ref : Remove HTML tags from a file to extract only the TEXT


The result of "a < b or b > c" is "a b or b > c", which seems unfortunate.
This worked the best for me. I needed to preserve line breaks. I did so by adding this simple method to the parser: @Override public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) { if (t == HTML.Tag.P || t == HTML.Tag.BR) { s.append('\n'); } }
dfrankow: The mathematical expression a < b or b > c should be written in HTML like this: a &lt; b or b &gt; c
I love that this doesn't have external dependencies.
Serge

I think that the simplest way to filter the HTML tags is:

private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}

Kaitsu

Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());

Jericho was able to parse <br> to a line break. Jsoup and HTMLEditorKit could not do that.
Jericho is very capable of doing this job, used it a lot in owned projects.
Jericho worked like a charm. Thanks for the suggestion. One note: you don't have to create a Segment of the whole string. Source extends Segment, so either works in the Renderer constructor.
Jericho now seems to be a bit dated (the last release was 3.4 in late 2015). However, if it still works well, then it still works well!
George G

The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

It removes line breaks from the text

It converts text <script> into