
Strip HTML from Text JavaScript

Is there an easy way to take a string of html in JavaScript and strip out the html?


Black

If you're running in a browser, then the easiest way is just to let the browser do it for you...

function stripHtml(html)
{
   let tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

Note: as folks have noted in the comments, this is best avoided if you don't control the source of the HTML (for example, don't run this on anything that could've come from user input). For those scenarios, you can still let the browser do the work for you - see Saba's answer on using the now widely-available DOMParser.


Just remember that this approach is rather inconsistent and will fail to strip certain characters in certain browsers. For example, in Prototype.js, we use this approach for performance, but work around some of the deficiencies - github.com/kangax/prototype/blob/…
Remember your whitespace will be messed about. I used to use this method, and then had problems as certain product codes contained double spaces, which ended up as single spaces after I got the innerText back from the DIV. Then the product codes did not match up later in the application.
@Magnus Smith: Yes, if whitespace is a concern - or really, if you have any need for this text that doesn't directly involve the specific HTML DOM you're working with - then you're better off using one of the other solutions given here. The primary advantages of this method are that it is 1) trivial, and 2) will reliably process tags, whitespace, entities, comments, etc. in the same way as the browser you're running in. That's frequently useful for web client code, but not necessarily appropriate for interacting with other systems where the rules are different.
Don't use this with HTML from an untrusted source. To see why, try running strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")
If the HTML contains images (img tags), the browser will request them. That's not good.
Mike Samuel
myString.replace(/<[^>]*>?/gm, '');

Doesn't work for <img src=http://www.google.com.kh/images/srpr/nav_logo27.png onload="alert(42)" if you're injecting via document.write or concatenating with a string that contains a > before injecting via innerHTML.
@PerishableDave, I agree that the > will be left in the second. That's not an injection hazard though. The hazard occurs due to < left in the first, which causes the HTML parser to be in a context other than data state when the second starts. Note there is no transition from data state on >.
@MikeSamuel Did we decide on this answer yet? Naive user here ready to copy-paste.
This also, I believe, gets completely confused if given something like <button onClick="dostuff('>');"></button> Assuming correctly written HTML, you still need to take into account that a greater than sign might be somewhere in the quoted text in an attribute. Also you would want to remove all the text inside of <script> tags, at least.
@AntonioMax, I've answered this question ad nauseam, but to the substance of your question, because security critical code shouldn't be copied & pasted. You should download a library, and keep it up-to-date and patched so that you're secure against recently discovered vulnerabilities and to changes in browsers.
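
The quoted-attribute problem raised in the comments above is easy to reproduce. A minimal sketch, where naiveStrip is only a hypothetical wrapper name around the one-line regex from this answer:

```javascript
// Hypothetical wrapper around the regex from the answer above.
const naiveStrip = (s) => s.replace(/<[^>]*>?/gm, '');

// Simple markup is handled as expected.
console.log(naiveStrip('<b>bold</b> text'));

// A ">" inside a quoted attribute ends the match early,
// so the tail of the attribute leaks into the output.
console.log(naiveStrip('<button onClick="dostuff(\'>\');"></button>'));
```

The second call leaves the fragment `');">` behind, which is why a tokenizer that understands quotes (or a real HTML parser) is the safer choice for anything beyond trusted input.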
Community

Simplest way:

jQuery(html).text();

That retrieves all the text from a string of html.


We always use jQuery for projects since invariably our projects have a lot of Javascript. Therefore we didn't add bulk, we took advantage of existing API code...
You use it, but the OP might not. the question was about Javascript NOT JQuery.
It's still a useful answer for people who need to do the same thing as the OP (like me) and don't mind using jQuery (like me), not to mention that it could have been useful to the OP if they were considering using jQuery. The point of the site is to share knowledge. Keep in mind the chilling effect you might have by chastising useful answers without good reason.
@Dementic shockingly, I find the threads with multiple answers to be the most useful, because often a secondary answer meets my exact needs, while the primary answer meets the general case.
That will not work if some part of the string is not wrapped in an HTML tag, e.g. "Error: Please enter a valid email" will return only "Error:"
Black

I would like to share an edited version of Shog9's accepted answer.

As Mike Samuel pointed out in a comment, that function can execute inline JavaScript. But Shog9 is right when saying "let the browser do it for you..."

So here is my edited version, using DOMParser:

function strip(html){
   let doc = new DOMParser().parseFromString(html, 'text/html');
   return doc.body.textContent || "";
}

Here is the code to test the inline JavaScript:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Also, it does not request resources (like images) on parse:

strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")

It's worth adding that this solution works only in the browser.
This is not strip tags, but more like PHP htmlspecialchars(). Still useful for me.
Note that this also removes whitespace from the beginning of the text.
also, it does not try to parse html using regex
This should be the accepted answer because it's the safest and fastest way to do it.
Black

As an extension to the jQuery method, if your string might not contain HTML (eg if you are trying to remove HTML from a form field)

jQuery(html).text();

will return an empty string if there is no HTML

Use:

jQuery('<p>' + html + '</p>').text();

instead.

Update: As has been pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html. If the value of html could be influenced by an attacker, use a different solution.


Or $("<p>").html(html).text();
This still executes probably dangerous code jQuery('<span>Text :) <img src="a" onerror="alert(1)"></span>').text()
try jQuery("aa<script>alert(1)</script>a").text();
Victor

Converting HTML for Plain Text emailing keeping hyperlinks (a href) intact

The above function posted by hypoxide works fine, but I was after something that would convert HTML created in a web rich-text editor (for example FCKEditor), clearing out all the HTML but leaving the links intact. I wanted both the HTML and the plain-text version so that I could build the correct parts of an SMTP email (both HTML and plain text).

After a long time searching Google, my colleagues and I came up with this, using the regex engine in JavaScript:

str='this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>';
str=str.replace(/<br>/gi, "\n");
// [^>]* keeps the match inside the opening <p> tag; a greedy .* would swallow the paragraph text
str=str.replace(/<p\b[^>]*>/gi, "\n");
str=str.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 (Link->$1) ");
str=str.replace(/<(?:.|\s)*?>/g, "");

the str variable starts out like this:

this string has <i>html</i> code i want to <b>remove</b><br>Link Number 1 -><a href="http://www.bbc.co.uk">BBC</a> Link Number 1<br><p>Now back to normal text and stuff</p>

and then after the code has run it looks like this:-

this string has html code i want to remove
Link Number 1 -> BBC (Link->http://www.bbc.co.uk)  Link Number 1


Now back to normal text and stuff

As you can see, all the HTML has been removed and the links have been preserved, with the hyperlinked text still intact. I have also replaced the <p> and <br> tags with \n (newline character) so that some visual formatting is retained.

To change the link format (eg. BBC (Link->http://www.bbc.co.uk) ) just edit the $2 (Link->$1), where $1 is the href URL/URI and the $2 is the hyperlinked text. With the links directly in body of the plain text most SMTP Mail Clients convert these so the user has the ability to click on them.
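
As an illustration of editing the replacement string, here is a minimal sketch that emits Markdown-style links instead. The sample input and the linksToMarkdown name are assumptions made for this example, and the pattern is slightly tightened with [^>]* so that it stays inside a single tag:

```javascript
// Hypothetical helper: rewrite <a href="...">text</a> as [text](url).
// $1 is the href value and $2 is the link text, as in the answer above.
function linksToMarkdown(html) {
  return html.replace(/<a\b[^>]*href="(.*?)"[^>]*>(.*?)<\/a>/gi, '[$2]($1)');
}

console.log(linksToMarkdown('Visit <a href="http://www.bbc.co.uk">BBC</a> today'));
```

Only double-quoted href values are handled here; single-quoted or unquoted attributes would need a broader pattern.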

Hope you find this useful.


It doesn't handle "&nbsp;"
Janghou

An improvement to the accepted answer.

function strip(html)
{
   var tmp = document.implementation.createHTMLDocument("New").body;
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

This way something running like this will do no harm:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Firefox, Chromium and Internet Explorer 9+ are safe. Opera Presto is still vulnerable. Also, images mentioned in the strings are not downloaded in Chromium and Firefox, which saves HTTP requests.


This is some of the way there, but isn't safe from <script><script>alert();
That doesn't run any scripts here in Chromium/Opera/Firefox on Linux, so why isn't it safe?
My apologies, I must have mistested; I probably forgot to click run again on the jsFiddle.
The "New" argument is superfluous, I think?
According to the specs it's optional nowadays, but it wasn't always.
Karl.S

This should do the job in any JavaScript environment (Node.js included).

    const text = `
    <html lang="en">
      <head>
        <style type="text/css">*{color:red}</style>
        <script>alert('hello')</script>
      </head>
      <body><b>This is some text</b><br/></body>
    </html>`;

    // Remove style tags and content ([\s\S] also matches newlines; *? stops at the first closing tag)
    const stripped = text.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
        // Remove script tags and content
        .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
        // Remove all opening, closing and orphan HTML tags
        .replace(/<[^>]+>/g, '')
        // Remove leading spaces and repeated CR/LF
        .replace(/([\r\n]+ +)+/g, '');

@pstanton could you give a working example of your statement ?
<html><style..>* {font-family:comic-sans;}</style>Some Text</html>
@pstanton I have fixed the code and added comments, sorry for the late response.
please consider reading these caveats: stackoverflow.com/a/1732454/501765
Community

I altered Jibberboy2000's answer to include several <BR /> tag formats, remove everything inside <SCRIPT> and <STYLE> tags, format the resulting HTML by removing multiple line breaks and spaces and convert some HTML-encoded code into normal. After some testing it appears that you can convert most of full web pages into simple text where page title and content are retained.

In the simple example,

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<!--comment-->

<head>

<title>This is my title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style>

    body {margin-top: 15px;}
    a { color: #D80C1F; font-weight:bold; text-decoration:none; }

</style>
</head>

<body>
    <center>
        This string has <i>html</i> code i want to <b>remove</b><br>
        In this line <a href="http://www.bbc.co.uk">BBC</a> with link is mentioned.<br/>Now back to &quot;normal text&quot; and stuff using &lt;html encoding&gt;                 
    </center>
</body>
</html>

becomes

This is my title This string has html code i want to remove In this line BBC (http://www.bbc.co.uk) with link is mentioned. Now back to "normal text" and stuff using <html encoding>

The JavaScript function and test page look like this:

function convertHtmlToText() {
    var inputText = document.getElementById("input").value;
    var returnText = "" + inputText;

    //-- remove BR tags and replace them with line break
    returnText=returnText.replace(/<br>/gi, "\n");
    returnText=returnText.replace(/<br\s\/>/gi, "\n");
    returnText=returnText.replace(/<br\/>/gi, "\n");

    //-- remove P and A tags but preserve what's inside of them
    returnText=returnText.replace(/<p.*>/gi, "\n");
    returnText=returnText.replace(/<a.*href="(.*?)".*>(.*?)<\/a>/gi, " $2 ($1)");

    //-- remove all inside SCRIPT and STYLE tags
    returnText=returnText.replace(/<script.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/script>/gi, "");
    returnText=returnText.replace(/<style.*>[\w\W]{1,}(.*?)[\w\W]{1,}<\/style>/gi, "");
    //-- remove all else
    returnText=returnText.replace(/<(?:.|\s)*?>/g, "");

    //-- get rid of more than 2 multiple line breaks:
    returnText=returnText.replace(/(?:(?:\r\n|\r|\n)\s*){2,}/gim, "\n\n");

    //-- get rid of more than 2 spaces:
    returnText = returnText.replace(/ +(?= )/g,'');

    //-- get rid of html-encoded characters:
    returnText=returnText.replace(/&nbsp;/gi," ");
    returnText=returnText.replace(/&amp;/gi,"&");
    returnText=returnText.replace(/&quot;/gi,'"');
    returnText=returnText.replace(/&lt;/gi,'<');
    returnText=returnText.replace(/&gt;/gi,'>');

    //-- return
    document.getElementById("output").value = returnText;
}

It was used with this HTML:

<textarea id="input" style="width: 400px; height: 300px;"></textarea><br />
<button onclick="convertHtmlToText()">CONVERT</button><br />
<textarea id="output" style="width: 400px; height: 300px;"></textarea><br />

I like this solution because it has treatment of html special characters... but still not nearly enough of them... the best answer for me would deal with all of them. (which is probably what jquery does).
I think /<p.*>/gi should be /<p.*?>/gi.
Note that to remove all <br> tags you could use a good regular expression instead: /<br\s*\/?>/ that way you have just one replace instead of 3. Also it seems to me that except for the decoding of entities you can have a single regex, something like this: /<[a-z].*?\/?>/.
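
A minimal sketch of the consolidated <br> pattern suggested in the comment above. normalizeBreaks is a hypothetical helper name, and the g and i flags are added here so that every occurrence and uppercase <BR> are covered too:

```javascript
// One pattern for <br>, <br/>, <br /> and uppercase variants.
const BR_PATTERN = /<br\s*\/?>/gi;

// Hypothetical helper: turn every line-break tag into a newline.
function normalizeBreaks(html) {
  return html.replace(BR_PATTERN, '\n');
}

console.log(JSON.stringify(normalizeBreaks('a<br>b<br/>c<br />d<BR  />e')));
```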
Nice script. But what about table content? Any idea how it can be displayed?
@DanielGerson, encoding html gets real hairy, real quick, but the best approach seems to be using the he library
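
For cases where pulling in a library is not an option, the entity step can be widened somewhat by hand. The following is only a sketch under stated assumptions: decodeEntities and the small NAMED table are hypothetical, the table covers just a handful of names, and it is not a substitute for a complete decoder such as the he library mentioned above. It does decode numeric references, and it works in a single pass so that "&amp;lt;" becomes "&lt;" rather than "<".

```javascript
// Hypothetical, deliberately tiny table of named entities.
const NAMED = new Map([
  ['nbsp', '\u00A0'], ['amp', '&'], ['quot', '"'],
  ['lt', '<'], ['gt', '>'], ['apos', "'"],
]);

// Decode named, decimal (&#65;) and hex (&#x41;) references in one pass.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1].toLowerCase() === 'x';
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    const replacement = NAMED.get(body.toLowerCase());
    return replacement === undefined ? match : replacement;
  });
}

console.log(decodeEntities('&lt;b&gt; &amp;amp; &#65;&#x42; &unknown;'));
```

Unknown names are left as they are, which makes gaps in the table visible rather than silently dropping text.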
hegemon
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");

This is a regex version, which is more resilient to malformed HTML, like:

Unclosed tags

Some text <img

"<", ">" inside tag attributes

Some text <img alt="x > y">

Newlines

Some <a href="http://google.com">

The code

var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
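
To see how the pattern treats the malformed cases listed above, here is a small sketch that runs each one. stripTags is a hypothetical wrapper name; the regex itself is the one from this answer.

```javascript
// Hypothetical wrapper around the regex from this answer.
function stripTags(html) {
  return html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, '');
}

// Unclosed tag: the "$" alternative lets the match run to the end of the string.
console.log(JSON.stringify(stripTags('Some text <img')));

// ">" inside a quoted attribute: the quoted alternatives consume it as a unit.
console.log(JSON.stringify(stripTags('Some text <img alt="x > y">')));

// Newlines inside a tag: [^>] matches line breaks as well.
console.log(JSON.stringify(stripTags('Some <a\nhref="http://google.com">link</a>')));

// The combined example from the answer.
console.log(JSON.stringify(
  stripTags('<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a')
));
```

Note that a stray ">" in the text is kept, since only sequences that start with "<" are treated as tags.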

How could you flip this to do literally the opposite? I want to use string.replace() on ONLY the text part, and leave any HTML tags and their attributes unchanged.
My personal favourite, I would also add to remove newlines like: const deTagged = myString.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, ''); const deNewlined = deTagged.replace(/\n/g, '');
Anatol

from CSS tricks:

https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/

const originalString = `

Hey that's somthing

`;
const strippedString = originalString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);


This fails to remove what is inside ipt>' then the stripped version will be this: ''. So this is an XSS vulnerability.
You should change the [^<>] with [^>] because a valid tag cannot include a < character, then the XSS vulnerability disappears.
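
The difference between the two character classes mentioned in the comment above can be shown with a nested-tag input. This is a sketch only: the input string and both function names are assumptions made for the illustration.

```javascript
// Variant that refuses "<" inside a tag: /<[^<>]+>/
const stripExcludingLt = (s) => s.replace(/<[^<>]+>/g, '');

// Variant that only refuses ">" inside a tag: /<[^>]+>/
const stripExcludingGtOnly = (s) => s.replace(/<[^>]+>/g, '');

// Assumed attack string: a script tag split around an inner script tag.
const input = '<scr<script>ipt>alert(1)</scr</script>ipt>';

// Removing only the inner tags reassembles a working script element.
console.log(stripExcludingLt(input));

// The broader class swallows the outer fragment too, so no tag survives.
console.log(stripExcludingGtOnly(input));
```

Neither variant is a sanitizer. The second one merely avoids this particular reassembly, and its output should still be escaped before being placed into a page.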
aWebDeveloper

The code below allows you to retain some HTML tags while stripping all others:

function strip_tags(input, allowed) {

  allowed = (((allowed || '') + '')
    .toLowerCase()
    .match(/<[a-z][a-z0-9]*>/g) || [])
    .join(''); // making sure the allowed arg is a string containing only tags in lowercase (<a><b><c>)

  var tags = /<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
      commentsAndPhpTags = /<!--[\s\S]*?-->|<\?(?:php)?[\s\S]*?\?>/gi;

  return input.replace(commentsAndPhpTags, '')
      .replace(tags, function($0, $1) {
          return allowed.indexOf('<' + $1.toLowerCase() + '>') > -1 ? $0 : '';
      });
}

You should quote the source (phpjs). If you use the allowed param you are vulnerable to XSS: stripTags('<p onclick="alert(1)">mytext</p>', '<p>') returns <p onclick="alert(1)">mytext</p>
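
One way to narrow the attribute problem described in the comment above is to re-emit allowed tags without any attributes. This is a sketch under stated assumptions, not the phpjs implementation: stripTagsDropAttributes is a hypothetical name, it takes an array of tag names rather than a '<a><b>' string, and it still shares the regex limitation around ">" inside quoted attributes.

```javascript
// Hypothetical variant: keep only the listed tag names and drop all their attributes.
function stripTagsDropAttributes(input, allowedNames) {
  const allowed = new Set(allowedNames.map((name) => name.toLowerCase()));
  return input.replace(/<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi, (match, slash, name) => {
    const lower = name.toLowerCase();
    // Rebuild the tag from its name alone, so onclick and friends never survive.
    return allowed.has(lower) ? `<${slash}${lower}>` : '';
  });
}

console.log(stripTagsDropAttributes('<p onclick="alert(1)">my<b>text</b></p>', ['p']));
```

Dropping every attribute also removes harmless ones such as href, so a real allowlist would have to validate attributes per tag, which is better left to a maintained sanitizer library.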
FrigginGlorious

I just needed to strip out the <a> tags and replace them with the text of the link.

This seems to work great.

htmlContent= htmlContent.replace(/<a.*href="(.*?)">/g, '');
htmlContent= htmlContent.replace(/<\/a>/g, '');

This only applies for a tags and needs tweaking for being a wide function.
Yeah, plus an anchor tag could have many other attributes such as the title="...".
basarat

The accepted answer mostly works fine; however, in IE, if the html string is null you get "null" (instead of ''). Fixed:

function strip(html)
{
   if (html == null) return "";
   var tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

Community

A safer way to strip the html with jQuery is to first use jQuery.parseHTML to create a DOM, ignoring any scripts, before letting jQuery build an element and then retrieving only the text.

function stripHtml(unsafe) {
    return $($.parseHTML(unsafe)).text();
}

Can safely strip html from:

<img src="unknown.gif" onerror="console.log('running injections');">

And other exploits.

nJoy!


ianaz

With jQuery you can simply retrieve it using

$('#elementID').text()

MarekJ47

I have created a working regular expression myself:

str=str.replace(/(<\?[a-z]*(\s[^>]*)?\?(>|$)|<!\[[a-z]*\[|\]\]>|<!DOCTYPE[^>]*?(>|$)|<!--[\s\S]*?(-->|$)|<[a-z?!\/]([a-z0-9_:.])*(\s[^>]*)?(>|$))/gi, ''); 

Developer

simple 2 line jquery to strip the html.

 var content = "<p>checking the html source&nbsp;</p><p>&nbsp;" +
  "</p><p>with&nbsp;</p><p>all</p><p>the html&nbsp;</p><p>content</p>";

 var text = $(content).text();//It gets you the plain text
 console.log(text);//check the data in your console

 cj("#text_area_id").val(text);//set your content to text area using text_area_id

math2001

Using Jquery:

function stripTags(textToEscape) {
    return $('<p></p>').html(textToEscape).text();
}

Mike Datsko

The input element supports only single-line text:

The text state represents a one line plain text edit control for the element's value.

function stripHtml(str) {
  var tmp = document.createElement('input');
  tmp.value = str;
  return tmp.value;
}

Update: this works as expected

function stripHtml(str) {
  // Remove some tags
  str = str.replace(/<[^>]+>/gim, '');

  // Remove BB code
  str = str.replace(/\[(\w+)[^\]]*](.*?)\[\/\1]/g, '$2 ');

  // Remove html and line breaks
  const div = document.createElement('div');
  div.innerHTML = str;

  const input = document.createElement('input');
  input.value = div.textContent || div.innerText || '';

  return input.value;
}
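
The BB code step in the function above does not depend on the DOM, so it can be tried on its own. A minimal sketch, where stripBbCode is a hypothetical name and the sample input is an assumption:

```javascript
// Hypothetical helper isolating the BB code regex from the function above.
// \1 requires the closing tag to repeat the opening tag name, and
// '$2 ' keeps the inner text followed by a single space.
function stripBbCode(str) {
  return str.replace(/\[(\w+)[^\]]*](.*?)\[\/\1]/g, '$2 ');
}

console.log(JSON.stringify(stripBbCode('[b]bold[/b] and [url=http://x]link[/url]')));
```

Because each replacement appends a space, the result can contain doubled spaces that may need collapsing afterwards.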

Doesn't work, please always mention the browser you are using when posting an answer. This is inaccurate and won't work in Chrome 61. Tags are just rendered as a string.