
How to escape text for regular expression in Java

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.


Yannick Blondeau

Since Java 1.5, yes:

Pattern.quote("$5");
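For the "$5" case from the question, a minimal sketch of the difference quoting makes. The class and method names are made up for illustration; only Pattern.quote, Pattern.compile and Matcher.find are real API.

```java
import java.util.regex.Pattern;

class QuoteBasics {
    // Treats userText as a regex: "$5" then means "end of input, followed by 5".
    static boolean containsAsRegex(String input, String userText) {
        return Pattern.compile(userText).matcher(input).find();
    }

    // Treats userText as plain text by quoting it first.
    static boolean containsLiteral(String input, String userText) {
        return Pattern.compile(Pattern.quote(userText)).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsAsRegex("It costs $5", "$5")); // false
        System.out.println(containsLiteral("It costs $5", "$5")); // true
    }
}
```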

Please note that this doesn't escape the string itself, but wraps it in \Q and \E. This may lead to unexpected results, for example Pattern.quote("*.wav").replaceAll("*",".*") will result in \Q.*.wav\E and not .*\.wav, as you might expect.
I just wanted to point out that this way of escaping also applies to expressions that you introduce afterwards. This may be surprising. If you do "mouse".toUpperCase().replaceAll("OUS","ic") it will return MicE. You wouldn't expect it to return MICE, because you didn't apply toUpperCase() to ic. In my example, quote() is applied to the .* inserted by replaceAll() as well. You have to do something else, perhaps .replaceAll("*","\\E.*\\Q") would work, but that's counterintuitive.
@Parameleon: The best solution to the corresponding problem is to use a split-map-mkString method. ".wav".split("\\.").map(Pattern.quote).mkString(".").r
@Paramaleon If it did work by adding individual escapes, your initial example still wouldn't do what you wanted...if it escaped characters individually, it would turn *.wav into the regex pattern \*\.wav, and the replaceAll would turn it into \.*\.wav, meaning it would match files whose name consists of an arbitrary number of periods followed by .wav. You'd most likely have needed to replaceAll("\\*", ".*") if they'd gone with the more fragile implementation that relies on recognizing all possible active regex characters and escaping them individually...would that be so much easier?
@Paramaeleon: the use case is "*.wav".replaceAll(Pattern.quote("*"), ".*").
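The split-and-quote idea from the comments can be sketched in plain Java roughly like this. It assumes that '*' is the only wildcard and that everything else should match literally; GlobToRegex and globToRegex are hypothetical names, not part of any library.

```java
import java.util.regex.Pattern;

class GlobToRegex {
    // Converts a glob where '*' means "any run of characters" into a regex.
    // Each piece between the stars is quoted, so '.' and friends stay literal.
    static String globToRegex(String glob) {
        // limit -1 keeps trailing empty pieces, so "a*" still ends with ".*"
        String[] pieces = glob.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            if (i > 0) {
                regex.append(".*"); // one wildcard per '*' in the glob
            }
            if (!pieces[i].isEmpty()) {
                regex.append(Pattern.quote(pieces[i]));
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = globToRegex("*.wav");
        System.out.println(regex);                     // .*\Q.wav\E
        System.out.println("song.wav".matches(regex)); // true
        System.out.println("songXwav".matches(regex)); // false
    }
}
```

Quoting the pieces before the wildcard is inserted avoids the problem described above, where the inserted .* ends up inside the quoted region.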
fabian

The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:

s.replaceFirst(Pattern.quote("text to replace"), 
               Matcher.quoteReplacement("replacement text"));

Specifically, Pattern.quote replaces special characters in regex search strings, like .|+() etc, and Matcher.quoteReplacement replaces special characters in replacement strings, like \1 for backreferences.
I don't agree. Pattern.quote wraps its argument with \Q and \E. It does not escape special characters.
Matcher.quoteReplacement("4$&%$") produces "4\$&%\$". It escapes the special characters.
In other words: quoteReplacement only cares about the two symbols $ and \, which are special in replacement strings ($1 is a group reference, and \ escapes the character that follows it). It therefore must not be used to escape/quote a regex.
Awesome. Here is an example where we want to replace $Group$ with T$UYO$HI. The $ symbol is special both in the pattern and in the replacement: "$Group$ Members".replaceFirst(Pattern.quote("$Group$"), Matcher.quoteReplacement("T$UYO$HI"))
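To see what each of the two methods actually returns, here is a small sketch built around the $Group$ example. The wrapper method replaceFirstLiteral is a made-up name for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteBothSides {
    // Replaces the first literal occurrence of target with the literal replacement.
    static String replaceFirstLiteral(String input, String target, String replacement) {
        return input.replaceFirst(Pattern.quote(target),
                                  Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("$Group$"));             // \Q$Group$\E
        System.out.println(Matcher.quoteReplacement("T$UYO$HI")); // T\$UYO\$HI
        System.out.println(
            replaceFirstLiteral("$Group$ Members", "$Group$", "T$UYO$HI"));
        // T$UYO$HI Members
    }
}
```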
Androidme

It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:

Pattern.compile(textToFormat, Pattern.LITERAL);

It's especially nice because you can combine it with Pattern.CASE_INSENSITIVE
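A short sketch of that combination, assuming you want a case-insensitive "contains" check on user-supplied text. The helper name containsIgnoreCase is hypothetical; the two flags are real constants of java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

class LiteralFlagDemo {
    // True if haystack contains needle as plain text, ignoring case.
    static boolean containsIgnoreCase(String haystack, String needle) {
        Pattern p = Pattern.compile(needle, Pattern.LITERAL | Pattern.CASE_INSENSITIVE);
        return p.matcher(haystack).find();
    }

    public static void main(String[] args) {
        System.out.println(containsIgnoreCase("Total: US$5 (approx.)", "us$5")); // true
        System.out.println(containsIgnoreCase("Total: US5", "us$5"));            // false
        System.out.println(containsIgnoreCase("a+b", "A+B"));                    // true
    }
}
```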
Alex Shesterov

I think what you're after is \Q$5\E. Also see Pattern.quote(s), introduced in Java 5.

See Pattern javadoc for details.


I'm curious if there's any difference between this and using the LITERAL flag, since the javadoc says there is no embedded flag to switch LITERAL on and off: java.sun.com/j2se/1.5.0/docs/api/java/util/regex/…
Note that literally using \Q and \E is only fine if you know your input. Pattern.quote(s) will also handle the case where your text actually contains these sequences.
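The point about input that itself contains \E can be checked with a sketch like the following. manualWrap and matchesLiterally are hypothetical helpers; the second one treats a pattern that fails to compile the same as a pattern that does not match.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class QuoteVsManualWrap {
    // Naive wrapping: only safe if the text never contains "\E".
    static String manualWrap(String text) {
        return "\\Q" + text + "\\E";
    }

    // True only if regex compiles and matches text in full.
    static boolean matchesLiterally(String regex, String text) {
        try {
            return Pattern.matches(regex, text);
        } catch (PatternSyntaxException e) {
            return false; // the hand-built pattern was not even valid
        }
    }

    public static void main(String[] args) {
        String tricky = "a\\Eb+"; // the five characters: a \ E b +
        System.out.println(matchesLiterally(manualWrap("$5"), "$5"));        // true
        System.out.println(matchesLiterally(manualWrap(tricky), tricky));    // false
        System.out.println(matchesLiterally(Pattern.quote(tricky), tricky)); // true
    }
}
```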
Meower68

First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.

I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.

I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:

java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)

In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.

Given:

// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input

replacing

msg = msg.replaceAll("<userInput \\/>", userInput);

with

msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));

solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
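The failure and the fix can be reproduced with a small self-contained sketch. The message text and the sample input are made up for illustration; the placeholder tag is the one from the snippet above.

```java
import java.util.regex.Matcher;

class DollarInReplacement {
    static final String PLACEHOLDER_REGEX = "<userInput />";

    // Unquoted: "$3" in the input is read as a reference to capture group 3.
    static String fillUnquoted(String msg, String userInput) {
        return msg.replaceAll(PLACEHOLDER_REGEX, userInput);
    }

    // Quoted: "$" and "\" in the input are inserted as plain characters.
    static String fillQuoted(String msg, String userInput) {
        return msg.replaceAll(PLACEHOLDER_REGEX, Matcher.quoteReplacement(userInput));
    }

    public static void main(String[] args) {
        String msg = "You wrote: <userInput />";
        String userInput = "It cost $3.50";
        try {
            System.out.println(fillUnquoted(msg, userInput));
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unquoted failed: " + e.getMessage());
        }
        System.out.println(fillQuoted(msg, userInput)); // You wrote: It cost $3.50
    }
}
```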


toinetoine

To get a protected pattern, you can prefix every character except letters and digits with a backslash ("\\\\" in the replacement string). After that, you can insert your own special symbols into the protected pattern, so that it works not as plain quoted text but as a real pattern: your own, free of any special symbols from the user.

public class Test {
    public static void main(String[] args) {
        String str = "y z (111)";
        String p1 = "x x (111)";
        String p2 = ".* .* \\(111\\)";

        p1 = escapeRE(p1);

        p1 = p1.replace("x", ".*");

        System.out.println( p1 + "-->" + str.matches(p1) ); 
            //.*\ .*\ \(111\)-->true
        System.out.println( p2 + "-->" + str.matches(p2) ); 
            //.* .* \(111\)-->true
    }

    public static String escapeRE(String str) {
        //Pattern escaper = Pattern.compile("([^a-zA-z0-9])");
        //return escaper.matcher(str).replaceAll("\\\\$1");
        return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
    }
}

You don't have to escape spaces, so you can change your pattern to "([^a-zA-Z0-9 ])".
Small typo, big consequences: "([^a-zA-z0-9])" also fails to match (i.e. fails to escape) [, \, ], ^, which you certainly want escaped! The typo is the second 'z', which should be a 'Z'; otherwise everything from ASCII 65 to ASCII 122 is included.
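The effect of that one-letter difference is easy to demonstrate. This is only a sketch; the method names escapeBuggy and escapeFixed are made up, and the two character classes are the ones quoted above.

```java
class EscapeRangeTypo {
    // A-z spans ASCII 65..122, which also covers [ \ ] ^ _ and the backtick.
    static String escapeBuggy(String s) {
        return s.replaceAll("([^a-zA-z0-9])", "\\\\$1");
    }

    // A-Z covers only the uppercase letters.
    static String escapeFixed(String s) {
        return s.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
    }

    public static void main(String[] args) {
        String input = "a[b]^c";
        System.out.println(escapeBuggy(input));                // a[b]^c
        System.out.println(escapeFixed(input));                // a\[b\]\^c
        System.out.println(input.matches(escapeBuggy(input))); // false
        System.out.println(input.matches(escapeFixed(input))); // true
    }
}
```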
Adam111p

Pattern.quote("blabla") works nicely.

It encloses the text in "\Q" and "\E", and it also takes care of any "\Q" and "\E" already present in the text. However, if you need real regular-expression escaping (or custom escaping), you can use this code:

String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));

This method returns: Some/s/wText\*/\,\*\*

Code for example and tests:

String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: " + Pattern.quote(someText));
System.out.println("Full escape: " + someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));

+1 This works pretty good for transforming a user-specified string of non-standard chars to a regex-compatible pattern. I'm using it for enforcing those chars in a password. Thanks.
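One way to sanity-check a hand-rolled escaper is a round trip: the escaped text, compiled as a regex, should match the original text in full. The sketch below assumes a variant of the character class above in which \\s stands for whitespace; escapeRegex and roundTrips are hypothetical names.

```java
import java.util.regex.Pattern;

class EscapeRoundTrip {
    // Hypothetical helper: backslash-escapes regex metacharacters and whitespace.
    static String escapeRegex(String text) {
        return text.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0");
    }

    // The escaped text, used as a regex, should match the original text.
    static boolean roundTrips(String text) {
        return Pattern.matches(escapeRegex(text), text);
    }

    public static void main(String[] args) {
        String[] samples = { "$5", "1 + 1 = 2?", "C:\\temp\\file (1).txt" };
        for (String sample : samples) {
            System.out.println(escapeRegex(sample) + " -> " + roundTrips(sample));
        }
        // first line printed: \$5 -> true
    }
}
```

A check like this only shows that the chosen samples survive; it does not prove that the character class covers every metacharacter, which is the fragility the comments above warn about.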
kit

The ^ (negation) symbol is used to match something that is not in the character group.

This is the link to Regular Expressions

Here is the image info about negation:

https://i.stack.imgur.com/m5cXU.png


I don't see how this addresses the question at all.