Java equivalent to JavaScript's encodeURIComponent that produces identical output?

java javascript unicode utf-8

I've been experimenting with various bits of Java code trying to come up with something that will encode a string containing quotes, spaces and "exotic" Unicode characters and produce output that's identical to JavaScript's encodeURIComponent function.

My torture test string is: "A" B ± "

If I enter the following JavaScript statement in Firebug:

encodeURIComponent('"A" B ± "');

—Then I get:

"%22A%22%20B%20%C2%B1%20%22"

Here's my little test Java program:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodingTest
{
  public static void main(String[] args) throws UnsupportedEncodingException
  {
    String s = "\"A\" B ± \"";
    System.out.println("URLEncoder.encode returns "
      + URLEncoder.encode(s, "UTF-8"));

    System.out.println("getBytes returns "
      + new String(s.getBytes("UTF-8"), "ISO-8859-1"));
  }
}

—This program outputs:

URLEncoder.encode returns %22A%22+B+%C2%B1+%22
getBytes returns "A" B Â± "

Close, but no cigar! What is the best way of encoding a UTF-8 string using Java so that it produces the same output as JavaScript's encodeURIComponent?

EDIT: I'm using Java 1.4 moving to Java 5 shortly.

ripper234

This is the class I came up with in the end:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/**
 * Utility class for JavaScript compatible UTF-8 encoding and decoding.
 * 
 * @see http://stackoverflow.com/questions/607176/java-equivalent-to-javascripts-encodeuricomponent-that-produces-identical-output
 * @author John Topley 
 */
public class EncodingUtil
{
  /**
   * Decodes the passed UTF-8 String using an algorithm that's compatible with
   * JavaScript's <code>decodeURIComponent</code> function. Returns
   * <code>null</code> if the String is <code>null</code>.
   *
   * @param s The UTF-8 encoded String to be decoded
   * @return the decoded String
   */
  public static String decodeURIComponent(String s)
  {
    if (s == null)
    {
      return null;
    }

    String result = null;

    try
    {
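      // URLDecoder treats '+' as a space but decodeURIComponent does not,
      // so protect literal plus signs before decoding.
      result = URLDecoder.decode(s.replaceAll("\\+", "%2B"), "UTF-8");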
    }

    // This exception should never occur.
    catch (UnsupportedEncodingException e)
    {
      result = s;  
    }

    return result;
  }

  /**
   * Encodes the passed String as UTF-8 using an algorithm that's compatible
   * with JavaScript's <code>encodeURIComponent</code> function. Returns
   * <code>null</code> if the String is <code>null</code>.
   * 
   * @param s The String to be encoded
   * @return the encoded String
   */
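  public static String encodeURIComponent(String s)
  {
    // URLEncoder.encode throws a NullPointerException for null input,
    // so honour the documented contract explicitly.
    if (s == null)
    {
      return null;
    }

    String result = null;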

    try
    {
      result = URLEncoder.encode(s, "UTF-8")
                         .replaceAll("\\+", "%20")
                         .replaceAll("\\%21", "!")
                         .replaceAll("\\%27", "'")
                         .replaceAll("\\%28", "(")
                         .replaceAll("\\%29", ")")
                         .replaceAll("\\%7E", "~");
    }

    // This exception should never occur.
    catch (UnsupportedEncodingException e)
    {
      result = s;
    }

    return result;
  }  

  /**
   * Private constructor to prevent this class from being instantiated.
   */
  private EncodingUtil()
  {
    super();
  }
}

Adding a tip. In Android 4.4 I found that we also need to replace %0A which means a return key in Android input, or it will crash the js.

Do you cover everything listed here: developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/…

@Aloong What do you mean by replace "%0A"? What character would be the replacement? Is it just empty string""?

There is no need to use replaceAll when a simple replace has the same effect. There is no need to escape the % in regular expressions, so instead of \\% just write %. If "this exception should never occur", rather throw an Error or at least an IllegalStateException, but don't silently do something buggy.
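The suggestions in the comment above could be applied roughly as follows. This is a minimal sketch, assuming Java 5 or later (String.replace with CharSequence arguments does not exist on 1.4); the class name JsCompatibleEncoder is hypothetical and not part of the original answer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical class name, used only for this illustration.
final class JsCompatibleEncoder
{
  private JsCompatibleEncoder()
  {
  }

  public static String encodeURIComponent(String s)
  {
    if (s == null)
    {
      return null;
    }

    try
    {
      // Literal replacements: no regular expressions are involved.
      return URLEncoder.encode(s, "UTF-8")
                       .replace("+", "%20")
                       .replace("%21", "!")
                       .replace("%27", "'")
                       .replace("%28", "(")
                       .replace("%29", ")")
                       .replace("%7E", "~");
    }
    catch (UnsupportedEncodingException e)
    {
      // Every JVM must support UTF-8, so fail loudly instead of
      // silently returning the unencoded input.
      throw new IllegalStateException("UTF-8 is not supported", e);
    }
  }

  public static void main(String[] args)
  {
    // The torture test string from the question.
    System.out.println(encodeURIComponent("\"A\" B \u00B1 \""));
  }
}
```

Run against the torture test string from the question, this should print %22A%22%20B%20%C2%B1%20%22.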

Tomalak

Looking at the implementation differences, I see that:

MDC on encodeURIComponent():

literal characters (regex representation): [-a-zA-Z0-9._*~'()!]

Java 1.5.0 documentation on URLEncoder:

literal characters (regex representation): [-a-zA-Z0-9._*]

the space character " " is converted into a plus sign "+".

So basically, to get the desired result, use URLEncoder.encode(s, "UTF-8") and then do some post-processing:

replace all occurrences of "+" with "%20"

replace all occurrences of "%xx" representing any of [~'()!] back to their literal counter-parts

I wish you had written "Replace all occurrences of "%xx" representing any of [~'()!] back to their literal counter-parts" in some simple language. :( my tiny head is not able to understand it .......

@Shailendra [~'()!] means "~" or "'" or "(" or ")" or "!". :) I recommend learning the regex basics, too, though. (I also didn't expand on that since at least two other answers show the respective Java code.)

Replacing all occurrences of "+" with "%20" is potentially destructive, as "+" is a legal character in URI paths (though not in the query string). For example, "a+b c" should be encoded as "a+b%20c"; this solution would convert it to "a%20b%20c". Instead, use new URI(null, null, value, null).getRawPath().

@ChrisNitchie That was not the point of the question. The question was "Java equivalent to JavaScript's encodeURIComponent that produces identical output?", not "Generic Java encode-URI-component function?".

@ChrisNitchie a+b c is encoded to a%2Bb+c with java's URLEncoder and to a%2Bb%20c with js' encodeURIComponent.
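The point made in the last comment can be checked with a few lines. This is only a sketch, and the class and method names (PlusHandlingDemo, raw, postProcessed) are made up for the illustration. Because URLEncoder has already turned a literal "+" into %2B, the only plus signs left in its output are the ones that stand for spaces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; the names are not from the thread.
final class PlusHandlingDemo
{
  // Plain URLEncoder output.
  public static String raw(String s)
  {
    try
    {
      return URLEncoder.encode(s, "UTF-8");
    }
    catch (UnsupportedEncodingException e)
    {
      throw new IllegalStateException("UTF-8 is not supported", e);
    }
  }

  // URLEncoder output with the remaining '+' (encoded spaces) rewritten.
  public static String postProcessed(String s)
  {
    return raw(s).replaceAll("\\+", "%20");
  }

  public static void main(String[] args)
  {
    System.out.println(raw("a+b c"));           // a%2Bb+c
    System.out.println(postProcessed("a+b c")); // a%2Bb%20c
  }
}
```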

Ravi Wallau

Using the javascript engine that is shipped with Java 6:


import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class Wow
{
    public static void main(String[] args) throws Exception
    {
        ScriptEngineManager factory = new ScriptEngineManager();
        ScriptEngine engine = factory.getEngineByName("JavaScript");
        engine.eval("print(encodeURIComponent('\"A\" B ± \"'))");
    }
}

Output: %22A%22%20B%20%c2%b1%20%22

The case is different but it's closer to what you want.

Ah, sorry...I should have mentioned in the question that I'm on Java 1.4 moving to Java 5 shortly!

If javascript is the only solution you can try Rhino, but it's too much just for this small problem.

Even if he was using Java 6, I think this solution is WAY over the top. I don't think he's looking for a way to directly invoke the javascript method, just a way to emulate it.

Maybe. I think the easiest solution would be to write your own escape function if you can't find anything that does the trick for you. Just copy some method from the StringEscapeUtils class (Jakarta Commons Lang) and reimplement it with your needs.

This actually works, and if you're not worried about performance... I think it's good.

Chris Nitchie

I use java.net.URI#getRawPath(), e.g.

String s = "a+b c.html";
String fixed = new URI(null, null, s, null).getRawPath();

The value of fixed will be a+b%20c.html, which is what you want.

Post-processing the output of URLEncoder.encode() will obliterate any pluses that are supposed to be in the URI. For example

URLEncoder.encode("a+b c.html").replaceAll("\\+", "%20");

will give you a%20b%20c.html, which will be interpreted as a b c.html.

After thinking this should be the best answer, I tried it in practice with a few filenames, and it failed in at least two, one with cyrillic characters. So, no, this obviously hasn't been tested well enough.

doesn't work for strings like: http://a+b c.html , it will throw an error
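One possible explanation for these failures, going by the java.net.URI Javadoc: the multi-argument constructors quote only characters that are illegal in a path. Characters such as & and = are legal there, and non-ASCII letters fall into the "other" category that getRawPath() leaves untouched, so the result can differ from encodeURIComponent. The sketch below illustrates this; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; the names are not from the thread.
final class RawPathDemo
{
  public static String rawPath(String s)
  {
    try
    {
      return new URI(null, null, s, null).getRawPath();
    }
    catch (URISyntaxException e)
    {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args)
  {
    // Space is quoted, '+' is kept (the example from the answer).
    System.out.println(rawPath("a+b c.html"));
    // '&' and '=' are legal path characters, so they are not quoted,
    // whereas encodeURIComponent would produce a%26b%3Dc.
    System.out.println(rawPath("a&b=c"));
    // A Cyrillic letter is returned as-is rather than percent-encoded.
    System.out.println(rawPath("\u0436.html"));
  }
}
```

URI#toASCIIString() does percent-encode the non-ASCII characters, but it still leaves & and = alone, so it is not a drop-in replacement either.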

Joe Mill

I came up with my own version of encodeURIComponent, because the posted solution has one problem: if a + that should be encoded is present in the String, it will be converted to a space.

So here is my class:

import java.io.UnsupportedEncodingException;
import java.util.BitSet;

public final class EscapeUtils
{
    /** used for the encodeURIComponent function */
    private static final BitSet dontNeedEncoding;

    static
    {
        dontNeedEncoding = new BitSet(256);

        // a-z
        for (int i = 97; i <= 122; ++i)
        {
            dontNeedEncoding.set(i);
        }
        // A-Z
        for (int i = 65; i <= 90; ++i)
        {
            dontNeedEncoding.set(i);
        }
        // 0-9
        for (int i = 48; i <= 57; ++i)
        {
            dontNeedEncoding.set(i);
        }

        // '()*
        for (int i = 39; i <= 42; ++i)
        {
            dontNeedEncoding.set(i);
        }
        dontNeedEncoding.set(33); // !
        dontNeedEncoding.set(45); // -
        dontNeedEncoding.set(46); // .
        dontNeedEncoding.set(95); // _
        dontNeedEncoding.set(126); // ~
    }

    /**
     * A Utility class should not be instantiated.
     */
    private EscapeUtils()
    {

    }

    /**
     * Escapes all characters except the following: alphabetic, decimal digits, - _ . ! ~ * ' ( )
     * 
     * @param input
     *            A component of a URI
     * @return the escaped URI component
     */
    public static String encodeURIComponent(String input)
    {
        if (input == null)
        {
            return input;
        }

        StringBuilder filtered = new StringBuilder(input.length());
        for (int i = 0; i < input.length();)
        {
            // Work on whole code points so that a surrogate pair is encoded
            // as one UTF-8 sequence instead of two broken halves.
            final int cp = input.codePointAt(i);
            final int next = i + Character.charCount(cp);
            if (dontNeedEncoding.get(cp))
            {
                filtered.append((char) cp);
            }
            else
            {
                final byte[] b = toBytesUTF(input.substring(i, next));

                for (int j = 0; j < b.length; ++j)
                {
                    filtered.append('%');
                    filtered.append("0123456789ABCDEF".charAt(b[j] >> 4 & 0xF));
                    filtered.append("0123456789ABCDEF".charAt(b[j] & 0xF));
                }
            }
            i = next;
        }
        return filtered.toString();
    }

    private static byte[] toBytesUTF(String s)
    {
        try
        {
            return s.getBytes("UTF-8");
        }
        catch (UnsupportedEncodingException e)
        {
            // UTF-8 is required on every JVM, so this should never happen.
            throw new IllegalStateException("UTF-8 is not supported");
        }
    }
}

Thanks for a good solution! The others look totally... inefficient, IMO. Perhaps it'd be even better without the BitSet on today's hardware. Or two hard-coded longs for 0...127.
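The "two longs" idea from the comment above might look like the following. This is a rough sketch and not the answer author's code: the class name MaskEscaper is hypothetical, the masks are computed once at class initialization instead of being written out as hex constants, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class; a sketch of the bitmask idea, not the original answer.
final class MaskEscaper
{
  private static final String HEX = "0123456789ABCDEF";
  private static final long LOW;   // safe characters with codes 0..63
  private static final long HIGH;  // safe characters with codes 64..127

  static
  {
    // The characters encodeURIComponent leaves unescaped.
    String safe = "abcdefghijklmnopqrstuvwxyz"
                + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                + "0123456789-_.!~*'()";
    long low = 0L;
    long high = 0L;
    for (int i = 0; i < safe.length(); i++)
    {
      char c = safe.charAt(i);
      if (c < 64)
      {
        low |= 1L << c;
      }
      else
      {
        high |= 1L << (c - 64);
      }
    }
    LOW = low;
    HIGH = high;
  }

  private static boolean isSafe(byte b)
  {
    if (b < 0)
    {
      return false; // bytes 0x80..0xFF are negative and always escaped
    }
    return b < 64 ? (LOW & (1L << b)) != 0
                  : (HIGH & (1L << (b - 64))) != 0;
  }

  public static String encodeURIComponent(String input)
  {
    if (input == null)
    {
      return null;
    }
    byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
    StringBuilder out = new StringBuilder(bytes.length);
    for (byte b : bytes)
    {
      if (isSafe(b))
      {
        out.append((char) b);
      }
      else
      {
        out.append('%')
           .append(HEX.charAt((b >> 4) & 0xF))
           .append(HEX.charAt(b & 0xF));
      }
    }
    return out.toString();
  }

  public static void main(String[] args)
  {
    System.out.println(encodeURIComponent("\"A\" B \u00B1 \""));
  }
}
```

Encoding the UTF-8 bytes of the whole string, rather than one char at a time, also keeps surrogate pairs together. Whether the masks are measurably faster than a BitSet is something to benchmark, not assume.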

URLEncoder.encode("+", "UTF-8"); yields "%2B", which is the proper URL encoding, so your solution is, my apologies, totally unnecessary. Why on earth URLEncoder.encode doesn't turn spaces into %20 is beyond me.

sangupta

I came up with another implementation, documented at http://blog.sangupta.com/2010/05/encodeuricomponent-and.html. The implementation can also handle Unicode bytes.

balazs

for me this worked:

import org.apache.http.client.utils.URIBuilder;

String encodedString = new URIBuilder()
  .setParameter("i", stringToEncode)
  .build()
  .getRawQuery() // output: i=encodedString
  .substring(2);

or with a different UriBuilder

import javax.ws.rs.core.UriBuilder;

String encodedString = UriBuilder.fromPath("")
  .queryParam("i", stringToEncode)
  .toString()   // output: ?i=encodedString
  .substring(3);

In my opinion, using a standard library is a better idea than post-processing manually. Also, @Chris's answer looked good, but it doesn't work for URLs like "http://a+b c.html".

Using standard library is good... ...unless you are middle ware, and depend on a different version of a standard library, and then anyone using your code has to fiddle with dependencies, and then hope nothing breaks...

It would be great if this solution worked, but it does not behave the same way as the requested encodeURIComponent. For the input "?& " (with a trailing space), encodeURIComponent returns %3F%26%20, but your suggestion returns %3F%26+. I know this is mentioned multiple times in other questions and answers, but it should be mentioned here before people blindly trust it.

Mike Bryant

I have successfully used the java.net.URI class like so:

public static String uriEncode(String string) {
    String result = string;
    if (null != string) {
        try {
            String scheme = null;
            String ssp = string;
            int es = string.indexOf(':');
            if (es > 0) {
                scheme = string.substring(0, es);
                ssp = string.substring(es + 1);
            }
            result = (new URI(scheme, ssp, null)).toString();
        } catch (URISyntaxException usex) {
            // ignore and use string that has syntax error
        }
    }
    return result;
}

No, this approach is not fully successful, although it comes relatively close. You still have problems. For example, Java will encode the hash character # as %23, while JavaScript will not encode it. See: developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/… JavaScript does not escape A-Z a-z 0-9 ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) #, and Java will escape some of these.

The good thing is that, in a unit test using the expression String charactersJavascriptDoesNotEscape = "A-Za-z0-9;,/?:@&=+$-_.!~*'()#"; the hash character is the only outlier. So fixing the algorithm above to make it compatible with JavaScript is trivial.

silver

This is a straightforward example of Ravi Wallau's solution:

public String buildSafeURL(String partialURL, String documentName)
        throws ScriptException {
    ScriptEngineManager scriptEngineManager = new ScriptEngineManager();
    ScriptEngine scriptEngine = scriptEngineManager
            .getEngineByName("JavaScript");

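    // Bind the value as a script variable instead of splicing it into the
    // source, so quotes or backslashes in the name cannot break the script.
    scriptEngine.put("documentName", documentName);
    String urlSafeDocumentName = String.valueOf(scriptEngine
            .eval("encodeURIComponent(documentName)"));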
    String safeURL = partialURL + urlSafeDocumentName;

    return safeURL;
}

public static void main(String[] args) {
    EncodeURIComponentDemo demo = new EncodeURIComponentDemo();
    String partialURL = "https://www.website.com/document/";
    String documentName = "Tom & Jerry Manuscript.pdf";

    try {
        System.out.println(demo.buildSafeURL(partialURL, documentName));
    } catch (ScriptException se) {
        se.printStackTrace();
    }
}

Output: https://www.website.com/document/Tom%20%26%20Jerry%20Manuscript.pdf

It also answers the hanging question in the comments by Loren Shqipognja on how to pass a String variable to encodeURIComponent(). The method scriptEngine.eval() returns an Object, so it can be converted to a String via String.valueOf(), among other methods.

Community

This is what I'm using:

private static final String HEX = "0123456789ABCDEF";

public static String encodeURIComponent(String str) {
    if (str == null) return null;

    byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder builder = new StringBuilder(bytes.length);

    for (byte c : bytes) {
        if (c >= 'a' ? c <= 'z' || c == '~' :
            c >= 'A' ? c <= 'Z' || c == '_' :
            c >= '0' ? c <= '9' :  c == '-' || c == '.')
            builder.append((char)c);
        else
            builder.append('%')
                   .append(HEX.charAt(c >> 4 & 0xf))
                   .append(HEX.charAt(c & 0xf));
    }

    return builder.toString();
}

It goes beyond Javascript's by percent-encoding every character that is not an unreserved character according to RFC 3986.

This is the opposite conversion:

public static String decodeURIComponent(String str) {
    if (str == null) return null;

    int length = str.length();
    byte[] bytes = new byte[length / 3];
    StringBuilder builder = new StringBuilder(length);

    for (int i = 0; i < length; ) {
        char c = str.charAt(i);
        if (c != '%') {
            builder.append(c);
            i += 1;
        } else {
            int j = 0;
            do {
                char h = str.charAt(i + 1);
                char l = str.charAt(i + 2);
                i += 3;

                h -= '0';
                if (h >= 10) {
                    h |= ' ';
                    h -= 'a' - '0';
                    if (h >= 6) throw new IllegalArgumentException();
                    h += 10;
                }

                l -= '0';
                if (l >= 10) {
                    l |= ' ';
                    l -= 'a' - '0';
                    if (l >= 6) throw new IllegalArgumentException();
                    l += 10;
                }

                bytes[j++] = (byte)(h << 4 | l);
                if (i >= length) break;
                c = str.charAt(i);
            } while (c == '%');
            builder.append(new String(bytes, 0, j, StandardCharsets.UTF_8));
        }
    }

    return builder.toString();
}

AlexN

I used String encodedUrl = new URI(null, url, null).toASCIIString(); to encode URLs. To add parameters after the existing ones in the URL, I use UriComponentsBuilder.

I've created a demo using this approach, which I find the best. My use case was to encode JSON so that it can be retrieved on the JS side by reading it from a data attribute: repl.it/@raythurnevoid/URIEncodeJSON#Main.java

honzajde

I have found PercentEscaper class from google-http-java-client library, that can be used to implement encodeURIComponent quite easily.

PercentEscaper from google-http-java-client javadoc google-http-java-client home

aristotll

Guava library has PercentEscaper:

Escaper percentEscaper = new PercentEscaper("-_.*", false);

"-_.*" are safe characters

false tells PercentEscaper to escape a space as '%20', not '+'
