
Efficient way to remove ALL whitespace from String?

I'm calling a REST API and am receiving an XML response back. It returns a list of workspace names, and I'm writing a quick IsExistingWorkspace() method. Since all workspace names consist of contiguous characters with no whitespace, I'm assuming the easiest way to find out if a particular workspace is in the list is to remove all whitespace (including newlines) and do this (XML is the string received from the web request):

XML.Contains("<name>" + workspaceName + "</name>");

I know it's case-sensitive, and I'm relying on that. I just need a way to remove all whitespace in a string efficiently. I know RegEx and LINQ can do it, but I'm open to other ideas. I am mostly just concerned about speed.

Parsing XML with regex is almost as bad as parsing HTML with regex.
@henk holterman; See my answer below, regexp doesn't seem to be the fastest in all cases.
Regex doesn't seem to be the fastest at all. I have summarized the results from many different ways to remove whitespace from a string. The summary is in an answer below - stackoverflow.com/a/37347881/582061

NearHuscarl

This is the fastest way I know of, even though you said you didn't want to use regular expressions:

Regex.Replace(XML, @"\s+", "");

Crediting @hypehuman in the comments, if you plan to do this more than once, create and store a Regex instance. This will save the overhead of constructing it every time, which is more expensive than you might think.

private static readonly Regex sWhitespace = new Regex(@"\s+");
public static string ReplaceWhitespace(string input, string replacement) 
{
    return sWhitespace.Replace(input, replacement);
}

I could use a regular expression, I'm just not sure if it's the fastest way.
Shouldn't that be Regex.Replace(XML, @"\s+", "")?
If you plan to do this more than once, create and store a Regex instance. This will save the overhead of constructing it every time, which is more expensive than you might think. private static readonly Regex sWhitespace = new Regex(@"\s+"); public static string ReplaceWhitespace(string input, string replacement) { return sWhitespace.Replace(input, replacement); }
use split/join combination as tested to be the fastest so far, see KernowCode answer below.
For those new to RegEx and looking for an explanation of what this expression means: \s means "match any whitespace character", and + means "match one or more of the preceding token". Also, RegExr is a nice website for practicing RegEx expressions if you want to experiment.
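To make the \s+ explanation concrete, here is a minimal sketch. It assumes a C# 9 or later top-level program, and the sample string and variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// \s matches one whitespace character; + extends the match to a whole run of them.
var whitespace = new Regex(@"\s+");

// Illustrative input: one run of space/tab/space and one run of CR/LF.
string sample = "a \t b\r\nc";

Console.WriteLine(whitespace.Matches(sample).Count); // 2 (two runs of whitespace)
Console.WriteLine(whitespace.Replace(sample, ""));   // abc
```

Because + matches a whole run at once, each run is replaced in a single step rather than character by character.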
Peter Mortensen

I have an alternative way without regexp, and it seems to perform pretty well. It builds on Brandon Moretz's answer:

 public static string RemoveWhitespace(this string input)
 {
    return new string(input.ToCharArray()
        .Where(c => !Char.IsWhiteSpace(c))
        .ToArray());
 }

I tested it in a simple unit test:

[Test]
[TestCase("123 123 1adc \n 222", "1231231adc222")]
public void RemoveWhiteSpace1(string input, string expected)
{
    string s = null;
    for (int i = 0; i < 1000000; i++)
    {
        s = input.RemoveWhitespace();
    }
    Assert.AreEqual(expected, s);
}

[Test]
[TestCase("123 123 1adc \n 222", "1231231adc222")]
public void RemoveWhiteSpace2(string input, string expected)
{
    string s = null;
    for (int i = 0; i < 1000000; i++)
    {
        s = Regex.Replace(input, @"\s+", "");
    }
    Assert.AreEqual(expected, s);
}

For 1,000,000 attempts the first option (without regexp) runs in less than a second (700 ms on my machine), and the second takes 3.5 seconds.


.ToCharArray() is not necessary; you can use .Where() directly on a string.
Just to note here: Regex is slower... on small strings! If, say, you had a digitized version of a volume on US tax law (~a million words?), with a handful of iterations, Regex is king, by far! It's not about what is faster, but about what should be used in which circumstance. You only proved half the equation here. -1 until you prove the second half of the test, so that the answer provides more insight into when each approach should be used.
@ppumkin He asked for a single pass removal of whitespace. Not multiple iterations of other processing. I'm not going to make this single pass whitespace removal into an extended post about benchmarking text processing.
You said it's preferred not to use regex this time, but didn't say why.
@ProgramFOX, in a different question (can't readily find it) I noticed that at least in some queries, using ToCharArray is faster than using .Where() directly on the string. This has something to do with the overhead of going through IEnumerable<> in each iteration step, with ToCharArray being very efficient (block copy), and with the compiler optimizing iteration over arrays. No one has been able to explain to me why this difference exists, but measure before you remove ToCharArray().
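The "measure before you remove ToCharArray()" advice can be tried with a rough Stopwatch comparison like the one below. This is only a sketch, assuming a C# 9 or later top-level program. The helper names WithToCharArray, WithoutToCharArray and TimeMs are hypothetical, and a simple loop like this is not a rigorous benchmark, so treat the numbers as indicative at best.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hypothetical helpers for illustration: the same LINQ filter, with and without ToCharArray().
static string WithToCharArray(string input) =>
    new string(input.ToCharArray().Where(c => !char.IsWhiteSpace(c)).ToArray());

static string WithoutToCharArray(string input) =>
    new string(input.Where(c => !char.IsWhiteSpace(c)).ToArray());

// Times 'runs' calls of 'remove' and returns the elapsed milliseconds.
static long TimeMs(Func<string, string> remove, string input, int runs)
{
    remove(input); // warm-up call so JIT compilation is not part of the timing
    var stopwatch = Stopwatch.StartNew();
    for (int i = 0; i < runs; i++)
    {
        remove(input);
    }
    stopwatch.Stop();
    return stopwatch.ElapsedMilliseconds;
}

const string sample = "123 123 1adc \n 222";
const int iterations = 1000000;

Console.WriteLine("With ToCharArray:    " + TimeMs(WithToCharArray, sample, iterations) + " ms");
Console.WriteLine("Without ToCharArray: " + TimeMs(WithoutToCharArray, sample, iterations) + " ms");
```

Results will depend on the runtime version, the build configuration and the input, so it is worth running this against strings that resemble your real data.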
Rudey

Try the replace method of the string in C#.

XML.Replace(" ", string.Empty);

Doesn't remove tabs or newlines. If I do multiple removes now I'm making multiple passes over the string.
Downvote for not removing all whitespace, as slandau and Henk's answers do.
@MattSach why does it not remove ALL whitespace?
@Zapnologica It's only replacing space characters. The OP asked for replacement of newlines as well (which are "whitespace" characters, even though they're not a space character).
Downvoting for not reading the OP's question before answering: all whitespace, not just spaces.
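The objection in these comments can be seen in a short sketch. It assumes a C# 9 or later top-level program and a made-up XML fragment that contains a tab and a line break inside a name element.

```csharp
using System;
using System.Linq;

// Made-up response fragment: a tab and CR/LF inside the first <name> element.
string xml = "<name>\tfoo\r\n</name> <name>bar</name>";

// Removes only the U+0020 space character.
string spacesOnly = xml.Replace(" ", string.Empty);

// Removes every character that char.IsWhiteSpace reports as whitespace.
string allWhitespace = new string(xml.Where(c => !char.IsWhiteSpace(c)).ToArray());

Console.WriteLine(spacesOnly.Contains("<name>foo</name>"));    // False
Console.WriteLine(allWhitespace.Contains("<name>foo</name>")); // True
```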
kernowcode

My solution is to use Split and Join and it is surprisingly fast, in fact the fastest of the top answers here.

str = string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));

Timings for 10,000 iterations on a simple string with whitespace, including newlines and tabs:

split/join = 60 milliseconds

linq chararray = 94 milliseconds

regex = 437 milliseconds

Improve this by wrapping it up in method to give it meaning, and also make it an extension method while we are at it ...

public static string RemoveWhitespace(this string str) {
    return string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
}

I really like this solution; I've been using a similar one since pre-LINQ days. I'm actually impressed with LINQ's performance, and somewhat surprised by regex. Maybe the regex code was not as optimal as it could have been (you'd have to cache the regex object, for example). But the crux of the problem is that the "quality" of the data will matter a lot. Maybe with long strings the regex will outperform the other options. It would be a fun benchmark to perform... :-)
How does default(string[]) == a list of all whitespace characters? I see it working, but I am not understanding how?
@kernowcode You mean the ambiguity between the two overloads with string[] and char[]? You just have to specify which one you want, e.g.: string.Join("", str.Split((string[])null, StringSplitOptions.RemoveEmptyEntries));. That is actually what your call to default does in this case, since it returns null as well: it helps the compiler decide which overload to pick. Hence my comment, because the statement in your comment "Split needs a valid array and null will not do ..." is false. No big deal, just thought it worth mentioning since Jake Drew asked how this worked. +1 for your answer
Cool idea ... but i would do it as follows: string.Concat("H \ne llo Wor ld".Split())
michaelkrisper solution is very readable. I did a test and 'split/join' (162 milliseconds) performed better than 'split/concat' (180 milliseconds) for 10,000 iterations of the same string.
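For readers wondering how a null separator ends up splitting on whitespace: String.Split is documented to treat whitespace characters as the delimiters when the separator array is null or empty. A minimal sketch, assuming a C# 9 or later top-level program and a made-up sample string:

```csharp
using System;

// Made-up sample containing spaces, a newline and a tab.
string sample = "H \ne llo\tWor ld";

// The cast only tells the compiler which Split overload to pick; the value is still null,
// and a null separator array makes Split use whitespace characters as delimiters.
string[] parts = sample.Split((string[])null, StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine(string.Join("|", parts)); // H|e|llo|Wor|ld
Console.WriteLine(string.Join("", parts));  // HelloWorld
```

RemoveEmptyEntries drops the empty strings that would otherwise appear between adjacent whitespace characters.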
Stian Standahl

Building on Henk's answer, I have created some test methods with his answer and some added, more optimized, methods. I found that the results differ based on the size of the input string, so I have tested with two result sets. For the fastest method, the linked source has an even faster variant, but since it is marked unsafe I have left it out.

Long input string results:

InPlaceCharArray: 2021 ms (Sunsetquest's answer) - (Original source)
String split then join: 4277 ms (Kernowcode's answer)
String reader: 6082 ms
LINQ using native char.IsWhitespace: 7357 ms
LINQ: 7746 ms (Henk's answer)
ForLoop: 32320 ms
RegexCompiled: 37157 ms
Regex: 42940 ms

Short input string results:

InPlaceCharArray: 108 ms (Sunsetquest's answer) - (Original source)
String split then join: 294 ms (Kernowcode's answer)
String reader: 327 ms
ForLoop: 343 ms
LINQ using native char.IsWhitespace: 624 ms
LINQ: 645 ms (Henk's answer)
RegexCompiled: 1671 ms
Regex: 2599 ms

Code:

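using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Static so that the extension method StringSplitThenJoin below compiles.
public static class RemoveWhitespace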
{
    public static string RemoveStringReader(string input)
    {
        var s = new StringBuilder(input.Length); // pre-size to the input length to avoid regrowth
        using (var reader = new StringReader(input))
        {
            int i = 0;
            char c;
            for (; i < input.Length; i++)
            {
                c = (char)reader.Read();
                if (!char.IsWhiteSpace(c))
                {
                    s.Append(c);
                }
            }
        }

        return s.ToString();
    }

    public static string RemoveLinqNativeCharIsWhitespace(string input)
    {
        return new string(input.ToCharArray()
            .Where(c => !char.IsWhiteSpace(c))
            .ToArray());
    }

    public static string RemoveLinq(string input)
    {
        return new string(input.ToCharArray()
            .Where(c => !Char.IsWhiteSpace(c))
            .ToArray());
    }

    public static string RemoveRegex(string input)
    {
        return Regex.Replace(input, @"\s+", "");
    }

    private static Regex compiled = new Regex(@"\s+", RegexOptions.Compiled);
    public static string RemoveRegexCompiled(string input)
    {
        return compiled.Replace(input, "");
    }

    public static string RemoveForLoop(string input)
    {
        for (int i = input.Length - 1; i >= 0; i--)
        {
            if (char.IsWhiteSpace(input[i]))
            {
                input = input.Remove(i, 1);
            }
        }
        return input;
    }

    public static string StringSplitThenJoin(this string str)
    {
        return string.Join("", str.Split(default(string[]), StringSplitOptions.RemoveEmptyEntries));
    }

    public static string RemoveInPlaceCharArray(string input)
    {
        var len = input.Length;
        var src = input.ToCharArray();
        int dstIdx = 0;
        for (int i = 0; i < len; i++)
        {
            var ch = src[i];
            switch (ch)
            {
                case '\u0020':
                case '\u00A0':
                case '\u1680':
                case '\u2000':
                case '\u2001':
                case '\u2002':
                case '\u2003':
                case '\u2004':
                case '\u2005':
                case '\u2006':
                case '\u2007':
                case '\u2008':
                case '\u2009':
                case '\u200A':
                case '\u202F':
                case '\u205F':
                case '\u3000':
                case '\u2028':
                case '\u2029':
                case '\u0009':
                case '\u000A':
                case '\u000B':
                case '\u000C':
                case '\u000D':
                case '\u0085':
                    continue;
                default:
                    src[dstIdx++] = ch;
                    break;
            }
        }
        return new string(src, 0, dstIdx);
    }
}

Tests:

using System;
using System.Diagnostics;
using NUnit.Framework;

[TestFixture]
public class Test
{
    // Short input
    //private const string input = "123 123 \t 1adc \n 222";
    //private const string expected = "1231231adc222";

    // Long input
    private const string input = "123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222123 123 \t 1adc \n 222";
    private const string expected = "1231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc2221231231adc222";

    private const int iterations = 1000000;

    [Test]
    public void RemoveInPlaceCharArray()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveInPlaceCharArray(input);
        }

        stopwatch.Stop();
        Console.WriteLine("InPlaceCharArray: " + stopwatch.ElapsedMilliseconds + " ms");
        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveStringReader()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveStringReader(input);
        }

        stopwatch.Stop();
        Console.WriteLine("String reader: " + stopwatch.ElapsedMilliseconds + " ms");
        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveLinqNativeCharIsWhitespace()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveLinqNativeCharIsWhitespace(input);
        }

        stopwatch.Stop();
        Console.WriteLine("LINQ using native char.IsWhitespace: " + stopwatch.ElapsedMilliseconds + " ms");
        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveLinq()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveLinq(input);
        }

        stopwatch.Stop();
        Console.WriteLine("LINQ: " + stopwatch.ElapsedMilliseconds + " ms");
        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveRegex()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveRegex(input);
        }

        stopwatch.Stop();
        Console.WriteLine("Regex: " + stopwatch.ElapsedMilliseconds + " ms");

        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveRegexCompiled()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveRegexCompiled(input);
        }

        stopwatch.Stop();
        Console.WriteLine("RegexCompiled: " + stopwatch.ElapsedMilliseconds + " ms");

        Assert.AreEqual(expected, s);
    }

    [Test]
    public void RemoveForLoop()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.RemoveForLoop(input);
        }

        stopwatch.Stop();
        Console.WriteLine("ForLoop: " + stopwatch.ElapsedMilliseconds + " ms");

        Assert.AreEqual(expected, s);
    }

    [Test]
    public void StringSplitThenJoin()
    {
        string s = null;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            s = RemoveWhitespace.StringSplitThenJoin(input);
        }

        stopwatch.Stop();
        Console.WriteLine("StringSplitThenJoin: " + stopwatch.ElapsedMilliseconds + " ms");

        Assert.AreEqual(expected, s);
    }
}

Edit: Tested a nice one liner from Kernowcode.


Foxfire

Just an alternative because it looks quite nice :) - NOTE: Henk's answer is the quickest of these.

input.ToCharArray()
 .Where(c => !Char.IsWhiteSpace(c))
 .Select(c => c.ToString())
 .Aggregate((a, b) => a + b);

Testing 1,000,000 loops on "This is a simple Test"

This method = 1.74 seconds
Regex = 2.58 seconds
new String (Henks) = 0.82 seconds


Why was this downvoted? It's perfectly acceptable, meets the requirements, works faster than the RegEx option and is very readable?
because it can be written a lot shorter: new string(input.Where(c => !Char.IsWhiteSpace(c)).ToArray());
Might be true - but the answer still stands, is readable, faster than regex and produces the desired result. Many of the other answers are AFTER this one...therefore a downvote does not make sense.
Is there a unit for "0.82"? Or is it a relative measure (82%)? Can you edit your answer to make it more clear?
SunsetQuest

I found a nice write-up on this on CodeProject by Felipe Machado (with help by Richard Robertson)

He tested ten different methods. This one is the fastest safe version...

public static string TrimAllWithInplaceCharArray(string str) {

    var len = str.Length;
    var src = str.ToCharArray();
    int dstIdx = 0;

    for (int i = 0; i < len; i++) {
        var ch = src[i];

        switch (ch) {
            case '\u0020': case '\u00A0': case '\u1680': case '\u2000': case '\u2001':
            case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':
            case '\u2007': case '\u2008': case '\u2009': case '\u200A': case '\u202F':
            case '\u205F': case '\u3000': case '\u2028': case '\u2029': case '\u0009':
            case '\u000A': case '\u000B': case '\u000C': case '\u000D': case '\u0085':
                continue;

            default:
                src[dstIdx++] = ch;
                break;
        }
    }
    return new string(src, 0, dstIdx);
}

And the fastest unsafe version... (some improvements by Sunsetquest, 5/26/2021)

public static unsafe void RemoveAllWhitespace(ref string str)
{
    fixed (char* pfixed = str)
    {
        char* dst = pfixed;
        for (char* p = pfixed; *p != 0; p++)
        {
            switch (*p)
            {
                case '\u0020': case '\u00A0': case '\u1680': case '\u2000': case '\u2001':
                case '\u2002': case '\u2003': case '\u2004': case '\u2005': case '\u2006':
                case '\u2007': case '\u2008': case '\u2009': case '\u200A': case '\u202F':
                case '\u205F': case '\u3000': case '\u2028': case '\u2029': case '\u0009':
                case '\u000A': case '\u000B': case '\u000C': case '\u000D': case '\u0085':
                continue;

                default:
                    *dst++ = *p;
                    break;
            }
        }

        uint* pi = (uint*)pfixed;
        ulong len = ((ulong)dst - (ulong)pfixed) >> 1;
        pi[-1] = (uint)len;
        pfixed[len] = '\0';
    }
}

There are also some nice independent benchmarks on Stack Overflow by Stian Standahl that also show how Felipe's function is about 300% faster than the next fastest function. Also, for the one I modified, I used this trick.


I've tried translating this to C++ but am a little stuck. Any ideas why my port might be failing? stackoverflow.com/questions/42135922/…
I can't resist. Look in the comments section of the article you refer to. You will find me as "Basketcase Software". He and I worked on this together for a while. I had completely forgotten about this when this problem came back up again. Thanks for the good memories. :)
And what if you want to remove extra WS only ? What about this stackoverflow.com/questions/17770202/… mod ?
"Fastest" is a bit slower ;-) Using a string as the container performs better here (in my app 4:15 to 3:55 => 8.5% less, but with the string left as-is 3:30 => 21.4% less, and the profiler shows around 50% of the time spent in this method). So in real life a string should be around 40% faster compared to the (slow) array conversion used here.
The original string will be changed by the unsafe version!
Peter Mortensen

If you need superb performance, you should avoid LINQ and regular expressions in this case. I did some performance benchmarking, and it seems that if you want to strip white space from beginning and end of the string, string.Trim() is your ultimate function.

If you need to strip all whitespace from a string, the following method is the fastest of all that have been posted here:

    public static string RemoveWhitespace(this string input)
    {
        int j = 0, inputlen = input.Length;
        char[] newarr = new char[inputlen];

        for (int i = 0; i < inputlen; ++i)
        {
            char tmp = input[i];

            if (!char.IsWhiteSpace(tmp))
            {
                newarr[j] = tmp;
                ++j;
            }
        }
        return new String(newarr, 0, j);
    }

I'd be curious to know the details of your benchmarkings--not that I am skeptical, but I'm curious about the overhead involved with Linq. How bad was it?
I haven't re-run all the tests, but I can remember this much: everything that involved Linq was a lot slower than anything without it. All the clever usage of string/char functions and constructors made no percentage difference if Linq was used.
Peter Mortensen

Regex is overkill; just use an extension method on string (thanks Henk). This is trivial and should have been part of the framework. Anyhow, here's my implementation:

public static partial class Extension
{
    public static string RemoveWhiteSpace(this string self)
    {
        return new string(self.Where(c => !Char.IsWhiteSpace(c)).ToArray());
    }
}

this is basically an unnecessary answer (regex is overkill, but is a quicker solution than given one - and it is already accepted?)
How can you use Linq extension methods on a string? Can't figure out which using I am missing others than System.Linq
Ok looks like this is not available in PCL, IEnumerable<char> is conditional in Microsoft String implementation... And I am using Profile259 which does not support this :)
@GGirard strings are collections of char, so linq should work by default.
Peter Mortensen

Here is a simple linear alternative to the RegEx solution. I am not sure which is faster; you'd have to benchmark it.

static string RemoveWhitespace(string input)
{
    StringBuilder output = new StringBuilder(input.Length);

    for (int index = 0; index < input.Length; index++)
    {
        if (!Char.IsWhiteSpace(input, index))
        {
            output.Append(input[index]);
        }
    }
    return output.ToString();
}

user1325543

I needed to replace white space in a string with spaces, but not duplicate spaces. e.g., I needed to convert something like the following:

"a b   c\r\n d\t\t\t e"

to

"a b c d e"

I used the following method

private static string RemoveWhiteSpace(string value)
{
    if (value == null) { return null; }
    var sb = new StringBuilder();

    var lastCharWs = false;
    foreach (var c in value)
    {
        if (char.IsWhiteSpace(c))
        {
            if (lastCharWs) { continue; }
            sb.Append(' ');
            lastCharWs = true;
        }
        else
        {
            sb.Append(c);
            lastCharWs = false;
        }
    }
    return sb.ToString();
}

dtb

I assume your XML response looks like this:

var xml = @"<names>
                <name>
                    foo
                </name>
                <name>
                    bar
                </name>
            </names>";

The best way to process XML is to use an XML parser, such as LINQ to XML:

var doc = XDocument.Parse(xml);

var containsFoo = doc.Root
                     .Elements("name")
                     .Any(e => ((string)e).Trim() == "foo");

Once I verify that a particular tag has the proper value, I'm done. Wouldn't parsing the document have some overhead?
Sure, it has some overhead. But it has the benefit of being correct. A solution based e.g. on regex is much more difficult to get right. If you determine that a LINQ to XML solution is too slow, you can always replace it with something faster. But you should avoid hunting for the most efficient implementation before you know that the correct one is too slow.
This is going to be running on my employer's backend servers. Lightweight is what I'm looking for. I don't want something that merely "just works"; I want something optimal.
LINQ to XML is one of the most lightweight ways to correctly work with XML in .NET
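Tying this back to the question, the parser-based check could be wrapped up as the IsExistingWorkspace() method the asker mentions. This is a sketch only, assuming a C# 9 or later top-level program, the XML shape guessed at in this answer (name elements directly under the root), and a made-up sample response. The parameter names are illustrative.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Sketch of the asker's IsExistingWorkspace(); assumes <name> elements sit directly under the root.
static bool IsExistingWorkspace(string xml, string workspaceName)
{
    XDocument doc = XDocument.Parse(xml);
    return doc.Root != null && doc.Root
        .Elements("name")
        .Any(e => e.Value.Trim() == workspaceName); // Trim ignores whitespace around the name
}

// Made-up response: the first name is surrounded by newlines and indentation.
string response = "<names>\n  <name>\n    foo\n  </name>\n  <name>bar</name>\n</names>";

Console.WriteLine(IsExistingWorkspace(response, "foo")); // True
Console.WriteLine(IsExistingWorkspace(response, "bar")); // True
Console.WriteLine(IsExistingWorkspace(response, "fo"));  // False
```

The comparison stays case-sensitive, as the question requires, and no whitespace stripping over the whole response is needed.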
Tarik BENARAB

We can use:

    public static string RemoveWhitespace(this string input)
    {
        if (input == null)
            return null;
        return new string(input.ToCharArray()
            .Where(c => !Char.IsWhiteSpace(c))
            .ToArray());
    }

This is almost exactly the same as Henk's answer above. The only difference is that you check for null.
Yes, the null check is important.
Maybe this should have just been a comment on his answer. I am glad you brought it up though. I didn't know extension methods could be called on null objects.
Kewin Remy

Using Linq, you can write a readable method this way:

    public static string RemoveAllWhitespaces(this string source)
    {
        return string.IsNullOrEmpty(source) ? source : new string(source.Where(x => !char.IsWhiteSpace(x)).ToArray());
    }

larsemil

I think a lot of people come here just to remove spaces:

string s = "my string is nice";
s = s.Replace(" ", "");

The problem with this is that a space can be written in many different ways, as mentioned in other answers. This replace will work for roughly 90% of the cases.
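One case that slips through is the no-break space (U+00A0), which looks like a space but is a different character. A small sketch, assuming a C# 9 or later top-level program and a made-up string:

```csharp
using System;
using System.Linq;

// Made-up input containing a no-break space (U+00A0), a tab and one ordinary space.
string s = "my\u00A0string\tis nice";

string replaced = s.Replace(" ", "");
string stripped = new string(s.Where(c => !char.IsWhiteSpace(c)).ToArray());

Console.WriteLine(replaced.Length); // 16: only the ordinary space was removed
Console.WriteLine(stripped);        // mystringisnice
```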
Fred

Here is yet another variant:

public static string RemoveAllWhitespace(string aString)
{
  return String.Join(String.Empty, aString.Where(aChar => !Char.IsWhiteSpace(aChar)));
}

As with most of the other solutions, I haven't performed exhaustive benchmark tests, but this works well enough for my purposes.


hvanbrug

I have found different results to be true. I am trying to replace all whitespace with a single space and the regex was extremely slow.

return( Regex::Replace( text, L"\\s+", L" " ) );

What worked best for me (in C++/CLI) was:

String^ ReduceWhitespace( String^ text )
{
  String^ newText;
  bool    inWhitespace = false;
  Int32   posStart = 0;
  Int32   pos      = 0;
  for( pos = 0; pos < text->Length; ++pos )
  {
    wchar_t cc = text[pos];
    if( Char::IsWhiteSpace( cc ) )
    {
      if( !inWhitespace )
      {
        if( pos > posStart ) newText += text->Substring( posStart, pos - posStart );
        inWhitespace = true;
        newText += L' ';
      }
      posStart = pos + 1;
    }
    else
    {
      if( inWhitespace )
      {
        inWhitespace = false;
        posStart = pos;
      }
    }
  }

  if( pos > posStart ) newText += text->Substring( posStart, pos - posStart );

  return( newText );
}

I tried the above routine first by replacing each character separately, but had to switch to doing substrings for the non-space sections. When applying to a 1,200,000 character string:

the above routine gets it done in 25 seconds

the above routine + separate character replacement in 95 seconds

the regex aborted after 15 minutes.