ChatGPT解决这个技术问题 Extra ChatGPT

\d 效率低于 [0-9]

我昨天对有人在正则表达式中使用 [0123456789] 而不是 [0-9]\d 的答案发表了评论。我说过使用范围或数字说明符可能比字符集更有效。

我决定今天测试一下,我惊讶地发现(至少在 c# regex 引擎中)\d 似乎比其他两个似乎没有太大差异的效率低。这是我的测试输出超过 10000 个随机字符串的 1000 个随机字符,其中 5077 实际上包含一个数字:

Regex \d           took 00:00:00.2141226 result: 5077/10000
Regex [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regex [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

有两个原因让我感到惊讶,如果有人能解释一下,我会很感兴趣:

我原以为范围会比集合更有效地实施。我不明白为什么 \d 比 [0-9] 差。除了 [0-9] 的简写之外,\d 还有更多吗?

这是测试代码:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //in roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //replace 1 char with a digit 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}
可能 \d 处理语言环境。例如,希伯来语使用字母表示数字。
这是一个有趣的问题,正是因为 \d 在不同语言中的意思不同。在 Java 中,例如 \d does indeed match 0-9 only
@Barmar 希伯来语通常不使用字母表示数字,而是使用相同的拉丁数字数字 [0-9]。字母可以代替数字,但这是一种罕见的用途,并保留用于特殊术语。我不希望正则表达式解析器匹配 כ"ג יורדי סירה (用 כ"ג 代替 23)。此外,从 Sina Iravanian 的回答中可以看出,希伯来字母不会显示为 \d 的有效匹配项。
将 Weston 的代码移植到 Java 产生: -- Regex \d 占用 00:00:00.043922 结果:4912/10000 -- Regex [0-9] 占用 00:00:00.073658 结果:4912/10000 167% of first -- Regex [ 0123456789] 00:00:00.085799 结果:4912/10000 195% 的第一个

d
dakab

\d 检查所有 Unicode 数字,而 [0-9] 仅限于这 10 个字符。例如,Persian 位、۱۲۳۴۵۶۷۸۹ 是与 \d 匹配但与 [0-9] 不匹配的 Unicode 数字示例。

您可以使用以下代码生成所有此类字符的列表:

var sb = new StringBuilder();
for(UInt16 i = 0; i < UInt16.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());

生成:

0123456789 ٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८ ९୦୧୨୩୪୫୬୭୮୯᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧑᧒᧓᧔᧕᧖᧗᧘᧙᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙0123456789


以下是非 0-9 的更完整数字列表:fileformat.info/info/unicode/category/Nd/list.htm
@weston Unicode 有 17 个平面,每个平面 16 位。最重要的字符在基本平面,但一些特殊字符,主要是中文,在补充平面。在 C# 中处理这些有点烦人。
@RobertMcKee:Nitpick:完整的 unicode 字符集实际上是 21 位(17 个平面,每个平面 16 位)。但是当然 21 位数据类型是不切实际的,所以如果您使用 2 的幂数据类型,那么您确实需要 32 位。
根据 this Wikipedia article,Unicode 联盟声明 1,114,112 个代码点(0 到 0x010FFFF)的限制永远不会改变。它链接到 unicode.org,但我没有在那里找到该声明(我可能只是错过了它)。
它永远不会改变——直到他们需要改变它。
w
weston

感谢 ByteBlast 在文档中注意到这一点。只需更改正则表达式构造函数:

var rex = new Regex(regex, RegexOptions.ECMAScript);

给出新的时间:

Regex \d           took 00:00:00.1355787 result: 5077/10000
Regex [0-9]        took 00:00:00.1360403 result: 5077/10000  100.34 % of first
Regex [0123456789] took 00:00:00.1362112 result: 5077/10000  100.47 % of first

RegexOptions.ECMAScript 有什么作用?
来自 Regular Expression Options:“为表达式启用符合 ECMAScript 的行为。”
@0xFE:不完全是。 Unicode 转义在 ECMAScript (\u1234) 中仍然有效。它“只是”改变含义的速记字符类(如 \d)和消失的 Unicode 属性/脚本速记(如 \p{N})。
这不是对“为什么”部分的回答。这是一个“解决症状”的答案。还是有价值的信息。
通常,Regrex 支持 unicode 匹配。但 ECMAScript 没有。因此,在使用 RegexOptions.ECMAScript 时,它只匹配 ascii,即 0-9。
C
Community

Does “\d” in regex mean a digit?

[0-9] 不等于 \d。 [0-9] 仅匹配 0123456789 个字符,而 \d 匹配 [0-9] 和其他数字字符,例如东方阿拉伯数字 ٠١٢٣٤٥٦٧٨٩


根据:msdn.microsoft.com/en-us/library/20bw873z.aspx If ECMAScript-compliant behavior is specified, \d is equivalent to [0-9].
呵呵,我错了还是链接中的这句话正好相反。 “\d 匹配任何十进制数字。它相当于 \p{Nd} 正则表达式模式,它包括标准的十进制数字 0-9 以及许多其他字符集的十进制数字。”
@ByteBlast 谢谢,使用构造函数: var rex = new Regex(regex, RegexOptions.ECMAScript); 使它们在性能方面几乎无法区分。
哦,无论如何,谢谢大家。这个问题对我来说是一个很好的学习。
请不要“只是复制”其他问题的答案。如果问题是重复的,请将其标记为这样。
S
Sebastian

除了 Sina Iravianian 中的 top answer,这里是他的代码的 .NET 4.5 版本(因为只有那个版本支持 UTF16 输出,参见前三行),它使用了全范围的 Unicode 代码点。由于缺乏对高级 Unicode 平面的适当支持,许多人没有意识到总是检查并包含高级 Unicode 平面。尽管如此,它们有时确实包含一些重要的字符。

更新

由于 \d 不支持正则表达式中的非 BMP 字符(感谢 xanatos),这里有一个使用 Unicode 字符数据库的版本

更新 2

感谢 damilola-adegunwa,我添加了对 UCD 的缺失引用(通过 NuGet 包 UnicodeInformation)。还更新到最新的 .NET 核心版本和 UTF-8 输出。

// reference https://www.nuget.org/packages/UnicodeInformation/
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
using System.Unicode;
                    
public class Program
{
    public static void Main()
    {
        var unicodeEncoding = new UTF8Encoding(false);
        Console.OutputEncoding = unicodeEncoding;

        var numberCategories = new HashSet<UnicodeCategory>(new []{
            UnicodeCategory.DecimalDigitNumber,
            UnicodeCategory.LetterNumber,
            UnicodeCategory.OtherNumber
        });
        var numberLikeChars =
            from codePoint in Enumerable.Range(0, 0x10ffff)
            where codePoint > UInt16.MaxValue 
                || (!char.IsLowSurrogate((char) codePoint) && !char.IsHighSurrogate((char) codePoint))
            let charInfo = UnicodeInfo.GetCharInfo(codePoint)
            where numberCategories.Contains(charInfo.Category)
            let codePointString = char.ConvertFromUtf32(codePoint)
            select (codePoint, charInfo, codePointString);

        foreach (var (codePoint, charInfo, codePointString) in numberLikeChars)
        {
            Console.Write("U+{0} ", codePoint.ToString("X6"));
            Console.Write(" {0,-4}", codePointString);
            Console.Write(" {0,-40}", charInfo.Name ?? charInfo.OldName);
            Console.Write(" {0,-6}", CharUnicodeInfo.GetNumericValue(codePointString, 0));
            Console.Write(" {0,-6}", CharUnicodeInfo.GetDigitValue(codePointString, 0));
            Console.Write(" {0,-6}", CharUnicodeInfo.GetDecimalDigitValue(codePointString, 0));
            Console.WriteLine(" {0}", charInfo.Category);
        }
    }
}

产生以下输出:

U+000030  0    DIGIT ZERO                               0      0      0      DecimalDigitNumber
U+000031  1    DIGIT ONE                                1      1      1      DecimalDigitNumber
U+000032  2    DIGIT TWO                                2      2      2      DecimalDigitNumber
U+000033  3    DIGIT THREE                              3      3      3      DecimalDigitNumber
U+000034  4    DIGIT FOUR                               4      4      4      DecimalDigitNumber
U+000035  5    DIGIT FIVE                               5      5      5      DecimalDigitNumber
U+000036  6    DIGIT SIX                                6      6      6      DecimalDigitNumber
U+000037  7    DIGIT SEVEN                              7      7      7      DecimalDigitNumber
U+000038  8    DIGIT EIGHT                              8      8      8      DecimalDigitNumber
U+000039  9    DIGIT NINE                               9      9      9      DecimalDigitNumber
U+0000B2  ²    SUPERSCRIPT TWO                          2      2      -1     OtherNumber
U+0000B3  ³    SUPERSCRIPT THREE                        3      3      -1     OtherNumber
U+0000B9  ¹    SUPERSCRIPT ONE                          1      1      -1     OtherNumber
U+0000BC  ¼    VULGAR FRACTION ONE QUARTER              0.25   -1     -1     OtherNumber
U+0000BD  ½    VULGAR FRACTION ONE HALF                 0.5    -1     -1     OtherNumber
U+0000BE  ¾    VULGAR FRACTION THREE QUARTERS           0.75   -1     -1     OtherNumber
U+000660  ٠    ARABIC-INDIC DIGIT ZERO                  0      0      0      DecimalDigitNumber
U+000661  ١    ARABIC-INDIC DIGIT ONE                   1      1      1      DecimalDigitNumber
U+000662  ٢    ARABIC-INDIC DIGIT TWO                   2      2      2      DecimalDigitNumber
U+000663  ٣    ARABIC-INDIC DIGIT THREE                 3      3      3      DecimalDigitNumber
U+000664  ٤    ARABIC-INDIC DIGIT FOUR                  4      4      4      DecimalDigitNumber
U+000665  ٥    ARABIC-INDIC DIGIT FIVE                  5      5      5      DecimalDigitNumber
U+000666  ٦    ARABIC-INDIC DIGIT SIX                   6      6      6      DecimalDigitNumber
U+000667  ٧    ARABIC-INDIC DIGIT SEVEN                 7      7      7      DecimalDigitNumber
U+000668  ٨    ARABIC-INDIC DIGIT EIGHT                 8      8      8      DecimalDigitNumber
U+000669  ٩    ARABIC-INDIC DIGIT NINE                  9      9      9      DecimalDigitNumber
U+0006F0  ۰    EXTENDED ARABIC-INDIC DIGIT ZERO         0      0      0      DecimalDigitNumber
U+0006F1  ۱    EXTENDED ARABIC-INDIC DIGIT ONE          1      1      1      DecimalDigitNumber
U+0006F2  ۲    EXTENDED ARABIC-INDIC DIGIT TWO          2      2      2      DecimalDigitNumber
U+0006F3  ۳    EXTENDED ARABIC-INDIC DIGIT THREE        3      3      3      DecimalDigitNumber
U+0006F4  ۴    EXTENDED ARABIC-INDIC DIGIT FOUR         4      4      4      DecimalDigitNumber
U+0006F5  ۵    EXTENDED ARABIC-INDIC DIGIT FIVE         5      5      5      DecimalDigitNumber
U+0006F6  ۶    EXTENDED ARABIC-INDIC DIGIT SIX          6      6      6      DecimalDigitNumber
U+0006F7  ۷    EXTENDED ARABIC-INDIC DIGIT SEVEN        7      7      7      DecimalDigitNumber
U+0006F8  ۸    EXTENDED ARABIC-INDIC DIGIT EIGHT        8      8      8      DecimalDigitNumber
U+0006F9  ۹    EXTENDED ARABIC-INDIC DIGIT NINE         9      9      9      DecimalDigitNumber
U+0007C0  ߀    NKO DIGIT ZERO                           0      0      0      DecimalDigitNumber
U+0007C1  ߁    NKO DIGIT ONE                            1      1      1      DecimalDigitNumber
U+0007C2  ߂    NKO DIGIT TWO                            2      2      2      DecimalDigitNumber
U+0007C3  ߃    NKO DIGIT THREE                          3      3      3      DecimalDigitNumber
U+0007C4  ߄    NKO DIGIT FOUR                           4      4      4      DecimalDigitNumber
U+0007C5  ߅    NKO DIGIT FIVE                           5      5      5      DecimalDigitNumber
U+0007C6  ߆    NKO DIGIT SIX                            6      6      6      DecimalDigitNumber
U+0007C7  ߇    NKO DIGIT SEVEN                          7      7      7      DecimalDigitNumber
U+0007C8  ߈    NKO DIGIT EIGHT                          8      8      8      DecimalDigitNumber
U+0007C9  ߉    NKO DIGIT NINE                           9      9      9      DecimalDigitNumber
U+000966  ०    DEVANAGARI DIGIT ZERO                    0      0      0      DecimalDigitNumber
U+000967  १    DEVANAGARI DIGIT ONE                     1      1      1      DecimalDigitNumber
U+000968  २    DEVANAGARI DIGIT TWO                     2      2      2      DecimalDigitNumber
U+000969  ३    DEVANAGARI DIGIT THREE                   3      3      3      DecimalDigitNumber
U+00096A  ४    DEVANAGARI DIGIT FOUR                    4      4      4      DecimalDigitNumber
U+00096B  ५    DEVANAGARI DIGIT FIVE                    5      5      5      DecimalDigitNumber
U+00096C  ६    DEVANAGARI DIGIT SIX                     6      6      6      DecimalDigitNumber
U+00096D  ७    DEVANAGARI DIGIT SEVEN                   7      7      7      DecimalDigitNumber
U+00096E  ८    DEVANAGARI DIGIT EIGHT                   8      8      8      DecimalDigitNumber
U+00096F  ९    DEVANAGARI DIGIT NINE                    9      9      9      DecimalDigitNumber
U+0009E6  ০    BENGALI DIGIT ZERO                       0      0      0      DecimalDigitNumber
U+0009E7  ১    BENGALI DIGIT ONE                        1      1      1      DecimalDigitNumber
U+0009E8  ২    BENGALI DIGIT TWO                        2      2      2      DecimalDigitNumber
U+0009E9  ৩    BENGALI DIGIT THREE                      3      3      3      DecimalDigitNumber
U+0009EA  ৪    BENGALI DIGIT FOUR                       4      4      4      DecimalDigitNumber
U+0009EB  ৫    BENGALI DIGIT FIVE                       5      5      5      DecimalDigitNumber
U+0009EC  ৬    BENGALI DIGIT SIX                        6      6      6      DecimalDigitNumber
U+0009ED  ৭    BENGALI DIGIT SEVEN                      7      7      7      DecimalDigitNumber
U+0009EE  ৮    BENGALI DIGIT EIGHT                      8      8      8      DecimalDigitNumber
U+0009EF  ৯    BENGALI DIGIT NINE                       9      9      9      DecimalDigitNumber
U+0009F4  ৴    BENGALI CURRENCY NUMERATOR ONE           0.0625 -1     -1     OtherNumber
U+0009F5  ৵    BENGALI CURRENCY NUMERATOR TWO           0.125  -1     -1     OtherNumber
U+0009F6  ৶    BENGALI CURRENCY NUMERATOR THREE         0.1875 -1     -1     OtherNumber
U+0009F7  ৷    BENGALI CURRENCY NUMERATOR FOUR          0.25   -1     -1     OtherNumber
U+0009F8  ৸    BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR 0.75   -1     -1     OtherNumber
U+0009F9  ৹    BENGALI CURRENCY DENOMINATOR SIXTEEN     16     -1     -1     OtherNumber
U+000A66  ੦    GURMUKHI DIGIT ZERO                      0      0      0      DecimalDigitNumber
U+000A67  ੧    GURMUKHI DIGIT ONE                       1      1      1      DecimalDigitNumber
U+000A68  ੨    GURMUKHI DIGIT TWO                       2      2      2      DecimalDigitNumber
U+000A69  ੩    GURMUKHI DIGIT THREE                     3      3      3      DecimalDigitNumber
U+000A6A  ੪    GURMUKHI DIGIT FOUR                      4      4      4      DecimalDigitNumber
U+000A6B  ੫    GURMUKHI DIGIT FIVE                      5      5      5      DecimalDigitNumber
U+000A6C  ੬    GURMUKHI DIGIT SIX                       6      6      6      DecimalDigitNumber
U+000A6D  ੭    GURMUKHI DIGIT SEVEN                     7      7      7      DecimalDigitNumber
U+000A6E  ੮    GURMUKHI DIGIT EIGHT                     8      8      8      DecimalDigitNumber
U+000A6F  ੯    GURMUKHI DIGIT NINE                      9      9      9      DecimalDigitNumber
U+000AE6  ૦    GUJARATI DIGIT ZERO                      0      0      0      DecimalDigitNumber
U+000AE7  ૧    GUJARATI DIGIT ONE                       1      1      1      DecimalDigitNumber
U+000AE8  ૨    GUJARATI DIGIT TWO                       2      2      2      DecimalDigitNumber
U+000AE9  ૩    GUJARATI DIGIT THREE                     3      3      3      DecimalDigitNumber
U+000AEA  ૪    GUJARATI DIGIT FOUR                      4      4      4      DecimalDigitNumber
U+000AEB  ૫    GUJARATI DIGIT FIVE                      5      5      5      DecimalDigitNumber
U+000AEC  ૬    GUJARATI DIGIT SIX                       6      6      6      DecimalDigitNumber
U+000AED  ૭    GUJARATI DIGIT SEVEN                     7      7      7      DecimalDigitNumber
U+000AEE  ૮    GUJARATI DIGIT EIGHT                     8      8      8      DecimalDigitNumber
U+000AEF  ૯    GUJARATI DIGIT NINE                      9      9      9      DecimalDigitNumber
U+000B66  ୦    ORIYA DIGIT ZERO                         0      0      0      DecimalDigitNumber
U+000B67  ୧    ORIYA DIGIT ONE                          1      1      1      DecimalDigitNumber
U+000B68  ୨    ORIYA DIGIT TWO                          2      2      2      DecimalDigitNumber
U+000B69  ୩    ORIYA DIGIT THREE                        3      3      3      DecimalDigitNumber
U+000B6A  ୪    ORIYA DIGIT FOUR                         4      4      4      DecimalDigitNumber
U+000B6B  ୫    ORIYA DIGIT FIVE                         5      5      5      DecimalDigitNumber
U+000B6C  ୬    ORIYA DIGIT SIX                          6      6      6      DecimalDigitNumber
U+000B6D  ୭    ORIYA DIGIT SEVEN                        7      7      7      DecimalDigitNumber
U+000B6E  ୮    ORIYA DIGIT EIGHT                        8      8      8      DecimalDigitNumber
U+000B6F  ୯    ORIYA DIGIT NINE                         9      9      9      DecimalDigitNumber
U+000B72  ୲    ORIYA FRACTION ONE QUARTER               0.25   -1     -1     OtherNumber
U+000B73  ୳    ORIYA FRACTION ONE HALF                  0.5    -1     -1     OtherNumber
U+000B74  ୴    ORIYA FRACTION THREE QUARTERS            0.75   -1     -1     OtherNumber
U+000B75  ୵    ORIYA FRACTION ONE SIXTEENTH             0.0625 -1     -1     OtherNumber
U+000B76  ୶    ORIYA FRACTION ONE EIGHTH                0.125  -1     -1     OtherNumber
U+000B77  ୷    ORIYA FRACTION THREE SIXTEENTHS          0.1875 -1     -1     OtherNumber
U+000BE6  ௦    TAMIL DIGIT ZERO                         0      0      0      DecimalDigitNumber
U+000BE7  ௧    TAMIL DIGIT ONE                          1      1      1      DecimalDigitNumber
U+000BE8  ௨    TAMIL DIGIT TWO                          2      2      2      DecimalDigitNumber
U+000BE9  ௩    TAMIL DIGIT THREE                        3      3      3      DecimalDigitNumber
U+000BEA  ௪    TAMIL DIGIT FOUR                         4      4      4      DecimalDigitNumber
U+000BEB  ௫    TAMIL DIGIT FIVE                         5      5      5      DecimalDigitNumber
U+000BEC  ௬    TAMIL DIGIT SIX                          6      6      6      DecimalDigitNumber
U+000BED  ௭    TAMIL DIGIT SEVEN                        7      7      7      DecimalDigitNumber
U+000BEE  ௮    TAMIL DIGIT EIGHT                        8      8      8      DecimalDigitNumber
U+000BEF  ௯    TAMIL DIGIT NINE                         9      9      9      DecimalDigitNumber
U+000BF0  ௰    TAMIL NUMBER TEN                         10     -1     -1     OtherNumber
U+000BF1  ௱    TAMIL NUMBER ONE HUNDRED                 100    -1     -1     OtherNumber
U+000BF2  ௲    TAMIL NUMBER ONE THOUSAND                1000   -1     -1     OtherNumber
U+000C66  ౦    TELUGU DIGIT ZERO                        0      0      0      DecimalDigitNumber
U+000C67  ౧    TELUGU DIGIT ONE                         1      1      1      DecimalDigitNumber
U+000C68  ౨    TELUGU DIGIT TWO                         2      2      2      DecimalDigitNumber
U+000C69  ౩    TELUGU DIGIT THREE                       3      3      3      DecimalDigitNumber
U+000C6A  ౪    TELUGU DIGIT FOUR                        4      4      4      DecimalDigitNumber
U+000C6B  ౫    TELUGU DIGIT FIVE                        5      5      5      DecimalDigitNumber
U+000C6C  ౬    TELUGU DIGIT SIX                         6      6      6      DecimalDigitNumber
U+000C6D  ౭    TELUGU DIGIT SEVEN                       7      7      7      DecimalDigitNumber
U+000C6E  ౮    TELUGU DIGIT EIGHT                       8      8      8      DecimalDigitNumber
U+000C6F  ౯    TELUGU DIGIT NINE                        9      9      9      DecimalDigitNumber
U+000C78  ౸    TELUGU FRACTION DIGIT ZERO FOR ODD POWERS OF FOUR 0      -1     -1     OtherNumber
U+000C79  ౹    TELUGU FRACTION DIGIT ONE FOR ODD POWERS OF FOUR 1      -1     -1     OtherNumber
U+000C7A  ౺    TELUGU FRACTION DIGIT TWO FOR ODD POWERS OF FOUR 2      -1     -1     OtherNumber
U+000C7B  ౻    TELUGU FRACTION DIGIT THREE FOR ODD POWERS OF FOUR 3      -1     -1     OtherNumber
U+000C7C  ౼    TELUGU FRACTION DIGIT ONE FOR EVEN POWERS OF FOUR 1      -1     -1     OtherNumber
U+000C7D  ౽    TELUGU FRACTION DIGIT TWO FOR EVEN POWERS OF FOUR 2      -1     -1     OtherNumber
U+000C7E  ౾    TELUGU FRACTION DIGIT THREE FOR EVEN POWERS OF FOUR 3      -1     -1     OtherNumber
U+000CE6  ೦    KANNADA DIGIT ZERO                       0      0      0      DecimalDigitNumber
U+000CE7  ೧    KANNADA DIGIT ONE                        1      1      1      DecimalDigitNumber
U+000CE8  ೨    KANNADA DIGIT TWO                        2      2      2      DecimalDigitNumber
U+000CE9  ೩    KANNADA DIGIT THREE                      3      3      3      DecimalDigitNumber
U+000CEA  ೪    KANNADA DIGIT FOUR                       4      4      4      DecimalDigitNumber
U+000CEB  ೫    KANNADA DIGIT FIVE                       5      5      5      DecimalDigitNumber
U+000CEC  ೬    KANNADA DIGIT SIX                        6      6      6      DecimalDigitNumber
U+000CED  ೭    KANNADA DIGIT SEVEN                      7      7      7      DecimalDigitNumber
U+000CEE  ೮    KANNADA DIGIT EIGHT                      8      8      8      DecimalDigitNumber
U+000CEF  ೯    KANNADA DIGIT NINE                       9      9      9      DecimalDigitNumber
U+000D58  ൘    MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH 0.00625 -1     -1     OtherNumber
U+000D59  ൙    MALAYALAM FRACTION ONE FORTIETH          0.025  -1     -1     OtherNumber
U+000D5A  ൚    MALAYALAM FRACTION THREE EIGHTIETHS      0.0375 -1     -1     OtherNumber
U+000D5B  ൛    MALAYALAM FRACTION ONE TWENTIETH         0.05   -1     -1     OtherNumber
U+000D5C  ൜    MALAYALAM FRACTION ONE TENTH             0.1    -1     -1     OtherNumber
U+000D5D  ൝    MALAYALAM FRACTION THREE TWENTIETHS      0.15   -1     -1     OtherNumber
U+000D5E  ൞    MALAYALAM FRACTION ONE FIFTH             0.2    -1     -1     OtherNumber
U+000D66  ൦    MALAYALAM DIGIT ZERO                     0      0      0      DecimalDigitNumber
U+000D67  ൧    MALAYALAM DIGIT ONE                      1      1      1      DecimalDigitNumber
U+000D68  ൨    MALAYALAM DIGIT TWO                      2      2      2      DecimalDigitNumber
U+000D69  ൩    MALAYALAM DIGIT THREE                    3      3      3      DecimalDigitNumber
U+000D6A  ൪    MALAYALAM DIGIT FOUR                     4      4      4      DecimalDigitNumber
U+000D6B  ൫    MALAYALAM DIGIT FIVE                     5      5      5      DecimalDigitNumber
U+000D6C  ൬    MALAYALAM DIGIT SIX                      6      6      6      DecimalDigitNumber
U+000D6D  ൭    MALAYALAM DIGIT SEVEN                    7      7      7      DecimalDigitNumber
U+000D6E  ൮    MALAYALAM DIGIT EIGHT                    8      8      8      DecimalDigitNumber
U+000D6F  ൯    MALAYALAM DIGIT NINE                     9      9      9      DecimalDigitNumber
U+000D70  ൰    MALAYALAM NUMBER TEN                     10     -1     -1     OtherNumber
U+000D71  ൱    MALAYALAM NUMBER ONE HUNDRED             100    -1     -1     OtherNumber
U+000D72  ൲    MALAYALAM NUMBER ONE THOUSAND            1000   -1     -1     OtherNumber
U+000D73  ൳    MALAYALAM FRACTION ONE QUARTER           0.25   -1     -1     OtherNumber
U+000D74  ൴    MALAYALAM FRACTION ONE HALF              0.5    -1     -1     OtherNumber
U+000D75  ൵    MALAYALAM FRACTION THREE QUARTERS        0.75   -1     -1     OtherNumber
U+000D76  ൶    MALAYALAM FRACTION ONE SIXTEENTH         0.0625 -1     -1     OtherNumber
U+000D77  ൷    MALAYALAM FRACTION ONE EIGHTH            0.125  -1     -1     OtherNumber
U+000D78  ൸    MALAYALAM FRACTION THREE SIXTEENTHS      0.1875 -1     -1     OtherNumber
U+000DE6  ෦    SINHALA LITH DIGIT ZERO                  0      0      0      DecimalDigitNumber
U+000DE7  ෧    SINHALA LITH DIGIT ONE                   1      1      1      DecimalDigitNumber
U+000DE8  ෨    SINHALA LITH DIGIT TWO                   2      2      2      DecimalDigitNumber
U+000DE9  ෩    SINHALA LITH DIGIT THREE                 3      3      3      DecimalDigitNumber
U+000DEA  ෪    SINHALA LITH DIGIT FOUR                  4      4      4      DecimalDigitNumber
U+000DEB  ෫    SINHALA LITH DIGIT FIVE                  5      5      5      DecimalDigitNumber
U+000DEC  ෬    SINHALA LITH DIGIT SIX                   6      6      6      DecimalDigitNumber
U+000DED  ෭    SINHALA LITH DIGIT SEVEN                 7      7      7      DecimalDigitNumber
U+000DEE  ෮    SINHALA LITH DIGIT EIGHT                 8      8      8      DecimalDigitNumber
U+000DEF  ෯    SINHALA LITH DIGIT NINE                  9      9      9      DecimalDigitNumber
U+000E50  ๐    THAI DIGIT ZERO                          0      0      0      DecimalDigitNumber
U+000E51  ๑    THAI DIGIT ONE                           1      1      1      DecimalDigitNumber
U+000E52  ๒    THAI DIGIT TWO                           2      2      2      DecimalDigitNumber
U+000E53  ๓    THAI DIGIT THREE                         3      3      3      DecimalDigitNumber
U+000E54  ๔    THAI DIGIT FOUR                          4      4      4      DecimalDigitNumber
U+000E55  ๕    THAI DIGIT FIVE                          5      5      5      DecimalDigitNumber
U+000E56  ๖    THAI DIGIT SIX                           6      6      6      DecimalDigitNumber
U+000E57  ๗    THAI DIGIT SEVEN                         7      7      7      DecimalDigitNumber
U+000E58  ๘    THAI DIGIT EIGHT                         8      8      8      DecimalDigitNumber
U+000E59  ๙    THAI DIGIT NINE                          9      9      9      DecimalDigitNumber
U+000ED0  ໐    LAO DIGIT ZERO                           0      0      0      DecimalDigitNumber
U+000ED1  ໑    LAO DIGIT ONE                            1      1      1      DecimalDigitNumber
U+000ED2  ໒    LAO DIGIT TWO                            2      2      2      DecimalDigitNumber
U+000ED3  ໓    LAO DIGIT THREE                          3      3      3      DecimalDigitNumber
U+000ED4  ໔    LAO DIGIT FOUR                           4      4      4      DecimalDigitNumber
U+000ED5  ໕    LAO DIGIT FIVE                           5      5      5      DecimalDigitNumber
U+000ED6  ໖    LAO DIGIT SIX                            6      6      6      DecimalDigitNumber
U+000ED7  ໗    LAO DIGIT SEVEN                          7      7      7      DecimalDigitNumber
U+000ED8  ໘    LAO DIGIT EIGHT                          8      8      8      DecimalDigitNumber
U+000ED9  ໙    LAO DIGIT NINE                           9      9      9      DecimalDigitNumber
...
U+01F10B  🄋   DINGBAT CIRCLED SANS-SERIF DIGIT ZERO    0      -1     -1     OtherNumber
U+01F10C  🄌   DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO 0      -1     -1     OtherNumber
U+01FBF0  🯰   SEGMENTED DIGIT ZERO                     -1     -1     -1     DecimalDigitNumber
U+01FBF1  🯱   SEGMENTED DIGIT ONE                      -1     -1     -1     DecimalDigitNumber
U+01FBF2  🯲   SEGMENTED DIGIT TWO                      -1     -1     -1     DecimalDigitNumber
U+01FBF3  🯳   SEGMENTED DIGIT THREE                    -1     -1     -1     DecimalDigitNumber
U+01FBF4  🯴   SEGMENTED DIGIT FOUR                     -1     -1     -1     DecimalDigitNumber
U+01FBF5  🯵   SEGMENTED DIGIT FIVE                     -1     -1     -1     DecimalDigitNumber
U+01FBF6  🯶   SEGMENTED DIGIT SIX                      -1     -1     -1     DecimalDigitNumber
U+01FBF7  🯷   SEGMENTED DIGIT SEVEN                    -1     -1     -1     DecimalDigitNumber
U+01FBF8  🯸   SEGMENTED DIGIT EIGHT                    -1     -1     -1     DecimalDigitNumber
U+01FBF9  🯹   SEGMENTED DIGIT NINE                     -1     -1     -1     DecimalDigitNumber

可悲的是 Win32 控制台不显示星体字符
如果我没记错的话,遗憾的是 .NET Regex 不支持非 BMP 字符。所以最后检查字符&gt;带有正则表达式的 0xffff 是没用的。
这段代码“charInfo[category]”显示错误! (可能是错字)
谢谢@DamilolaAdegunwa,我添加了缺失的部分,并将代码格式化为更“现代”
d
dengkai

\d 检查所有 Unicode,而 [0-9] 仅限于这 10 个字符。如果只有 10 位数字,你应该使用。其他的我推荐使用\d,因为写的少。