(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.

However, is there a regular expression for 'any character that's not an ASCII character'?

Paul, yes I can use perl
/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]

Peter Mortensen

This will match a single non-ASCII character:

[^\x00-\x7F]

This is a valid PCRE (Perl-Compatible Regular Expression).

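You can also use the POSIX-style bracket classes (note that [:ascii:] is a Perl/PCRE extension, not one of the standard POSIX classes, so a tool that only knows the POSIX classes will reject it):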

[[:ascii:]] - matches a single ASCII char

[^[:ascii:]] - matches a single non-ASCII char

[^[:print:]] will probably suffice for you.


@adrianm: No, ^ is valid in PCRE.
That's exactly right. However, you have to use pcregrep, not standard grep. [^[:print:]] won't work if your terminal is set up for UTF-8.
@Rory, why won't [:print:] work in a UTF-8 terminal? This works for me in pry in a UTF-8 terminal: 27.chr =~ /[^[:print:]]/
This is really nice for fixing bad filenames - rename 's/[^\x00-\x7F]//g' * (you can use -n to check the renames are ok first).
How do I match any character that is non-UTF8 and any other specific characters?
Peter Mortensen

No, [^\x20-\x7E] is not the complement of ASCII; it only excludes the printable range.

This covers the full ASCII table:

 [^\x00-\x7F]

Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!


Rubens Farias

You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:

\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.

Mike Laren

You can use this regex:

[^\w \xC0-\xFF]

In case you ask, the option to set is Multiline.


user1133275

[^\x00-\x7F] and [^[:ascii:]] miss some control bytes, so strings can sometimes be the better option. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.


Othman Mahmoud

To validate that a text box accepts ASCII only, use this pattern:

[\x00-\x7F]+


Matthijs

I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.


tripleee

You don't really need a regex.

printf "%s\n" *[!\ -~]*

This will show file names with control characters in their names, too, but I consider that a feature.

If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
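
To turn the glob into a count without relying on nullglob, each expanded word can be tested for existence, which filters out a pattern that matched nothing. This is a rough sketch under default glob settings; count_odd_names is a hypothetical name.

```shell
# Hypothetical helper: count directory entries whose names contain any
# byte outside printable ASCII (space through tilde).
count_odd_names() (
    # The ( ... ) body runs in a subshell, so cd and LC_ALL stay local.
    cd "${1:-.}" || exit 1
    LC_ALL=C    # make the range "space to tilde" byte-based
    n=0
    # The second pattern covers dot files, which a leading * skips.
    for f in *[!\ -~]* .*[!\ -~]*; do
        # An unmatched pattern is left as literal text; skip it.
        [ -e "$f" ] || [ -L "$f" ] || continue
        n=$((n + 1))
    done
    echo "$n"
)
```

As with the printf version, names containing control characters such as a tab are counted too.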


Belatedly, I can observe that this does work correctly if you actually have some files which match this pattern. The behavior where the pattern prints itself when there are no matches is slightly surprising but actually correct. I edited the answer to hopefully clarify this.
Note that the behaviour depends on the current bash settings. I would recommend shopt -s nullglob dotglob globasciiranges to skip the non-matching patterns, to include the dotted filenames like .tmp§ and not to depend on the current locale. I mean setting it temporarily just for this particular command, otherwise the default settings are fine.
Don Turnblade

This turned out to be very flexible and extensible: $field =~ s/[^\x00-\x7F]//g; removes every non-ASCII character, and the character class can be extended to clean out any other specific items in question. It is very nice for either selecting or pre-processing items that will eventually become hash keys.