
Test if characters are in a string

I'm trying to determine if a string is a subset of another string. For example:

chars <- "test"
value <- "es"

I want to return TRUE if "value" appears as part of the string "chars". In the following scenario, I would want to return false:

chars <- "test"
value <- "et"
The accepted answer is wrong: you need to add fixed=TRUE, otherwise the pattern is treated as a regex instead of a literal string. See my answer from October 2016.
@JoshuaCheek Unless you have special characters in your pattern, regex will return the same result as fixed.
Sure, but you can only know that if you're passing it a literal. Otherwise, you won't know what characters are in the pattern, so you either use fixed=TRUE or you have a bug that will quietly and subtly mess up your data.

Abel Callejo

Use the grepl function

grepl(needle, haystack, fixed = TRUE)

like so:

grepl(value, chars, fixed = TRUE)
# TRUE

Use ?grepl to find out more.
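
As a minimal sketch, here are the two cases from the question wrapped in a small helper. The name contains is hypothetical (it is not part of base R) and is only there to make the argument order explicit:

```r
# Hypothetical helper: TRUE when `needle` occurs literally inside `haystack`.
contains <- function(haystack, needle) {
  grepl(needle, haystack, fixed = TRUE)
}

chars <- "test"
contains(chars, "es")  # TRUE:  "es" is a contiguous part of "test"
contains(chars, "et")  # FALSE: "e" and "t" both occur, but not next to each other
```

Because grepl is vectorized over the haystack, the same helper also accepts a character vector as its first argument.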


For this simple case adding fixed=TRUE may improve performance (assuming that you will be doing a lot of these computations).
@Josh O'brien, that post compared finding (counting) all the matches in a single long string, try finding 1 match in a bunch of shorter strings: vec <- replicate(100000, paste( sample(letters, 10, replace=TRUE), collapse='') ).
@GregSnow -- Tried system.time(a <- grepl("abc", vec)) and system.time(a <- grepl("abc", vec, fixed=TRUE)), and fixed=TRUE is still, if anything slightly slower. The difference isn't appreciable with these short strings, but fixed=TRUE still doesn't seem to be faster. Thanks for pointing out, though, that it's on long strings that fixed=TRUE takes the real hit.
grepl(pattern, x) at least in 2017
This should not be the accepted answer, because value will be interpreted as a regex pattern. fixed=TRUE should always be used unless you know the string you are searching for will not happen to look like a regex pattern. Joshua Cheek's answer below has a very clear explanation of this, and should be the accepted answer.
Joshua Cheek

Answer

Sigh, it took me 45 minutes to find the answer to this simple question. The answer is: grepl(needle, haystack, fixed=TRUE)

# Correct
> grepl("1+2", "1+2", fixed=TRUE)
[1] TRUE
> grepl("1+2", "123+456", fixed=TRUE)
[1] FALSE

# Incorrect
> grepl("1+2", "1+2")
[1] FALSE
> grepl("1+2", "123+456")
[1] TRUE

Interpretation

grep is named after the Unix command-line program, whose name comes from the ed editor command g/re/p: "Globally search for a Regular Expression and Print". It reads lines of input and prints the ones that match the arguments you gave it. "Global" means every line of input is checked. I'll explain "Regular Expression" below, but the idea is that it's a smarter way to match a string (R calls this type "character", e.g. class("abc")). And "Print" because it's a command-line program, so emitting output means printing to its output stream.

Now, the grep program is basically a filter from lines of input to lines of output, and it seems that R's grep function similarly takes a vector of inputs. For reasons that are utterly unknown to me (I only started playing with R about an hour ago), it returns a vector of the indexes that match, rather than the matches themselves.
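
To make the difference between the return types concrete, here is a small sketch on a made-up vector (the haystacks values are purely illustrative):

```r
haystacks <- c("test", "toast", "best")

# Default: integer indices of the elements that match.
idx <- grep("es", haystacks, fixed = TRUE)                 # 1 3

# value = TRUE: the matching elements themselves.
vals <- grep("es", haystacks, fixed = TRUE, value = TRUE)  # "test" "best"

# grepl: one logical value per element.
hits <- grepl("es", haystacks, fixed = TRUE)               # TRUE FALSE TRUE
```

The three results carry the same information: haystacks[idx] and haystacks[hits] both give vals.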

But, back to your original question, what we really want is to know whether we found the needle in the haystack, a true/false value. They apparently decided to name this function grepl, as in "grep" but with a "Logical" return value (they call true and false logical values, eg class(TRUE)).

So, now we know where the name came from and what it's supposed to do. Let's get back to Regular Expressions. Even though the pattern argument is a string, it is used to build a regular expression (henceforth: regex). A regex is a way to match a string (if this definition irritates you, let it go). For example, the regex a matches the character "a", the regex a* matches the character "a" 0 or more times, and the regex a+ matches the character "a" 1 or more times. Hence in the example above, the needle we are searching for, 1+2, when treated as a regex, means "one or more 1 followed by a 2"... but ours is followed by a plus!

https://i.stack.imgur.com/cTye4.png

So, if you used grepl without setting fixed, your needles would accidentally be treated as regexes, and that would accidentally work quite often; we can see it even works for the OP's example. But that's a latent bug! We need to tell it the input is a literal string, not a regex, which is apparently what fixed is for. Why "fixed"? No clue. Bookmark this answer, because you're probably going to have to look it up 5 more times before you get it memorized.
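
A short sketch of the same point on three illustrative haystacks, showing the regex reading, the literal reading, and the escaped-regex equivalent of the literal reading:

```r
haystacks <- c("12", "112", "1+2")

# Default: "1+2" is a regex meaning one or more "1" followed by "2".
as_regex <- grepl("1+2", haystacks)                  # TRUE TRUE FALSE

# fixed = TRUE: "1+2" is matched character for character.
as_literal <- grepl("1+2", haystacks, fixed = TRUE)  # FALSE FALSE TRUE

# Escaping the plus gives the same result, but only if you control the pattern.
as_escaped <- grepl("1\\+2", haystacks)              # FALSE FALSE TRUE
```

Escaping only helps when the pattern is a literal in your source; for a needle that arrives in a variable, fixed = TRUE is the simpler guard.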

A few final thoughts

The better your code is, the less history you have to know to make sense of it. Every argument can have at least two interesting values (otherwise it wouldn't need to be an argument), and the docs list 9 arguments here, which means there are at least 2^9=512 ways to invoke it. That's a lot of work to write, test, and remember, so decouple such functions: split them up and remove their dependencies on each other (string things are different from regex things, which are different from vector things). Some of the options are also mutually exclusive. Don't give users incorrect ways to use the code, i.e. the problematic invocation should be structurally nonsensical (such as passing an option that doesn't exist), not logically nonsensical (where you have to emit a warning to explain it). Put metaphorically: replacing the front door in the side of the 10th floor with a wall is better than hanging a sign that warns against its use, but either is better than neither. In an interface, the function defines what the arguments should look like, not the caller. The caller already depends on the function; inferring everything that everyone might ever want to call it with makes the function depend on the callers too, and this type of cyclical dependency will quickly clog a system up and never provide the benefits you expect. Be very wary of conflating types; it's a design flaw that things like TRUE and 0 and "abc" are all vectors.


Cheers for your explanation! It appears R evolved over a long period of time and is stuck with some weird design choices (see eg. answers to this question on value types). However, returning a vector of match indices seems appropriate in this case, as grep is filtering rows, not cells.
"fixed" refers to the characters matching a "fixed" sequence.
TL;DR: "fixed" means that the search pattern should not be treated as a regular expression.
If you read the ?grep page, you'll find that value=TRUE makes it behave as you naively expected. Going on and on about how it doesn't behave as you expect just means you didn't read the docs before starting your rant.
Justin

You want grepl:

> chars <- "test"
> value <- "es"
> grepl(value, chars)
[1] TRUE
> chars <- "test"
> value <- "et"
> grepl(value, chars)
[1] FALSE

Surya

This can also be done using the stringr package:

> library(stringr)
> chars <- "test"
> value <- "es"
> str_detect(chars, value)
[1] TRUE

# For multiple values:
> value <- c("es", "l", "est", "a", "test")
> str_detect(chars, value)
[1]  TRUE FALSE  TRUE FALSE  TRUE
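
Note that str_detect also treats its pattern as a regex by default. A minimal sketch, assuming stringr is installed, of wrapping the pattern in stringr's fixed() modifier to get a literal match instead (the input values are illustrative):

```r
library(stringr)

chars <- c("1+2", "112")

# Default: the pattern is a regex (one or more "1" followed by "2").
regex_hits <- str_detect(chars, "1+2")           # FALSE TRUE

# fixed(): the pattern is compared character for character.
literal_hits <- str_detect(chars, fixed("1+2"))  # TRUE FALSE
```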

gagolews

Use this function from the stringi package:

> stri_detect_fixed("test", c("et", "es"))
[1] FALSE  TRUE

Some benchmarks:

library(stringi)
set.seed(123L)
value <- stri_rand_strings(10000, ceiling(runif(10000, 1, 100))) # 10000 random ASCII strings
head(value)

chars <- "es"
library(microbenchmark)
microbenchmark(
   grepl(chars, value),
   grepl(chars, value, fixed=TRUE),
   grepl(chars, value, perl=TRUE),
   stri_detect_fixed(value, chars),
   stri_detect_regex(value, chars)
)
## Unit: milliseconds
##                               expr       min        lq    median        uq       max neval
##                grepl(chars, value) 13.682876 13.943184 14.057991 14.295423 15.443530   100
##  grepl(chars, value, fixed = TRUE)  5.071617  5.110779  5.281498  5.523421 45.243791   100
##   grepl(chars, value, perl = TRUE)  1.835558  1.873280  1.956974  2.259203  3.506741   100
##    stri_detect_fixed(value, chars)  1.191403  1.233287  1.309720  1.510677  2.821284   100
##    stri_detect_regex(value, chars)  6.043537  6.154198  6.273506  6.447714  7.884380   100

C. Zeng

In case you would also like to check whether a string (or a set of strings) contains any of several substrings, you can put '|' between the substrings. Note that this relies on the pattern being treated as a regex, so it does not work with fixed=TRUE.

> substring <- "as|at"
> string_vector <- c("ass", "ear", "eye", "heat")
> grepl(substring, string_vector)

You will get

[1]  TRUE FALSE FALSE  TRUE

since the first word contains the substring "as" and the last word contains the substring "at".
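
If the substrings might contain regex metacharacters, one possible alternative is to match each one literally and combine the results. This is only a sketch; the names needles, hits and any_hit are made up for illustration:

```r
string_vector <- c("ass", "ear", "eye", "heat")
needles <- c("as", "at")

# One logical vector per needle, each matched literally...
hits <- lapply(needles, function(n) grepl(n, string_vector, fixed = TRUE))

# ...then combined element-wise with OR.
any_hit <- Reduce(`|`, hits)
any_hit  # TRUE FALSE FALSE TRUE
```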


The OR operator was exactly what I needed! +1
Chris

Use grep or grepl but be aware of whether or not you want to use regular expressions.

By default, grep and related functions take a regular expression to match, not a literal substring. If you're not expecting that, and you try to match on an invalid regex, it doesn't work:

> grep("[", "abc[")
Error in grep("[", "abc[") : 
  invalid regular expression '[', reason 'Missing ']''

To do a true substring test, use fixed = TRUE.

> grep("[", "abc[", fixed = TRUE)
[1] 1

If you do want regex, great, but that's not what the OP appears to be asking.


nico

You can use grep

grep("es", "Test")
[1] 1
grep("et", "Test")
integer(0)

Alex L

Similar problem here: Given a string and a list of keywords, detect which, if any, of the keywords are contained in the string.

Recommendations from this thread suggest stringr's str_detect and grepl. Here are the benchmarks from the microbenchmark package:

Using

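library(stringr)
library(microbenchmark)

map_keywords <- c("once", "twice", "few")
t <- "yes but only a few times"

# str_detect is vectorized over the pattern; keywords are treated as regexes here
mapper1 <- function(x) {
  r <- str_detect(x, map_keywords)
  r
}

# grepl is not vectorized over the pattern, so loop over the keywords (literal match)
mapper2 <- function(x) {
  r <- sapply(map_keywords, function(k) grepl(k, x, fixed = TRUE))
  r
}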

and then

microbenchmark(mapper1(t), mapper2(t), times = 5000)

we find

Unit: microseconds
       expr    min     lq     mean  median      uq      max neval
 mapper1(t) 26.401 27.988 31.32951 28.8430 29.5225 2091.476  5000
 mapper2(t) 19.289 20.767 24.94484 23.7725 24.6220 1011.837  5000

As you can see, over 5,000 iterations of the keyword search on a practical string and vector of keywords, grepl is somewhat faster than str_detect: a median of about 24 microseconds versus about 29.

The result is the logical vector r, which identifies which, if any, of the keywords are contained in the string.

Therefore, I recommend using grepl to determine if any keywords are in a string.
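
As a rough sketch of how the result might be used, the snippet below reports which keywords were found. The names txt and found are assumptions for illustration (txt is used rather than t to avoid shadowing base R's t function):

```r
map_keywords <- c("once", "twice", "few")
txt <- "yes but only a few times"

# vapply guarantees a logical vector with one element per keyword,
# named after the keywords themselves.
found <- vapply(map_keywords, function(k) grepl(k, txt, fixed = TRUE), logical(1))

matched <- names(found)[found]  # keywords present in the string: "few"
any(found)                      # TRUE if at least one keyword matched
```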