How can I match "anything up until this sequence of characters" in a regular expression?

regex

Take this regular expression: /^[^abc]/. This will match any single character at the beginning of a string, except a, b, or c.

If you add a * after it – /^[^abc]*/ – the regular expression will continue to add each subsequent character to the result, until it meets either an a, or b, or c.

For example, with the source string "qwerty qwerty whatever abc hello", the expression will match up to "qwerty qwerty wh".
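A quick way to check the behaviour described above is a small Python sketch (Python's re module is an assumption here, since the question names no flavor; the variable names are illustrative):

```python
import re

source = "qwerty qwerty whatever abc hello"

# [^abc] is a character class: it rejects 'a', 'b', or 'c' individually,
# so the match stops at the first 'a' (inside "whatever").
match = re.match(r"^[^abc]*", source)
prefix = match.group(0)
print(repr(prefix))  # 'qwerty qwerty wh'
```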

But what if I wanted the matching string to be "qwerty qwerty whatever "?

In other words, how can I match everything up to (but not including) the exact sequence "abc"?

What do you mean by match but not including ?

I mean I want to match "qwerty qwerty whatever " – not including the "abc". In other words, I don't want the resulting match to be "qwerty qwerty whatever abc".

In JavaScript you can just do string.split('abc')[0]. Certainly not an official answer to this problem, but I find it more straightforward than regex.
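The comment above is about JavaScript; a rough Python equivalent, using only built-in string methods, might look like this (a sketch, not the commenter's code):

```python
source = "qwerty qwerty whatever abc hello"

# split with maxsplit=1 stops after the first "abc"
head_from_split = source.split("abc", 1)[0]

# partition returns (before, separator, after); if "abc" is absent,
# it returns the whole string followed by two empty strings
head, sep, tail = source.partition("abc")

print(repr(head_from_split))  # 'qwerty qwerty whatever '
print(repr(head), repr(sep), repr(tail))
```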

sidyll

You didn't specify which flavor of regex you're using, but this will work in any of the most popular ones that can be considered "complete".

/.+?(?=abc)/

How it works

The .+? part is the non-greedy (lazy) version of .+ (one or more of any character). With .+, the engine first consumes as much as it can. Then, if something else follows in the regex, it backs off one character at a time until the following part matches. This is greedy behavior: match as much as possible while still satisfying the pattern.

With .+?, instead of consuming everything and then backing off, the engine starts with the minimum and extends the match one character at a time until the subsequent part of the regex (if any) matches. This is non-greedy behavior: match as few characters as possible while still satisfying the pattern.

/.+X/  ~ "abcXabcXabcX"        /.+/  ~ "abcXabcXabcX"
          ^^^^^^^^^^^^                  ^^^^^^^^^^^^

/.+?X/ ~ "abcXabcXabcX"        /.+?/ ~ "abcXabcXabcX"
          ^^^^                          ^
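The four cases in the diagram can be reproduced with a short Python sketch (assuming Python's re module; any flavor with lazy quantifiers should behave the same way for these patterns):

```python
import re

text = "abcXabcXabcX"

# Greedy: consume everything, then back off to the last X.
greedy_x = re.match(r".+X", text).group(0)
# Lazy: extend one character at a time until the first X.
lazy_x = re.match(r".+?X", text).group(0)
# With nothing after the quantifier, greedy takes the whole string
# and lazy takes the minimum of one character.
greedy_alone = re.match(r".+", text).group(0)
lazy_alone = re.match(r".+?", text).group(0)

print(greedy_x)      # abcXabcXabcX
print(lazy_x)        # abcX
print(greedy_alone)  # abcXabcXabcX
print(lazy_alone)    # a
```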

Following that we have (?={contents}), a zero-width assertion known as a lookahead (one kind of lookaround). This grouped construction checks that its contents match at the current position, but the characters it checks are not counted as part of the match (zero width). It only reports whether there is a match or not (assertion).

Thus, in other terms the regex /.+?(?=abc)/ means:

Match as few characters as possible until "abc" is found, without including the "abc" in the match.
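Applied to the string from the question, this might be sketched in Python as follows (Python's re module is an assumption; the answer itself is flavor-agnostic):

```python
import re

source = "qwerty qwerty whatever abc hello"

m = re.search(r".+?(?=abc)", source)
before = m.group(0)

# The lookahead is zero-width: "abc" is checked but not consumed,
# so the rest of the string still starts with it.
rest = source[m.end():]

print(repr(before))  # 'qwerty qwerty whatever '
print(repr(rest))    # 'abc hello'
```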

In most flavors . does not match line breaks by default, so this will probably not work as-is if line breaks are supposed to be part of the match.

What's the difference between .+? and .*?

@robbie0630 + means 1 or more, where * means 0 or more. The inclusion/exclusion of the ? will make it greedy or non-greedy.

@testerjoe2 /.+?(?=abc|xyz)/

I have noticed that this fails to select anything if the pattern you're looking for does not exist. If you instead use ^(?:(?!abc)(?!def).)* you can chain lookaheads to exclude the patterns you don't want, and it will still grab everything as needed even if the pattern does not exist.
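The difference this comment describes can be seen in a small Python sketch (the sample strings are made up for illustration):

```python
import re

lookahead = re.compile(r".+?(?=abc)")
# At every position, refuse to advance if "abc" or "def" starts here.
tempered = re.compile(r"^(?:(?!abc)(?!def).)*")

with_abc = "qwerty abc hello"
with_def = "one def two"
without = "qwerty hello"

no_match = lookahead.search(without)          # None: "abc" never appears
whole = tempered.search(without).group(0)     # the entire string
up_to_abc = tempered.search(with_abc).group(0)
up_to_def = tempered.search(with_def).group(0)

print(no_match)         # None
print(repr(whole))      # 'qwerty hello'
print(repr(up_to_abc))  # 'qwerty '
print(repr(up_to_def))  # 'one '
```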

Jared Ng

If you're looking to capture everything up to "abc":

/^(.*?)abc/

Explanation:

( ) capture the expression inside the parentheses for access using $1, $2, etc.

^ match start of line

.* match anything, ? non-greedily (match the minimum number of characters required) - [1]

[1] This is needed because regexes are greedy by default, meaning they match as much as possible. Take the following string:

whatever whatever something abc something abc

Here /^(.*)abc/ would capture "whatever whatever something abc something ". Adding the non-greedy modifier ? makes the group capture only "whatever whatever something ".
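A minimal Python sketch of the capture-group approach, comparing the greedy and non-greedy versions on the example string (Python's re module is an assumption; group(1) plays the role of $1):

```python
import re

line = "whatever whatever something abc something abc"

lazy = re.match(r"^(.*?)abc", line)
greedy = re.match(r"^(.*)abc", line)

# group(0) is the whole match (includes "abc"); group(1) is the capture.
print(repr(lazy.group(0)))    # 'whatever whatever something abc'
print(repr(lazy.group(1)))    # 'whatever whatever something '
print(repr(greedy.group(1)))  # 'whatever whatever something abc something '
```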

Thanks, but your one does include the abc in the match. In other words the resulting match is "whatever whatever something abc".

Could you explain what you're ultimately trying to do? If your scenario is: (A) You want to get everything leading up to "abc" -- just use parentheses around what you want to capture. (B) You want to match the string up to the "abc" -- you have to check the abc anyway, so it needs to be part of the regex regardless. How else can you check that it's there?

sed doesn't seem to support non-greedy matching, nor does it support look-around ((?=...)). What else can I do? Example command: echo "ONE: two,three, FOUR FIVE, six,seven" | sed -n -r "s/^ONE: (.+?), .*/\1/p" returns two,three, FOUR FIVE, but I expect two,three...

@CoDEmanX You should probably post that as your own separate question rather than a comment, especially since it's specifically about sed. That being said, to address your question: you may want to look at the answers to this question. Also note that in your example, a non-greedy aware interpreter would return just two, not two,three.

This is how EVERY regexp answer should look - example and explanation of all parts...

Peter Mortensen

As Jared Ng and @Issun pointed out, the key to solving this kind of problem, such as "matching everything up to a certain word or substring" or "matching everything after a certain word or substring", is the use of "lookaround" zero-length assertions. Read more about them here.

In your particular case, it can be solved by a positive look ahead: .+?(?=abc)

A picture is worth a thousand words. See the detailed explanation in the screenshot.

https://i.stack.imgur.com/cxm8d.png

.+?(?=abc) copy-pastable regex is worth more.

What about excluding leading spaces?

shareable link also is worth more than screenshot, just kidding, thanks for the answer

Who is "Issun"? What answer does it refer to?

Issun's account no longer exists. But they are referring to "look around" - see the links I provided in the answer.

Paul Masri-Stone

Solution

/[\s\S]*?(?=abc)/

This will match

everything up to (but not including) the exact sequence "abc"

as the OP asked, even if the source string contains newlines and even if the source string begins with abc. No extra flag is needed for the newlines: [\s\S] matches them regardless of flags, and the multiline flag m only changes how ^ and $ behave.

How it works

\s means any whitespace character (e.g. space, tab, newline)

\S means any non-whitespace character; i.e. opposite to \s

Together [\s\S] means any character. This is almost the same as . except that . doesn't match newline.

* means 0+ occurrences of the preceding token. I've used this instead of + in case the source string starts with abc.

(?= is known as positive lookahead. It requires a match to the string in the parentheses, but stops just before it, so (?=abc) means "up to but not including abc, but abc must be present in the source string".

? between [\s\S]* and (?=abc) means lazy (aka non greedy). i.e. stop at the first abc. Without this it would capture every character up to the final occurrence of abc if abc occurred more than once.
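The two edge cases mentioned here (newlines before abc, and a source string that starts with abc) can be checked with a short Python sketch. No flags are passed, and the sample strings are invented for illustration:

```python
import re

pattern = re.compile(r"[\s\S]*?(?=abc)")

multi = "first line\nsecond abc third"
starts_with_abc = "abc then more"

across_lines = pattern.match(multi).group(0)
empty_prefix = pattern.match(starts_with_abc).group(0)

# For comparison: . cannot cross the newline, so a search with .+?
# only finds the part of the second line before "abc".
dot_version = re.search(r".+?(?=abc)", multi).group(0)

print(repr(across_lines))  # 'first line\nsecond '
print(repr(empty_prefix))  # ''
print(repr(dot_version))   # 'second '
```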

Works like a charm! This should be the accepted answer.

Peter Mortensen

You need a lookaround assertion, like .+?(?=abc).

See: Lookahead and Lookbehind Zero-Length Assertions

Be aware that [abc] isn't the same as abc. Inside brackets it's not a string - each character is just one of the possibilities. Outside the brackets it becomes the string.

Peter Mortensen

For regex in Java, and I believe also in most regex engines, if you want to include the last part this will work:

.+?(abc)

For example, in this line:

I have this very nice senabctence

Select all characters until "abc" and also include abc.

Using our regex, the result will be: I have this very nice senabc

Test this out: https://regex101.com/r/mX51ru/1
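The answer above is about Java, but this particular pattern behaves the same way in Python's re module, so a quick sketch there may be useful:

```python
import re

line = "I have this very nice senabctence"

m = re.search(r".+?(abc)", line)
whole = m.group(0)   # everything up to and including "abc"
marker = m.group(1)  # just the captured "abc"

print(whole)   # I have this very nice senabc
print(marker)  # abc
```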

Peter Mortensen

In Python:

.+?(?=abc) works for the single line case.

[^]+?(?=abc) does not work, since Python doesn't recognize [^] as a valid regex. To make multiline matching work, you'll need to use the re.DOTALL option, for example:

re.findall('.+?(?=abc)', data, re.DOTALL)
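A slightly fuller sketch of the same call, with made-up sample data, showing what changes when re.DOTALL is added:

```python
import re

data = "line one\nline two abc tail"

# Without DOTALL, . stops at the newline, so only the second line's
# prefix is found.
single_line = re.findall(r".+?(?=abc)", data)

# With DOTALL, . also matches "\n", so the match starts at the beginning.
multi_line = re.findall(r".+?(?=abc)", data, re.DOTALL)

print(single_line)  # ['line two ']
print(multi_line)   # ['line one\nline two ']
```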

Peter Mortensen

So I had to improvise... after some time I managed to reach the regex I needed:

https://i.stack.imgur.com/jgsdL.png

As you can see, I needed to match up to one folder beyond the "grp-bps" folder, without including the last slash. It was also required to have at least one folder after the "grp-bps" folder.

The text version for copy-paste (change 'grp-bps' for your text):

.*\/grp-bps\/[^\/]+

I ended up at this Stack Overflow question while looking for help with my problem, but I didn't find any solution to it :(

No text version? 🙄

Peter Mortensen

This may help make sense of the regex.

The exact word can be obtained with the following regex:

/("(.*?)")/g

Here, the g flag lets us find, globally, every piece of text enclosed in double quotes.

For example, if our search text is

This is the example for "double quoted" words

then we will get "double quoted" from that sentence.

Welcome to StackOverflow and thanks for your attempt to help. I find it however hard to see how this helps the goal stated in the question. Can you elaborate? Can you apply it to the given examples? You seem to focus on handling of ", which to me seems irrelevant for the question.

Hi, I have explained how to get the word or sentences in between the special characters. Here our question is also "anything until the sequence of special characters". so I tried with double quotes and explained it here. Thanks.

Peter Mortensen

I would like to extend the answer from sidyll for the case insensitive version of the regex.

If you want to match abc/Abc/ABC... case insensitively, which I needed to do, use the following regex.

.+?(?=(?i)abc)

Explanation:

(?i) - This makes the following abc match case-insensitively. Support for this inline modifier varies by regex flavor.

The rest of the regex works the same way as sidyll explained.
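A flavor note with a sketch: Java and PCRE accept (?i) in the middle of a pattern, but recent versions of Python's re module reject a global flag that is not at the start. In Python the scoped form (?i:...) (available since Python 3.6) expresses the same idea; the sample strings below are made up:

```python
import re

# (?i:abc) makes only "abc" case-insensitive; the rest stays case-sensitive.
pattern = re.compile(r".+?(?=(?i:abc))")

upper = pattern.search("some text ABC rest").group(0)
mixed = pattern.search("Some Text aBc rest").group(0)

print(repr(upper))  # 'some text '
print(repr(mixed))  # 'Some Text '
```

Passing re.IGNORECASE instead would make the whole pattern case-insensitive, which makes no difference here because .+? matches any character anyway.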

proseosoc

Match From Start Till "Before ABC" or "Line End" if no ABC

(1) Matches whole string if string does not contain ABC anywhere

(2) Does not match empty string

(Not checked for strings with line breaks)

^.+?(?=ABC|$)
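The two stated properties, plus one edge case worth knowing about, can be checked with a Python sketch (single-line strings only, as in the answer; the samples are invented):

```python
import re

pattern = re.compile(r"^.+?(?=ABC|$)")

before_abc = pattern.search("hello ABC world").group(0)
no_abc = pattern.search("hello world").group(0)
empty = pattern.search("")

# Edge case: .+? must consume at least one character, so a string that
# starts with ABC (and has no later ABC) is matched to the end of line.
leading_abc = pattern.search("ABC first").group(0)

print(repr(before_abc))   # 'hello '
print(repr(no_abc))       # 'hello world'
print(empty)              # None
print(repr(leading_abc))  # 'ABC first'
```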

Peter Mortensen

I believe you need subexpressions. You can use the normal () brackets for subexpressions.

This part is from the grep manual:

Back References and Subexpressions The back-reference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized subexpression of the regular expression.

Doing something like ^[^(abc)] should do the trick.

Sorry, that doesn't work. Putting the abc in parentheses doesn't seem to make any difference. They are still treated as "a OR b OR c".

[^...] means "not any of the characters within the square brackets", rather than "not the following token", so this doesn't do the trick.

Peter Mortensen

The $ marks the end of a string, so something like this should work: [[^abc]*]$ where you're looking for anything not ending in any iteration of abc, but it would have to be at the end

Also, if you're using a scripting language with regex support (like PHP or JavaScript), it has a search function that stops when it first encounters a pattern (and you can specify whether to start from the left or from the right, or, with PHP, you can do an implode to mirror the string).

Peter Mortensen

Try this:

.+?efg

Query:

select REGEXP_REPLACE ('abcdefghijklmn','.+?efg', '') FROM dual;

Output:

hijklmn
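For readers without an Oracle instance, the same substitution can be sketched in Python. Here re.sub stands in for REGEXP_REPLACE and is not the answer's own code:

```python
import re

# Remove everything up to and including the first "efg".
result = re.sub(r".+?efg", "", "abcdefghijklmn")

print(result)  # hijklmn
```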
