
How to find patterns across multiple lines using grep?

I want to find files that have "abc" AND "efg" in that order, and those two strings are on different lines in that file. Eg: a file with content:

blah blah..
blah blah..
blah abc blah
blah blah..
blah blah..
blah blah..
blah efg blah blah
blah blah..
blah blah..

Should be matched.

:) come to think of it .. in our world nothing stays the same over a period of time. So there may be better threads than this somewhere down the line

rogerdpack

Grep is an awkward tool for this operation.

pcregrep, which is available on most modern Linux systems, can be used as

pcregrep -M 'abc.*(\n|.)*efg' test.txt

where -M, --multiline allows patterns to match more than one line

There is a newer pcre2grep also. Both are provided by the PCRE project.

pcre2grep is available for Mac OS X via Mac Ports as part of port pcre2:

% sudo port install pcre2 

and via Homebrew as:

% brew install pcre

or for pcre2

% brew install pcre2

pcre2grep is also available on Linux (Ubuntu 18.04+)

$ sudo apt install pcre2-utils # PCRE2
$ sudo apt install pcregrep    # Older PCRE

@StevenLu -M, --multiline - Allow patterns to match more than one line.
Note that .*(\n|.)* is equivalent to (\n|.)* and the latter is shorter. Moreover on my system, "pcre_exec() error -8" occurs when I run the longer version. So try 'abc(\n|.)*efg' instead!
You need to make the expression non-greedy in that case example : 'abc.*(\n|.)*?efg'
and you can omit the first .* -> 'abc(\n|.)*?efg' to make the regex shorter (and to be pedantic)
pcregrep does make things easier, but grep will work too. For example, see stackoverflow.com/a/7167115/123695
timss

I'm not sure if it is possible with grep, but sed makes it very easy:

sed -e '/abc/,/efg/!d' [file-with-content]

This doesn't find files, it returns the matching part from a single file
@Lj. please can you explain this command? I'm familiar with sed, but if have never seen such an expression before.
@Anthony, It's documented in the man page of sed, under address. It's important to realise that /abc/ & /efg/ is an address.
I suspect this answer would've been helpful if it had a bit more explanation, and in that case, I would've up-voted it one more time. I know a bit of sed, but not enough to use this answer to produce a meaningful exit code after half an hour of fiddling. Tip: 'RTFM' rarely gets up-votes on StackOverflow, as your previous comment shows.
Quick explanation by example: sed '1,5d' deletes lines 1 through 5, and sed '1,5!d' deletes every line not in that range (i.e. keeps lines 1 through 5). Instead of a line number, either address can be a /pattern/ that selects the first line matching it. See also the simpler variant below: sed -n '/abc/,/efg/p', where p prints and the -n flag suppresses the default printing of every line.
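
For the exit-code question raised in the comments above, one possible sketch wraps the sed range in a shell function. The name abc_then_efg is made up for illustration, and the sed 1d step is an assumption about the wanted semantics: it drops the line that opened the range, so efg has to show up on a later line than abc.

```shell
# Hypothetical helper: succeeds if FILE has a line matching "abc"
# followed, on a later line, by a line matching "efg".
abc_then_efg() {
  # Print the abc..efg range(s), drop the opening abc line,
  # then test whether any remaining line contains efg.
  sed -n '/abc/,/efg/p' "$1" | sed 1d | grep -q 'efg'
}

# Usage: list the matching files among the arguments given.
for f in "$@"; do
  abc_then_efg "$f" && printf '%s\n' "$f"
done
```

The function's exit status is that of the final grep -q, so it can be used directly in an if or with &&.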
rogerdpack

Here is a solution inspired by this answer:

if 'abc' and 'efg' can be on the same line: grep -zl 'abc.*efg'

if 'abc' and 'efg' must be on different lines: grep -Pzl '(?s)abc.*\n.*efg'

Params:

-P Use perl compatible regular expressions (PCRE).

-z Treat the input as a set of lines, each terminated by a zero byte instead of a newline, i.e. grep treats the input as one big line. Note that if you don't use -l it will display matches followed by a NUL char, see comments.

-l list matching filenames only.

(?s) activates PCRE_DOTALL, which means that '.' matches any character, including a newline.


@syntaxerror No, I think it's just a lower-case l. AFAIK there is no number -1 option.
Seems you're right after all, maybe I had made a typo when testing. In any case sorry for laying a false trail.
This is excellent. I just have one question regarding this. If the -z options specifies grep to treat newlines as zero byte characters then why do we need the (?s) in the regex ? If it is already a non-newline character, shouldn't . be able to match it directly?
-z (aka --null-data) and (?s) are exactly what you need to match multi-line with a standard grep. People on MacOS, please leave comments about availability of -z or --null-data options on your systems!
-z definitely not available on MacOS
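
On the question of why (?s) is still needed with -z: -z only changes the record terminator from newline to NUL. The newlines are still in the data, and a PCRE dot does not match a newline unless DOTALL is on. A small check, assuming GNU grep built with -P support (the variable names are just for this sketch):

```shell
# Two lines of sample input; with -z grep sees them as one record.
sample='blah abc blah
blah efg blah'

# Without (?s): "." stops at the newline, so abc.*efg cannot span lines.
printf '%s\n' "$sample" | grep -Pzq 'abc.*efg' && plain=match || plain=nomatch

# With (?s): "." also matches the newline, so the record matches.
printf '%s\n' "$sample" | grep -Pzq '(?s)abc.*efg' && dotall=match || dotall=nomatch

printf 'without (?s): %s\nwith (?s): %s\n' "$plain" "$dotall"
```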
Kara

sed should suffice as poster LJ stated above,

instead of !d you can simply use p to print:

sed -n '/abc/,/efg/p' file

sage

I relied heavily on pcregrep, but with newer grep you do not need to install pcregrep for many of its features. Just use grep -P.

In the example of the OP's question, I think the following options work nicely, with the second best matching how I understand the question:

grep -Pzo "abc(.|\n)*efg" /tmp/tes*
grep -Pzl "abc(.|\n)*efg" /tmp/tes*

I copied the text as /tmp/test1 and deleted the 'g' and saved as /tmp/test2. Here is the output showing that the first shows the matched string and the second shows only the filename (typical -o is to show match and typical -l is to show only filename). Note that the 'z' is necessary for multiline and the '(.|\n)' means to match either 'anything other than newline' or 'newline' - i.e. anything:

user@host:~$ grep -Pzo "abc(.|\n)*efg" /tmp/tes*
/tmp/test1:abc blah
blah blah..
blah blah..
blah blah..
blah efg
user@host:~$ grep -Pzl "abc(.|\n)*efg" /tmp/tes*
/tmp/test1

To determine if your version is new enough, run man grep and see if something similar to this appears near the top:

   -P, --perl-regexp
          Interpret  PATTERN  as a Perl regular expression (PCRE, see
          below).  This is highly experimental and grep -P may warn of
          unimplemented features.

That is from GNU grep 2.10.


Gavin S. Yancey

This can be done easily by first using tr to replace the newlines with some other character:

tr '\n' '\a' | grep -o 'abc.*efg' | tr '\a' '\n'

Here, I am using the alarm character, \a (ASCII 7) in place of a newline. This is almost never found in your text, and grep can match it with a ., or match it specifically with \a.
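
A sketch of the same idea as a yes/no test that also requires the two strings to be on different lines. It assumes the file contains no BEL characters of its own, and it puts a literal BEL into the pattern through printf rather than relying on grep to interpret \a. The function name is hypothetical.

```shell
# Hypothetical helper: succeeds if "abc" is followed by "efg" on a later line.
abc_then_efg_tr() {
  bel=$(printf '\a')   # literal BEL byte to embed in the pattern
  # Join all lines with BEL, then require at least one BEL (a former
  # newline) between abc and efg.
  tr '\n' '\a' < "$1" | grep -q "abc.*${bel}.*efg"
}
```

Because the whole file becomes one line, this is better suited to small and medium files.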


This was my approach but I was using \0 and thus needed grep -a and matching on \x00… You have helped me simplify! echo $log | tr '\n' '\0' | grep -aoE "Error: .*?\x00Installing .*? has failed\!" | tr '\0' '\n' is now echo $log | tr '\n' '\a' | grep -oE "Error: .*?\aInstalling .*? has failed\!" | tr '\a' '\n'
Use grep -o .
fedorqui

awk one-liner:

awk '/abc/,/efg/' [file-with-content]

This will happily print from abc through to end of file if the ending pattern is not present in the file, or the last ending pattern is missing. You can fix that but it will complicate the script rather significantly.
How to exclude /efg/ from output?
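
One way to handle the unterminated-range case mentioned above is to buffer the lines and print them only once the closing pattern has been seen. This is a sketch under that assumption, with a made-up function name, not the only way to do it.

```shell
# Hypothetical helper: print each abc..efg block only once its
# closing efg line has actually been seen.
print_closed_blocks() {
  awk '
    !inblock && /abc/ { inblock = 1; buf = "" }          # a block opens here
    inblock           { buf = buf $0 "\n" }              # collect lines while inside
    inblock && /efg/  { printf "%s", buf; inblock = 0 }  # block closed: emit it
  ' "$1"
}
```

A trailing abc with no efg after it is simply discarded at end of input. To leave the efg line out of the output, the closing rule could be moved above the collecting rule and ended with next.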
agouge

If you are willing to use contexts, this could be achieved by typing

grep -A 500 abc test.txt | grep -B 500 efg

This will display everything between "abc" and "efg", as long as they are within 500 lines of each other.


Sundar R

You can do that very easily if you can use Perl.

perl -ne 'if (/abc/) { $abc = 1; next }; print "Found in $ARGV\n" if ($abc && /efg/);' yourfilename.txt

You can do that with a single regular expression too, but that involves taking the entire contents of the file into a single string, which might end up taking up too much memory with large files. For completeness, here is that method:

perl -e '@lines = <>; $content = join("", @lines); print "Found in $ARGV\n" if ($content =~ /abc.*efg/s);' yourfilename.txt

Found second answer was useful to extract a whole multi-line block with matches on a couple of lines - had to use non-greedy matching (.*?) to get minimal match.
Leo

I don't know how I would do that with grep, but I would do something like this with awk:

awk '/abc/{ln1=NR} /efg/{ln2=NR} END{if(ln1 && ln2 && ln1 < ln2){print "found"}else{print "not found"}}' foo

You need to be careful how you do this, though. Do you want the regex to match the substring or the entire word? add \w tags as appropriate. Also, while this strictly conforms to how you stated the example, it doesn't quite work when abc appears a second time after efg. If you want to handle that, add an if as appropriate in the /abc/ case etc.
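
A possible variant for the case where abc shows up again after efg: remember only the first abc line and stop at the first efg on a later line. The function name is hypothetical, and the next statement reflects the assumption that efg on the same line as that first abc should not count.

```shell
# Hypothetical helper: exit status 0 if some "efg" line comes after
# the first "abc" line, 1 otherwise.
efg_after_first_abc() {
  awk '
    !first && /abc/ { first = NR; next }   # remember only the first abc line
    first && /efg/  { found = 1; exit }    # efg on a later line: stop reading
    END             { exit !found }        # turn the flag into an exit status
  ' "$1"
}
```

Usage could look like: for f in *; do efg_after_first_abc "$f" && echo "$f"; done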


Mariano Ruiz

If you need both words to be close to each other, for example no more than 3 lines apart, you can do this:

find . -exec grep -Hn -C 3 "abc" {} \; | grep -C 3 "efg"

Same example but filtering only *.txt files:

find . -name '*.txt' -exec grep -Hn -C 3 "abc" {} \; | grep -C 3 "efg"

You can also replace the grep command with egrep if you want to search with extended regular expressions.


eyllanesc

I released a grep alternative a few days ago that does support this directly, either via multiline matching or using conditions - hopefully it is useful for some people searching here. This is what the commands for the example would look like:

Multiline:

sift -lm 'abc.*efg' testfile

Conditions:

sift -l 'abc' testfile --followed-by 'efg'

You could also specify that 'efg' has to follow 'abc' within a certain number of lines:

sift -l 'abc' testfile --followed-within 5:'efg'

You can find more information on sift-tool.org.


I don't think the first example sift -lm 'abc.*efg' testfile works, because the match is greedy and gobbles up all lines until the last efg in the file.
Kaleb Pederson

Sadly, you can't. From the grep docs:

grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a match to the given PATTERN.


what about grep -Pz
Yeah -z tricks it into thinking the file is all one big line, since it doesn't have any NUL chars that would be interpreted as line breaks...
rogerdpack

While the sed option is the simplest and easiest, LJ's one-liner is sadly not the most portable. Those stuck with a version of the C Shell (instead of bash) will need to escape their bangs:

sed -e '/abc/,/efg/\!d' [file]

Which line unfortunately does not work in bash et al.


rogerdpack

With silver searcher:

ag 'abc.*(\n|.)*efg' your_filename

similar to ring bearer's answer, but with ag instead. Speed advantages of silver searcher could possibly shine here.


This does not seem to work. (echo abctest; echo efg)|ag 'abc.*(\n|.)*efg' does not match
It does work with files, probably an ag bug? github.com/ggreer/the_silver_searcher/issues/1417
rogerdpack

Possible with ripgrep:

$ rg --multiline 'abc(\n|.)+?efg' test.txt
3:blah abc blah
4:blah abc blah
5:blah blah..
6:blah blah..
7:blah blah..
8:blah efg blah blah

Or some other incantations.

If you want . to match newlines as well:

$ rg --multiline '(?s)abc.+?efg' test.txt
3:blah abc blah
4:blah abc blah
5:blah blah..
6:blah blah..
7:blah blah..
8:blah efg blah blah

Or equivalent to having the (?s) would be rg --multiline --multiline-dotall

And to answer the original question, where they have to be on separate lines:

$ rg --multiline 'abc.*[\n](\n|.)*efg' test.txt

And if you want it "non greedy" so you don't just get the first abc with the last efg (separate them into pairs):

$ rg --multiline 'abc.*[\n](\n|.)*?efg' test.txt

https://til.hashrocket.com/posts/9zneks2cbv-multiline-matches-with-ripgrep-rg


ghostdog74
#!/bin/bash
shopt -s nullglob
for file in *
do
 r=$(awk '/abc/{f=1}/efg/{g=1;exit}END{print g&&f ?1:0}' "$file")
 if [ "$r" -eq 1 ];then
   echo "Found pattern in $file"
 else
   echo "not found"
 fi
done

Taryn

You can use grep in case you do not care about the order of the patterns.

grep -l "pattern1" filepattern*.* | xargs grep "pattern2"

example

grep -l "vector" *.cpp | xargs grep "map"

grep -l will find all the files that match the first pattern, and xargs will grep those files for the second pattern. Hope this helps.


That would ignore the order "pattern1" and "pattern2" appear in the file, though - OP specifically specifies that only files where "pattern2" appears AFTER "pattern1" should be matched.
eyllanesc

If you have some estimation about the distance between the 2 strings 'abc' and 'efg' you are looking for, you might use:

grep -r . -e 'abc' -A num1 -B num2 | grep 'efg'

That way, the first grep will return each line containing 'abc' plus num1 lines after it and num2 lines before it, and the second grep will sift through all of those to find 'efg'. Then you'll know in which files they appear together.


Dr. Alex RE

With ugrep released a few months ago:

ugrep 'abc(\n|.)+?efg'

This tool is highly optimized for speed. It's also GNU/BSD/PCRE-grep compatible.

Note that we should use a lazy repetition +?, unless you want to match all lines with efg together until the last efg in the file.


Bills itself as faster than ripgrep, nice!
rogerdpack

You have at least a couple options --

DOTALL method

use (?s) to DOTALL the . character to include \n

you can also use a lookahead (?=\n) -- won't be captured in match

example-text:

true
match me

false
match me one

false
match me two

true
match me three
third line!!
{BLANK_LINE}

command:

grep -Pozi '(?s)true.+?\n(?=\n)' example-text

-P for perl regular expressions

-o to only match pattern, not whole line

-z to allow line breaks

-i makes case-insensitive

output:

true                                                  
match me                                              
true                                                  
match me three                                        
third line!!

notes:

- +? makes the quantifier non-greedy, so it matches the shortest string instead of the longest (prevents returning one match containing the entire text)

you can use the oldschool O.G. manual method using \n

command:

grep -Pozi 'true(.|\n)+?\n(?=\n)'

output:

true                                                  
match me                                              
true                                                  
match me three                                        
third line!!

rogerdpack

I used this to extract a fasta sequence from a multi fasta file using the -P option for grep:

grep -Pzo ">tig00000034[^>]+"  file.fasta > desired_sequence.fasta

P for perl based searches

z for making a line end in 0 bytes rather than newline char

o to just capture what matched since grep returns the whole line (which in this case since you did -z is the whole file).

The core of the regexp is the [^>] which translates to "not the greater than symbol"
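
Since -z makes grep terminate each printed match with a NUL byte, the extracted file ends up with one too. A sketch of stripping it with tr, assuming GNU grep with -P support; the sample records and the tig2 name are made up for the demonstration.

```shell
# Tiny stand-in for a multi-FASTA file (made-up records).
fasta=$(mktemp)
printf '>tig1\nAAAA\n>tig2\nACGT\nGGCC\n>tig3\nTTTT\n' > "$fasta"

# -z makes grep end each printed match with a NUL byte; tr removes it
# so the extracted record is plain text.
record=$(grep -Pzo '>tig2[^>]+' "$fasta" | tr -d '\0')
printf '%s\n' "$record"
rm -f "$fasta"
```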


Emil Lundberg

As an alternative to Balu Mohan's answer, it is possible to enforce the order of the patterns using only grep, head and tail:

for f in FILEGLOB; do tail $f -n +$(grep -n "pattern1" $f | head -n1 | cut -d : -f 1) 2>/dev/null | grep "pattern2" &>/dev/null && echo $f; done

This one isn't very pretty, though. Formatted more readably:

for f in FILEGLOB; do
    tail $f -n +$(grep -n "pattern1" $f | head -n1 | cut -d : -f 1) 2>/dev/null \
    | grep -q "pattern2" \
    && echo $f
done

This will print the names of all files where "pattern2" appears after "pattern1", or where both appear on the same line:

$ echo "abc
def" > a.txt
$ echo "def
abc" > b.txt
$ echo "abcdef" > c.txt; echo "defabc" > d.txt
$ for f in *.txt; do tail $f -n +$(grep -n "abc" $f | head -n1 | cut -d : -f 1) 2>/dev/null | grep -q "def" && echo $f; done
a.txt
c.txt
d.txt

Explanation

tail -n +i - print all lines after the ith, inclusive

grep -n - prepend matching lines with their line numbers

head -n1 - print only the first row

cut -d : -f 1 - print the first cut column using : as the delimiter

2>/dev/null - silence tail error output that occurs if the $() expression returns empty

grep -q - silence grep and return immediately if a match is found, since we are only interested in the exit code
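
The pieces explained above can also be put into a small function with quoted variables, which copes with file names containing spaces. This is only a sketch; the function name is hypothetical, and the explicit empty check stands in for the 2>/dev/null used in the one-liner.

```shell
# Hypothetical helper: succeeds if pattern2 occurs on or after the first
# line that matches pattern1.
pattern2_after_pattern1() {
  file=$1; p1=$2; p2=$3
  # Line number of the first pattern1 match (empty if there is none).
  start=$(grep -n -e "$p1" "$file" | head -n 1 | cut -d : -f 1)
  [ -n "$start" ] || return 1
  # Search for pattern2 from that line to the end of the file.
  tail -n "+$start" "$file" | grep -q -e "$p2"
}
```

Usage could look like: for f in *.txt; do pattern2_after_pattern1 "$f" abc def && echo "$f"; done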


Can anyone please explain the &>? I'm using it too, but I never saw it documented anywhere. BTW, why do we have to silence grep that way, actually? grep -q won't do the trick as well?
&> tells bash to redirect both standard output and standard error, see REDIRECTION in the bash manual. You're very right in that we could just as well do grep -q ... instead of grep ... &>/dev/null, good catch!
Thought so. Will take away the pain of lots of awkward extra typing. Thanks for the explanation - so I must have skipped a bit in the manual. (Looked up something remotely related in it some time ago.)---You might even consider changing it in your answer.:)
bastelflp

This should work too?!

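perl -0777 -ne 'print "$ARGV\n" if /abc.*?efg/s' file_list

-0777 makes perl read each file as a single string, so the match can span lines (read line by line, the pattern would never see abc and efg together, and -p would also echo every input line). $ARGV contains the name of the current file when reading from file_list, and the /s modifier lets . match a newline.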


user unknown

The file pattern *.sh is important to prevent directories from being inspected. Of course, a test could prevent that too.

for f in *.sh
do
  a=$( grep -n -m1 abc $f )
  test -n "${a}" && z=$( grep -n efg $f | tail -n 1) || continue 
  (( ((${z/:*/}-${a/:*/})) > 0 )) && echo $f
done

The

grep -n -m1 abc $f 

searches for at most one match (-m1) and returns it prefixed with its line number (-n). If a match was found (test -n ...), find the last match of efg (find all and take the last with tail -n 1).

z=$( grep -n efg $f | tail -n 1)

else continue.

Since the result is something like 18:foofile.sh String alf="abc"; we need to cut away from ":" till end of line.

((${z/:*/}-${a/:*/}))

Should return a positive result if the last match of the 2nd expression is past the first match of the first.

Then we report the filename echo $f.


Mark Hartnady

To search recursively across all files (across multiple lines within each file) with BOTH strings present (i.e. string1 and string2 on different lines and both present in same file):

grep -r -l 'string1' * > tmp; while read p; do grep -l 'string2' $p; done < tmp; rm tmp 

To search recursively across all files (across multiple lines within each file) with EITHER string present (i.e. string1 and string2 on different lines and either present in same file):

grep -r -l 'string1\|string2' * 

Works on macOS with zsh and grep version "(BSD grep) 2.5.1-FreeBSD"
rogerdpack

Here's a way by using two greps in a row:

egrep -o 'abc|efg' $file | grep -A1 abc | grep efg | wc -l

returns 0 or a positive integer.

egrep -o (Only shows matches, trick: multiple matches on the same line produce multi-line output as if they are on different lines)

grep -A1 abc (print abc and the line after it)

grep efg | wc -l (0-n count of efg lines found after abc on the same or following lines, result can be used in an 'if")

grep can be changed to egrep etc. if pattern matching is needed
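
For use in an if, the count is not really needed. A sketch of the same pipeline with grep -q at the end, written with grep -E since newer GNU grep releases flag egrep as obsolescent. The function name is hypothetical, and like the pipeline above it also accepts abc and efg on the same line.

```shell
# Hypothetical helper built on the pipeline above, using grep -q
# instead of counting with wc -l.
abc_then_efg_o() {
  # One match per output line, then look for efg right after an abc.
  grep -oE 'abc|efg' "$1" | grep -A1 'abc' | grep -q 'efg'
}
```

Usage could look like: if abc_then_efg_o "$file"; then echo "$file"; fi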


arghtype

This should work:

cat FILE | egrep 'abc|efg'

If there is more than one match you can filter out using grep -v


Whilst this code snippet is welcome, and may provide some help, it would be greatly improved if it included an explanation of how and why this solves the problem. Remember that you are answering the question for readers in the future, not just the person asking now! Please edit your answer to add explanation, and give an indication of what limitations and assumptions apply.
That doesn't actually search across multiple lines, as stated in the question.