How to use sed/grep to extract text between two words?

string bash sed grep

I am trying to output a string that contains everything between two words of a string:

input:

"Here is a String"

output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.

What should be the result if the input is Here is a Here String? Or I Hereby Dub Thee Sir Stringy?

FYI. Your command means to print everything between the line that has the word Here and the line that has the word String -- not what you want.
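A small sketch of the line-range behaviour described in the comment above. The three-line sample input and the helper name show_range are made up for illustration.

```shell
# Hypothetical sample input: three lines, only the middle one holds both words.
show_range() {
  printf '%s\n' 'before' 'Here is a String' 'after' |
    sed -n '/Here/,/String/p'
}

show_range
# The whole matching line is printed, not the text between the two words.
# sed only starts looking for the end pattern on the line AFTER the start
# line, so the range stays open here and "after" is printed as well.
```

In other words, the /start/,/end/ address selects lines, so it cannot trim text within a line.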

The other common sed FAQ is "how can I extract text between particular lines"; this is stackoverflow.com/questions/16643288/…

anishsane

GNU grep also supports positive and negative look-ahead and look-behind. For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

If there are multiple occurrences of Here and string, you can choose whether to match from the first Here to the last string, or to match each pair individually. In regex terms, these are called a greedy match (first case) and a non-greedy match (second case).

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
 is a string, and Here is another 
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
 is a 
 is another

Note that GNU grep's -P option does not exist in the grep included in *BSD, or in the ones that come with any SVR4 (Solaris, etc.). In FreeBSD, you can install the devel/pcre port, which includes pcregrep, which supports PCRE (and look-ahead/look-behind). Older versions of OS X used GNU grep, but as of OS X Mavericks, grep is derived from FreeBSD's version, which does not include the option.

Hi, How do I extract distinct content only ?

This doesn't work because if your ending string "string" occurs more than once, it will get the last occurrence, not the next occurrence.

In case of Here is a string a string, both " is a " and " is a string a " are valid answers (ignore the quotes), as per the question requirements. It depends on you which one of these you want and then answer can be different accordingly. Anyway, for your requirement, this will work: echo "Here is a string a string" | grep -o -P '(?<=Here).*?(?=string)'

@BND, you need to enable the multi-line search feature of pcregrep. echo $'Here is \na string' | grep -zoP '(?<=Here)(?s).*(?=string)'

Brian Campbell

sed -e 's/Here\(.*\)String/\1/'

Thanks! What if I wanted to find everything between "one is" and "String" in "Here is a one is a String"? (sed -e 's/one is\(.*\)String/\1/'?)

@user1190650 That would work if you want to see the "Here is a" as well. You can test it out: echo "Here is a one is a String" | sed -e 's/one is\(.*\)String/\1/'. If you just want the part between "one is" and "String", then you need to make the regex match the whole line: sed -e 's/.*one is\(.*\)String.*/\1/'. In sed, s/pattern/replacement/ says "substitute 'replacement' for 'pattern' on each line". It only changes the part of the line that matches "pattern", so if you want it to replace the whole line, you need to make "pattern" match the whole line.
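To make the difference concrete, here is a minimal sketch that runs both substitutions on the sample line from the comment. The variable names are illustrative only.

```shell
line='Here is a one is a String'

# The pattern covers only part of the line, so the unmatched prefix is kept.
partial=$(printf '%s\n' "$line" | sed -e 's/one is\(.*\)String/\1/')

# The pattern covers the whole line, so only the captured group remains.
whole=$(printf '%s\n' "$line" | sed -e 's/.*one is\(.*\)String.*/\1/')

# Brackets make the leading and trailing spaces visible.
printf '[%s]\n[%s]\n' "$partial" "$whole"
```

The first result still starts with "Here is a", while the second contains only " a ".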

This breaks when the input is Here is a String Here is a String

Would be great to see the solution for a case : "Here is a blah blah String Here is 1 a blah blah String Here is 2 a blash blash String" output should pick up only the first substring between Here and String"

@JayD sed does not support non-greedy matching, see this question for some recommended alternatives.

wheeler

The accepted answer does not remove text that could be before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and after String.

Your answer is promising. One issue though. How can I extract it to the first seen String if there are multiple String in the same line? Thanks

@MianAsbatAhmad You would want to make the * quantifier between Here and String non-greedy (or lazy). However, the type of regex used by sed does not support lazy quantifiers (a ? immediately after .*), according to this Stack Overflow question. Usually, to emulate a lazy quantifier, you would match against everything except the token you don't want to match, but in this case the token is not a single character; it's a whole string, String.
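One possible workaround, sketched here under the assumption that GNU sed is available (it relies on \n in the replacement text): mark the first Here with a newline, which cannot otherwise occur inside a single input line, and then cut at the first String that follows. The helper name first_between is hypothetical.

```shell
first_between() {
  sed -n '/Here/{
    # Replace only the first "Here" with a newline marker (GNU sed syntax).
    s/Here/\n/
    # Drop everything up to and including the marker.
    s/.*\n//
    # The leftmost "String" is now the first one after "Here": cut from there.
    s/String.*//
    p
  }'
}

echo "Here is a blah blah String Here is 1 a blah blah String" | first_between
```

If a line contains Here but no String after it, this sketch prints the rest of the line; whether that is acceptable depends on your input.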

Thanks, I got the answer using awk, stackoverflow.com/questions/51041463/…

Unfortunately this doesn't work if the string has line breaks

It's not supposed to. The . doesn't match line breaks. If you want to match line breaks, you can replace . with something like [\s\S].
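With sed itself, another option is to collect all input lines into one buffer before substituting. The sketch below assumes GNU sed, where . also matches the newlines embedded in the buffer, and it holds the entire input in memory. The helper name between_multiline is hypothetical.

```shell
between_multiline() {
  # 1h / 1!H gather every line in the hold space; on the last line ($)
  # the whole text is fetched with g and the substitution runs once.
  sed -n '
    1h
    1!H
    ${
      g
      s/.*Here\(.*\)String.*/\1/
      p
    }
  '
}

printf 'Here is\na String\n' | between_multiline
```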

ghoti

You can strip strings in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$

And if you have a GNU grep that includes PCRE, you can use a zero-width assertion:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a
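For the pure-Bash approach above, the choice between # and ## matters once the start word appears more than once. A minimal sketch with a made-up input string:

```shell
foo="Here is a String Here is another String"

# "#" removes the shortest matching prefix, so it stops at the first "Here ".
first=${foo#*Here }
first=${first%% String*}

# "##" removes the longest matching prefix, so it skips to the last "Here ".
last=${foo##*Here }
last=${last%% String*}

printf '%s\n%s\n' "$first" "$last"
```

In both cases %% removes the longest matching suffix, which cuts at the first remaining " String".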

why is this method so slow? when stripping a large html page using this method it takes like 10 seconds.

@AdamJohns, which method? The PCRE one? PCRE is fairly complex to parse, but 10 seconds seems extreme. If you're concerned, I recommend you pose a question including example code, and see what the experts say.

I think it was so slow for me because it was holding a very large html file's source in a variable. When I wrote contents to file and then parsed the file the speed dramatically increased.

Should be the accepted answer, because it uses pure Bash.

Juve

If you have a long file with many multi-line occurrences, it is useful to print line numbers first:

cat -n file | sed -n '/Here/,/String/p'

Thanks! This is the only solution which worked in my case (multiple line text file, rather than a single string with no line breaks). Obviously, to have it without line numbering, the -n option in cat must be omitted.

... in which case cat can be entirely omitted; sed knows how to read a file or standard input.
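A sketch of the cat-free form, using a temporary file with made-up contents. The second command is an assumption about what the original question wanted for line-based markers: it prints the same range but skips the two marker lines.

```shell
tmp=$(mktemp)
printf '%s\n' 'intro' 'Here' 'first' 'second' 'String' 'outro' > "$tmp"

# sed reads the file itself, so no cat is needed.
with_endpoints=$(sed -n '/Here/,/String/p' "$tmp")

# Same range, but delete the marker lines before printing.
without_endpoints=$(sed -n '/Here/,/String/{ /Here/d; /String/d; p; }' "$tmp")

printf '%s\n---\n%s\n' "$with_endpoints" "$without_endpoints"
rm -f "$tmp"
```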

Avinash Raj

Through GNU awk,

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a

grep with the -P (--perl-regexp) option supports \K, which discards the characters matched so far. In our case, the previously matched string was Here, so it is dropped from the final output.

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a

If you want the output to be "is a" without the surrounding spaces, you could try the following:

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

@Avinash Raj This does not work for echo "Here is a string dfdsf Here is a string" | awk -v FS="(Here|string)" '{print $2}'. It returns only " is a", when the result should be " is a" twice.
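Printing only $2 can never return more than one piece. A possible alternative, sketched with plain POSIX awk string functions instead of field splitting, loops over every start/end pair on the line. The helper name all_between is hypothetical, and the markers are treated as literal strings, not regexes.

```shell
all_between() {
  awk -v start="$1" -v end="$2" '{
    rest = $0
    while ((i = index(rest, start)) > 0) {
      rest = substr(rest, i + length(start))   # text after this start marker
      j = index(rest, end)
      if (j == 0) break                        # no closing marker: stop
      print substr(rest, 1, j - 1)             # text before the next end marker
      rest = substr(rest, j + length(end))     # continue after the end marker
    }
  }'
}

echo "Here is a string dfdsf Here is a string" | all_between Here string
```

Each match is printed on its own line, so the sample input yields " is a " twice.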

Ivan

You can use two s commands

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a

Also works

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

potong

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file

This prints each occurrence of text between the two markers (in this instance Here and String) on a new line and preserves newlines within the text.

Anonymous

To understand sed command, we have to build it step by step.

Here is your original text

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$

Let's try to remove the string Here with the substitute (s) command in sed

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$

At this point, I believe you would be able to remove String as well

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$

But this is not your desired output.

To combine two sed commands, use the -e option

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$

Hope this helps

Gary Dean

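All the above solutions fall short when the end marker is repeated elsewhere in the string. I found it best to write a Bash function.

    # Print the text in $1 between the first occurrence of $2
    # and the next occurrence of $3 after it.
    function str_str {
      local str
      str="${1#*"$2"}"      # drop everything up to and including the first $2
      str="${str%%"$3"*}"   # drop everything from the first remaining $3 onward
      printf '%s' "$str"
    }

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"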

Peter Mortensen

You can use \1 (refer to http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

The content matched inside the escaped parentheses \( \) is stored as \1.

This removes the marker strings rather than outputting what lies between them. Try replacing "Hello" with "is" in the sed command and it will output "Hello a".

Victoria Stuart

Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject lines:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Per A2 in this thread (How to use sed/grep to extract text between two words?), the first expression below "works" as long as the matched text does not contain a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per Extract text between two strings on different lines

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

Solution 2.

Per How can I replace a newline (\n) using sed?

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace newlines with a space.

Chaining that with A2 in How to use sed/grep to extract text between two words?, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]]

This variant removes double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
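As a different technique from the sed and grep pipelines above, the same unfolding can be sketched in awk. This assumes, as in the sample message, that continuation lines start with whitespace and that the header ends at the first line that does not. The helper name unfold_subject and the short test message are hypothetical.

```shell
unfold_subject() {
  awk '
    /^Subject: / { sub(/^Subject: /, ""); subj = $0; in_subj = 1; next }
    in_subj && /^[ \t]/ { sub(/^[ \t]+/, " "); subj = subj $0; next }
    in_subj { print subj; in_subj = 0 }
    END { if (in_subj) print subj }
  '
}

printf '%s\n' 'Subject: Key molecular' ' link in major' ' pathway' \
  'Message-ID: <id@example.com>' | unfold_subject
```

For the real file, the call would be unfold_subject < corpus/01. Each continuation line's leading whitespace is collapsed to a single space, so no separate step is needed to remove double spaces.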

nice adventure :))

kenorb

ripgrep

Here is the example using rg:

$ echo Here is a String | rg 'Here\s(.*)\sString' -r '$1'
is a
