How to use sed/grep to extract text between two words?

string bash sed grep

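I am trying to output a string that contains everything between two words of a string:

Input:

"Here is a String"

Output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.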

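What should the result be if the input is Here is a Here String? Or I Hereby Dub Thee Sir Stringy?

FYI: your command means "print everything between the line that contains the word Here and the line that contains the word String", which is not what you want.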
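To make the point about line ranges concrete, here is a minimal sketch. The five sample lines and the function name range_demo are invented for illustration; only the sed command comes from the question.

```shell
# Hypothetical demo input: five lines, with the markers on lines 2 and 4.
range_demo() {
  printf '%s\n' 'before' 'Here is' 'a' 'String here' 'after' |
    sed -n '/Here/,/String/p'
}

# Prints the three lines "Here is", "a" and "String here":
# whole lines, endpoints included.
range_demo
```

With a single-line input such as "Here is a String", the range opens on that line and never finds a later closing line, so the whole line is printed unchanged.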

Another common sed FAQ is "how to extract text between particular lines"; this is stackoverflow.com/questions/16643288/…

anishsane

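GNU grep also supports positive and negative look-ahead and look-behind. For your case, the command would be:

echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

If Here and string occur more than once, you can choose whether to match from the first Here to the last string, or to match each pair individually. In regex terms, these are called a greedy match (first case) and a non-greedy match (second case).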

$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
 is a string, and Here is another 
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*)
 is a 
 is another

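Note that GNU grep's -P option does not exist in the grep included in *BSD, or in the ones that come with any SVR4 (Solaris, etc.). In FreeBSD, you can install the devel/pcre port, which includes pcregrep; it supports PCRE (and look-ahead/look-behind). Older versions of OSX used GNU grep, but in OSX Mavericks, grep is derived from FreeBSD's version, which does not include the -P option.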
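Since -P is not available everywhere, one option is to probe for it and fall back to something portable. The sketch below is only an illustration: the function name between_lazy is made up, and the fallback swaps in plain awk string functions (index and substr) instead of a regex, so it is not the same technique as the grep answer, just a way to get the same non-greedy spans.

```shell
# Hypothetical helper: print each shortest span between "Here" and "string",
# reading lines from standard input.
between_lazy() {
  if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    # This grep understands -P, so use the look-around form from the answer.
    grep -oP '(?<=Here).*?(?=string)'
  else
    # Fallback without PCRE: walk each line with index() and substr().
    awk '{
      s = $0
      while ((i = index(s, "Here")) > 0) {
        s = substr(s, i + 4)                 # drop text through "Here"
        j = index(s, "string")
        if (j == 0) break                    # no closing marker left
        if (j > 1) print substr(s, 1, j - 1) # skip empty spans, like grep -o
        s = substr(s, j + 6)                 # continue after "string"
      }
    }'
  fi
}

printf '%s\n' 'Here is a string, and Here is another string.' | between_lazy
```

Both branches print " is a " and " is another " on separate lines for this input.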

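Hi, how do I extract only the distinct contents?

This doesn't work: if your ending string "string" occurs more than once, it takes the last occurrence, not the next occurrence.

For Here is a string a string, both " is a " and " is a string a " are valid answers (ignoring the quotes) according to the question's requirements. It depends on which one you want, and the answer may differ accordingly. Anyway, for your requirement, this will work: echo "Here is a string a string" | grep -o -P '(?<=Here).*?(?=string)'

@BND, you need to enable the multi-line search feature of pcregrep. echo $'Here is \na string' | grep -zoP '(?<=Here)(?s).*(?=string)'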

Brian Campbell

sed -e 's/Here\(.*\)String/\1/'

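Thanks! What if I want to find everything between "one is" and "String" in "Here is a one is a String"? (sed -e 's/one is\(.*\)String/\1/' ?

@user1190650 That would work if you want to see the "Here is a" as well. You can test it out: echo "Here is a one is a String" | sed -e 's/one is\(.*\)String/\1/'. If you want only the part between "one is" and "String", then you need to make the regex match the whole line: sed -e 's/.*one is\(.*\)String.*/\1/'. In sed, s/pattern/replacement/ says "substitute 'replacement' for 'pattern' on each line". It changes only what matches "pattern", so if you want it to replace the whole line, you need to make "pattern" match the whole line.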
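The difference between the two substitutions can be checked side by side. This is a small sketch using the sample line from the comment; the variable names are arbitrary.

```shell
line='Here is a one is a String'

# The pattern covers only part of the line, so the unmatched prefix survives.
partial=$(printf '%s\n' "$line" | sed -e 's/one is\(.*\)String/\1/')

# The pattern covers the whole line, so only the captured group remains.
whole=$(printf '%s\n' "$line" | sed -e 's/.*one is\(.*\)String.*/\1/')

# Brackets make the leading and trailing spaces visible:
# [Here is a  a ]
# [ a ]
printf '[%s]\n' "$partial" "$whole"
```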

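This breaks when the input is Here is a String Here is a String

It would be nice to see a solution for this case: "Here is a blah blah String Here is 1 a blah blah String Here is 2 a blash blash String". The output should pick up only the first substring between Here and String.

@JayD sed does not support non-greedy matching; see this question for some recommended alternatives.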

wheeler

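The accepted answer does not remove text that may come before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and after String.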

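Your answer is promising. One question, though: if there are multiple occurrences of String on the same line, how can I extract only up to the first one? Thanks

@MianAsbatAhmad You would want to make the * quantifier between Here and String non-greedy (or lazy). However, according to this Stackoverflow question, the flavor of regex used by sed does not support lazy quantifiers (a ? immediately after .*). Usually, to emulate a lazy quantifier you would match everything except the token you don't want to match, but in this case the token is not a single character; it is a whole string, String.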
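One common workaround, sketched here under the assumption of GNU sed (where \n in the replacement inserts a newline), is to turn the first Here into a newline. A line read by sed contains no newline of its own, so that character can serve as the "single token" that a greedy match stops at. The function name first_between is made up for this example.

```shell
# Assumes GNU sed: \n in the replacement side inserts a newline.
first_between() {
  # 1. Only handle lines where a String follows a Here.
  # 2. Replace the first Here with a newline, then delete through that newline.
  # 3. s/String.*// starts at the leftmost String, i.e. the first one left.
  sed -n '/Here.*String/{s/Here/\n/; s/.*\n//; s/String.*//; p;}'
}

# Prints " is 1 a " (with the surrounding spaces).
printf '%s\n' 'Here is 1 a String Here is 2 a String' | first_between
```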

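Thanks, I got the answer using awk: stackoverflow.com/questions/51041463/…

Unfortunately, this doesn't work if the string contains line breaks

It's not supposed to. . doesn't match newlines. If you want to match newlines, you can replace . with something like [\s\S].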

ghoti

You can strip strings in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$

If you have a GNU grep that includes PCRE, you can use a zero-width assertion:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a
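When the markers repeat, the choice between the single and doubled operators in the parameter expansions above decides which span you get. A short sketch with an invented sample string:

```shell
s='Here is a String Here is another String'

# "#" strips the shortest matching prefix, "##" the longest;
# "%" strips the shortest matching suffix, "%%" the longest.
first=${s#*Here };  first=${first%% String*}    # first Here to the next String
widest=${s#*Here }; widest=${widest% String*}   # first Here to the last String
last=${s##*Here };  last=${last%% String*}      # last Here to the next String

# Prints, one per line:
#   is a
#   is a String Here is another
#   is another
printf '%s\n' "$first" "$widest" "$last"
```

These expansions are plain POSIX shell, so they also work outside Bash.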

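Why is this method so slow? When stripping a large html page with this method, it takes about 10 seconds.

@AdamJohns, which method? The PCRE one? PCRE is fairly complex to parse, but 10 seconds seems extreme. If you're concerned, I recommend you pose a question that includes example code, and see what the experts say.

I think it was so slow for me because it was holding the source of a very large html file in a variable. When I wrote the contents to a file and then parsed the file, the speed increased dramatically.

This should be the accepted answer, since it uses pure Bash.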

Juve

If you have a long file with many multi-line occurrences, it is useful to print the line numbers first:

cat -n file | sed -n '/Here/,/String/p'

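Thanks! This is the only solution that worked in my case (a multi-line text file rather than a single string with no line breaks). Obviously, to get the output without line numbers, the -n option of cat has to be omitted.

...in which case cat can be omitted entirely; sed knows how to read a file or standard input.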
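For the multi-line case, the marker lines themselves can be dropped inside the range, which is what the question asked for. A minimal sketch; the function name between_lines and the sample input are made up, and it assumes the markers sit on their own lines.

```shell
# Hypothetical helper: print only the lines strictly between the marker lines.
# Reads the files given as arguments, or standard input when there are none.
between_lines() {
  sed -n '/Here/,/String/{/Here/d;/String/d;p;}' "$@"
}

# Prints "is" and "a" on separate lines.
printf '%s\n' 'intro' 'Here' 'is' 'a' 'String' 'outro' | between_lines
```

Note that any line inside the range that itself contains Here or String is dropped as well.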

Avinash Raj

Through GNU awk,

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a

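grep with the -P (perl-regexp) option supports \K, which discards the previously matched characters. In our case, the previously matched string was Here, so it was dropped from the final output.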

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a

If you want the output to be is a, then you could try the following,

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

This does not work for: echo "Here is a string dfdsf Here is a string" | awk -v FS="(Here|string)" '{print $2}'. It returns only is a instead of is a is a @Avinash Raj
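With the same field separator, every even-numbered field lies between a Here and the following string, so looping over the fields picks up all of the spans. This is only a sketch: the function name all_between is invented, and it assumes the two markers strictly alternate on each line, which it does not verify.

```shell
# Hypothetical helper; assumes the markers alternate: Here ... string ... Here ... string
all_between() {
  # A multi-character field separator is treated as a regular expression,
  # so the line is split on either marker.
  awk -F'Here|string' '{ for (i = 2; i <= NF; i += 2) print $i }'
}

# Prints " is a " twice, once per line.
printf '%s\n' 'Here is a string dfdsf Here is a string' | all_between
```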

Ivan

You can use two s commands

$ echo "Here is a String" | sed 's/.*Here//; s/String.*//'
 is a

This also works

$ echo "Here is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a

$ echo "Here is a StringHere is a StringHere is a StringHere is a String" | sed 's/.*Here//; s/String.*//'
 is a
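When the repeated segments differ, it becomes visible which one the two s commands keep: the greedy .*Here deletes through the last Here, and String.* then cuts at the first String that remains. A quick sketch with an invented input and function name:

```shell
# Same two substitutions as above, wrapped in a function for the demo.
last_span() {
  sed 's/.*Here//; s/String.*//'
}

# Prints " is 2 a ": the span that follows the last Here.
printf '%s\n' 'Here is 1 a String Here is 2 a String' | last_span
```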

potong

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file

This prints each piece of text found between the two markers (in this case Here and String) on its own line, and preserves newlines within the text.
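The one-liner is dense, so here is the same script spread over several lines with comments. This is my reading of it rather than the author's own explanation; it assumes GNU sed, and the wrapper name extract_spans is made up.

```shell
# Same commands as the one-liner, one per line (GNU sed assumed).
extract_spans() {
  sed '
    # Drop lines that contain no start marker.
    /Here/!d
    # The empty regex reuses /Here/: add a newline after the first Here,
    # then delete everything through that newline.
    s//&\n/
    s/.*\n//
    :a
    # Once the end marker is in the pattern space, jump to label b.
    /String/bb
    # Otherwise, unless this is the last line, print the current line,
    # read the next one and test again.
    $!{
      n
      ba
    }
    :b
    # The empty regex now reuses /String/: add a newline before it.
    s//\n&/
    # P prints up to that newline; D deletes through it and restarts the
    # script on the remainder, so later spans on the same line are found.
    P
    D
  ' "$@"
}

# Prints " is a " and " is another " on separate lines.
printf '%s\n' 'Here is a String and Here is another String' | extract_spans
```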

Anonymous

To understand the sed command, we have to build it step by step.

Here is your original text

user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$

Let's try to remove the Here string using the substitution option in sed

user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$

At this point, I believe you would be able to remove String as well

user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$

But this is not your desired output.

To combine the two sed commands, use the -e option

user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$

Hope this helps

Gary Dean

All of the above solutions have a deficiency when the last search string is repeated elsewhere in the string. I found it best to write a bash function.

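    # Print the text between the first occurrence of $2 and the next
    # occurrence of $3 in $1. Quoting the markers makes them literal text.
    function str_str {
      local str
      str="${1#*"$2"}"       # strip the shortest prefix ending in the start marker
      str="${str%%"$3"*}"    # strip the longest suffix starting at the end marker
      printf '%s' "$str"
    }

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"    # prints "is a"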

Peter Mortensen

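You can use \1 (see http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

The contents inside the parentheses are stored as \1.

This removes the strings rather than outputting what lies between them. Try replacing "Hello" with "is" in the sed command and it will output "Hello a"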

Victoria Stuart

Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject line:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Per A2 in this thread, How to use sed/grep to extract text between two words?, the first expression below "works" as long as the matched text does not include a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per Extract text between two strings on different lines

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

This gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

Solution 2.

Per How can I replace a newline (\n) using sed?

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace newlines with spaces.

Chaining that with A2 in How to use sed/grep to extract text between two words?, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

This gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]]

This variant removes the double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]

Nice adventure :))
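As an alternative to joining the whole file, the wrapped header can be unfolded directly, treating any line that starts with a space or tab as a continuation of the Subject line. The sketch below swaps in awk for the sed and grep pipeline above; the function name subject_line and the sample message in the usage line are invented, and it assumes one message per input.

```shell
# Hypothetical helper: print the unfolded Subject header of one message.
subject_line() {
  awk '
    /^Subject: / {             # header start: keep the text after "Subject: "
      subj = substr($0, 10); in_subj = 1; next
    }
    in_subj && /^[ \t]/ {      # continuation line: collapse its leading whitespace
      sub(/^[ \t]+/, " "); subj = subj $0; next
    }
    in_subj { exit }           # first non-continuation line ends the header
    END { if (in_subj) print subj }
  ' "$@"
}

# Prints "Key molecular link in major pathway".
printf '%s\n' 'Subject: Key molecular' ' link in major' ' pathway' \
  'Message-ID: <x@example.com>' | subject_line
```

With a real message it would be called as subject_line corpus/01. Because it stops at the first line that is not a continuation, it does not depend on Message-ID being the next header.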

kenorb

ripgrep

Here is an example using rg:

$ echo Here is a String | rg 'Here\s(.*)\sString' -r '$1'
is a
