
XPath contains(text(),'some string') doesn't work when used with a node that has more than one Text subnode

I have a small problem with XPath contains with dom4j ...

Let's say my XML is

<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment>
    </Addr>
</Home>

Let's say I want to find all the nodes that have ABC in the text given the root Element...

So the XPath that I would need to write would be

//*[contains(text(),'ABC')]

However, this is not what dom4j returns. The query returns only the Street element and not the Comment element. Is this a dom4j problem, or a gap in my understanding of how XPath works?

The DOM makes the Comment element a composite element with four child nodes, two of which are text nodes:

[Text = 'BLAH BLAH BLAH '][BR][BR][Text = 'ABC']

I would assume that the query should still return the element since it should find the element and run contains on it, but it doesn't ...

The following query returns the element, but it returns far more than just the element: it also returns the parent elements, which is undesirable for this problem.

//*[contains(text(),'ABC')]

Does anyone know the XPath query that would return just the elements <Street/> and <Comment/>?

As far as I can tell, //*[contains(text(),'ABC')] returns only the <Street> element. It doesn't return any ancestors of <Street> or <Comment>.
None of the answers addressed the different behavior found in new versions of XPath (versions 2.0 and above, starting in ~2007) so I've added an updated answer below to explain the difference.

Ken Bloom

The <Comment> tag contains two text nodes and two <br> nodes as children.

Your xpath expression was

//*[contains(text(),'ABC')]

To break this down,

1. * is a selector that matches any element (i.e. tag) -- it returns a node-set.
2. The [] are a conditional that operates on each individual node in that node set. It matches if any of the individual nodes it operates on match the conditions inside the brackets.
3. text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node set.
4. contains is a function that operates on a string. If it is passed a node set, the node set is converted into a string by returning the string-value of the node in the node-set that is first in document order. Hence, it can match only the first text node in your <Comment> element -- namely BLAH BLAH BLAH. Since that doesn't match, you don't get a <Comment> in your results.

You need to change this to

//*[text()[contains(.,'ABC')]]

1. * is a selector that matches any element (i.e. tag) -- it returns a node-set.
2. The outer [] are a conditional that operates on each individual node in that node set -- here it operates on each element in the document.
3. text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node set.
4. The inner [] are a conditional that operates on each node in that node set -- here each individual text node. Each individual text node is the starting point for any path in the brackets, and can also be referred to explicitly as . within the brackets. It matches if any of the individual nodes it operates on match the conditions inside the brackets.
5. contains is a function that operates on a string. Here it is passed an individual text node (.). Since it is passed the second text node in the <Comment> tag individually, it will see the 'ABC' string and be able to match it.
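The difference between the two expressions can be checked with a short program. This is only a sketch: it uses the JDK's built-in javax.xml.xpath (an XPath 1.0 engine) rather than dom4j, and the class name FirstTextNodeDemo and the helper select are made up for illustration. The XML is the document from the question with the indentation removed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of dom4j or of the original answer.
class FirstTextNodeDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Evaluates an XPath 1.0 expression and returns the names of the selected nodes.
    static List<String> select(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // contains() sees only the first child text node of each element.
        System.out.println(select("//*[contains(text(),'ABC')]"));    // [Street]
        // The inner predicate tests every child text node individually.
        System.out.println(select("//*[text()[contains(.,'ABC')]]")); // [Street, Comment]
    }
}
```

The same two expressions should behave the same way in any XPath 1.0 engine, dom4j included, since the first-node conversion rule comes from the XPath 1.0 specification rather than from a particular library.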


Awesome. I'm a bit of an XPath noob, so let me get this straight: text() is a function that takes the expression contains(.,'ABC')? Is there a chance you can explain, so I don't do this kind of stupid stuff again ;)
I've edited my answer to provide a long explanation. I don't really know that much about XPath myself -- I just experimented a bit until I stumbled on that combination. Once I had a working combination, I made a guess what was going on and looked in the XPath standard to confirm what I thought was going on and write the explanation.
How would you make this a case insensitive search?
I know this is an old thread, but can anyone comment on if there is a fundamental difference, preferably with some simple test cases between the answer given by Ken Bloom and //*[contains(., 'ABC')]. I had always used the pattern given by Mike Milkin, thinking it was more appropriate, but just doing contains on the current context seems to actually be what I want more often.
...//*[text()[contains(.,'ABC')]] means any element for which text()[contains(.,'ABC')] is true. text()[contains(.,'ABC')] is a node-set of all text node children of the context node for which contains(.,'ABC') is true. Since text()[contains(.,'ABC')] is a node-set, it's converted to boolean by boolean() function. For a node-set, boolean() returns true if it's not empty.
Community

The XML document:

<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment>
    </Addr>
</Home>

The XPath expression:

//*[contains(text(), 'ABC')]

//* matches any descendant element of the root node. The root node here is the document node, not the Home element, so //* matches every element in the document, Home included.

[...] is a predicate, it filters the node-set. It returns nodes for which ... is true:

A predicate filters a node-set [...] to produce a new node-set. For each node in the node-set to be filtered, the PredicateExpr is evaluated [...]; if PredicateExpr evaluates to true for that node, the node is included in the new node-set; otherwise, it is not included.

contains('haystack', 'needle') returns true if haystack contains needle:

Function: boolean contains(string, string) The contains function returns true if the first argument string contains the second argument string, and otherwise returns false.

But contains() takes a string as its first parameter. And it's passed nodes. To deal with that every node or node-set passed as the first parameter is converted to a string by the string() function:

An argument is converted to type string as if by calling the string function.

string() function returns string-value of the first node:

A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.

string-value of an element node:

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

string-value of a text node:

The string-value of a text node is the character data.

So, basically string-value is all text that is contained in a node (concatenation of all descendant text nodes).

text() is a node test that matches any text node:

The node test text() is true for any text node. For example, child::text() will select the text node children of the context node.

With that said, //*[contains(text(), 'ABC')] matches any element (but the root node) whose first child text node contains ABC. text() returns a node-set that contains all child text nodes of the context node (the node relative to which an expression is evaluated), but contains() takes only the first one. So for the document above the path matches only the Street element.

The following expression, //*[text()[contains(., 'ABC')]], matches any element (but the root node) that has at least one child text node that contains ABC. . represents the context node. In this case, it's a child text node of any element but the root node. So for the document above the path matches the Street and the Comment elements.

Now then, //*[contains(., 'ABC')] matches any element (but the root node) that contains ABC (in the concatenation of the descendant text nodes). For the document above it matches the Home, the Addr, the Street, and the Comment elements. Similarly, //*[contains(., 'BLAH ABC')] matches the Home, the Addr, and the Comment elements.
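To make the string-value rule concrete, here is a small sketch that prints the string-value of the Comment element and then runs the two contains(., ...) expressions. It assumes the JDK's javax.xml.xpath (XPath 1.0); the class name StringValueDemo and its helpers are hypothetical, and the XML is the document above with the indentation removed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class StringValueDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    static InputSource source() {
        return new InputSource(new StringReader(XML));
    }

    // Evaluates the expression and converts the result to a string, as string() would.
    static String stringValue(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, source());
    }

    // Evaluates the expression as a node-set and returns the selected element names.
    static List<String> select(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, source(), XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Concatenation of all descendant text nodes of Comment.
        System.out.println(stringValue("string(//Comment)"));      // BLAH BLAH BLAH ABC
        // Ancestors match too, because their string-values include the same text.
        System.out.println(select("//*[contains(., 'ABC')]"));      // [Home, Addr, Street, Comment]
        System.out.println(select("//*[contains(., 'BLAH ABC')]")); // [Home, Addr, Comment]
    }
}
```

Note that 'BLAH ABC' never appears in any single text node; it only exists in the concatenated string-value, which is why Street drops out of the last result while Comment and its ancestors remain.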


Like the accepted answer, this answer relates only to XPath 1.0. The situation with XPath 2.0 (released 2007) and later versions is different.
Toby Speight

[contains(text(),'')] only returns true or false. It won't return any element results.


This won't work if I had ' ' or ' '. How can we trim?
Roger Veciana

The accepted answer will return all the parent nodes too. To get only the actual nodes with ABC, even if the string is after <br/>:

//*[text()[contains(.,'ABC')]]/text()[contains(.,"ABC")]

In case someone is curious how to get the parent element of the text node instead, suffix the query with /.. like so: //*[text()[contains(.,'ABC')]]/text()[contains(.,"ABC")]/.. Thanks, @roger!
learningIsFun
//*[text()='ABC'] 

returns

<Street>ABC</Street>
<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment>
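The result above follows from how XPath 1.0 compares a node-set with a string: the comparison is true if any node in the set has a string-value equal to the string. So text()='ABC' holds for Comment because its second text node is exactly 'ABC'. A minimal sketch to check this, assuming the JDK's javax.xml.xpath (XPath 1.0) and a hypothetical class named TextEqualsDemo:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class TextEqualsDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Evaluates an XPath 1.0 expression and returns the names of the selected nodes.
    static List<String> select(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // True when ANY child text node equals 'ABC' exactly.
        System.out.println(select("//*[text()='ABC']")); // [Street, Comment]
        // Equality is a whole-node match, not a substring match.
        System.out.println(select("//*[text()='AB']"));  // []
    }
}
```

Unlike contains, this is an exact match on a whole text node, so it would miss an element whose text is, say, 'ABC Street'.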

When adding an answer to a nine year old question with five existing answers it is very important to point out what unique new aspect of the question your answer addresses.
The answer I posted is very simple, so I thought I'd share it in case it helps beginners like me.
kjhughes

Modern answer that covers XPath 1.0 vs XPath 2.0+ behavior ...

This XPath,

//*[contains(text(),'ABC')]

behaves differently with XPath 1.0 and later versions of XPath (2.0+).

Common behavior

//* selects all elements within a document.

[] filters those elements according to the predicate expressed therein.

contains(string, substring) within the predicate will filter those elements to those for which substring is a substring in string.

XPath 1.0 behavior

contains(string, substring) will convert a node set to a string by taking the string value of the first node in the node set.

For //*[contains(text(),'ABC')] that node set will be all child text nodes of each element in the document.

Since only the first text node child is used, the expectation that all child text nodes are tested for 'ABC' substring containment is violated.

This leads to counter-intuitive results to anyone unfamiliar with the above conversion rules.

XPath 1.0 online example shows that only one 'ABC' is selected.

XPath 2.0+ behavior

It is an error to call contains(string, substring) with a sequence of more than one item as the first argument.

This corrected the counter-intuitive behavior described above in XPath 1.0.

XPath 2.0 online example shows a typical error message due to the conversion error particular to XPath 2.0+.

Common solutions

If you wish to include descendant elements (beyond children), test against the string value of an element as a single string rather than the individual string values of the child text nodes. This XPath, //*[contains(.,'ABC')], selects your targeted Street and Comment elements and also their Addr and Home ancestor elements, because those too have 'ABC' as a substring of their string values. Online example shows ancestors being selected too.

If you wish to exclude descendant elements (beyond children), this XPath, //*[text()[contains(.,'ABC')]], selects only your targeted Street and Comment elements, because only those elements have text node children whose string values contain the 'ABC' substring. This will be true for all versions of XPath. Online example shows only Street and Comment being selected.


Eaten by a Grue

Here is an alternate way to match nodes which contain a given text string. First query for the text node itself, then get the parent:

//text()[contains(., "ABC")]/..
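A quick way to see what each step of this expression selects is to run it with and without the trailing /.. step. The sketch below assumes the JDK's javax.xml.xpath (XPath 1.0) and the XML document from the question with the indentation removed; the class name TextParentDemo is hypothetical, and single quotes are used inside the XPath only to avoid escaping in the Java string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration only.
class TextParentDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Returns the DOM node names of the selected nodes ("#text" for text nodes).
    static List<String> select(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Step 1: the text nodes themselves.
        System.out.println(select("//text()[contains(., 'ABC')]"));    // [#text, #text]
        // Step 2: /.. moves from each matching text node to its parent element.
        System.out.println(select("//text()[contains(., 'ABC')]/..")); // [Street, Comment]
    }
}
```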

For me this is easy to read and understand.


phuongauto

This is the best answer for the topic question:

//*[text()[contains(.,'ABC')]]/text()[contains(.,"ABC")]

An example case: an XPath to get the text 'bon dua madam':

//h3[text()='Contact Information']/parent::div/following-sibling::div/p[text()[contains(.,'bon dua madam')]]/text()[contains(.,'bon dua madam')]

zagoo2000

It took me a little while, but I finally figured it out. The custom XPath below, which matches on contained text, worked perfectly for me.

//a[contains(text(),'JB-')]

contains(text(),'JB-') does not work! contains takes two strings as arguments: contains(string, string)! text() is not a string, it is a function!
@AtachiShadow The function's result is a string
