ChatGPT解决这个技术问题 Extra ChatGPT

Fastest way to find second (third...) highest/lowest value in vector or column

R offers max and min, but I do not see a really fast way to find another value in the order, apart from sorting the whole vector and then picking a value x from this vector.

Is there a faster way to get the second highest value, for example?

The package kit on CRAN has a topn function which is faster than sort, order and nth. Look at the documentation.
@Suresh_Patel could you provide examples benchmarking it against the examples provided by Rfast::nth? If it realy is faster when fairly compared to Rfast::nth then it should be the accepted answer
@Stefanos, I posted the benchmark below ...based on your benchmark
I just did a second run with kit::topn(hasna=F)...I believe I provided the best answer now, didn't I?

R
Rob Hyndman

Use the partial argument of sort(). For the second highest value:

n <- length(x)
sort(x,partial=n-1)[n-1]

What is the advantage of this method as opposed to sort(x, TRUE)[2] as described in @Abrar's answer, apart from not satisfying the constraint in the question?
I used this method, but get the following error: Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) : index 4705 outside bounds Any idea what might the issue be? Some details: My x is a numeric vector of length 4706 with some NAs in the data. I tried to get the second highest value in the vector using the exact same code as @RobHyndman suggested.
The descreasing argument is not compatible with partial sorting.
@sriramn I know you asked this 3.5 years ago, but this solution won't work with missing values because sort removes the missing values. One fix would be n <- sum(!is.na(x)); sort(x,partial=n-1)[n-1]
Though the decreasing argument is not compatible with partial sorting, you could always -sort(-x, partial=n-1)[n-1]; it is logically the same thing and takes considerably less time than sort(x, decreasing=TRUE)[n-1].
P
Paolo

Slightly slower alternative, just for the records:

x <- c(12.45,34,4,0,-234,45.6,4)
max( x[x!=max(x)] )
min( x[x!=min(x)] )

It would seem surprising if this was any faster than sorting the whole vector and taking the n-1th value!
@jwg This is O(n) so it has to be faster than sorting on large datasets.
It seems to me you can get some considerable speed improvement with a small modification: max(x[-which.max(x)])
This answer produces an error if all values are the same, unless you use @sindri_baldur's answer (and there are at least 2 items, of course)
S
Stefanos

Rfast has a function called nth_element that does exactly what you ask.

Further the methods discussed above that are based on partial sort, don't support finding the k smallest values

Update (28/FEB/21) package kit offers a faster implementation (topn) see https://stackoverflow.com/a/66367996/4729755, https://stackoverflow.com/a/53146559/4729755

Disclaimer: An issue appears to occur when dealing with integers which can by bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)), and will be addressed in the next update of Rfast.

Rfast::nth(x, 5, descending = T)

Will return the 5th largest element of x, while

Rfast::nth(x, 5, descending = F)

Will return the 5th smallest element of x

Benchmarks below against most popular answers.

For 10 thousand numbers:

N = 10000
x = rnorm(N)

maxN <- function(x, N=2){
    len <- length(x)
    if(N>len){
        warning('N greater than length(x).  Setting N=length(x)')
        N <- length(x)
    }
    sort(x,partial=len-N+1)[len-N+1]
}

microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxn = maxN(x,5),
order = x[order(x, decreasing = T)[5]])

Unit: microseconds
  expr      min       lq      mean   median        uq       max neval
 Rfast  160.364  179.607  202.8024  194.575  210.1830   351.517   100
  maxN  396.419  423.360  559.2707  446.452  487.0775  4949.452   100
 order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148   100

For 1 million numbers:

N = 1e6
x = rnorm(N)

microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]]) 

Unit: milliseconds
  expr      min        lq      mean   median        uq       max neval
 Rfast  89.7722  93.63674  114.9893 104.6325  120.5767  204.8839   100
  maxN 150.2822 207.03922  235.3037 241.7604  259.7476  336.7051   100
 order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129   100

Nice! Normally when I see a relatively low-rep user add an answer to a popular old question it's pretty low-quality. This, on the other hand, is an excellent addition. I made a couple readability edits, but it looks great!
It bears mentioning that Rfast::nth can return multiple elements (e.g. 8th and 9th largest elements) as well as the indices of those elements.
What I like about the Rfast solution is that the package also has an easily implemented solution for doing this for each row or column.
There is a bug in nth for integer values. I know it and I will fix it for future update of package. For now you can just use Rfast::nth(as.numeric(1:10), 2). Although, I don't really think that Rfast::nth(1:10, 2) is a great example. If you have a sorted array why do you want to use nth? It is a lot faster to check if it is sorted and then extract the value or even better extract the value itself.
Z
Zach

I wrapped Rob's answer up into a slightly more general function, which can be used to find the 2nd, 3rd, 4th (etc.) max:

maxN <- function(x, N=2){
  len <- length(x)
  if(N>len){
    warning('N greater than length(x).  Setting N=length(x)')
    N <- length(x)
  }
  sort(x,partial=len-N+1)[len-N+1]
}

maxN(1:10)

Cool. This usage is particularly useful maxN(1:10, 1:3) (I would have set the default N to 1)
Why not have the main line in the fx as sort(x, decreasing=T, partial=N)[N]?
D
Davit Sargsyan

Here is an easy way to find the indices of N smallest/largest values in a vector(Example for N = 3):

N <- 3

N Smallest:

ndx <- order(x)[1:N]

N Largest:

ndx <- order(x, decreasing = T)[1:N]

So you can extract the values as:

x[ndx]

This runs in L log L time, where L is the length of x. I think the user was hoping for a method that runs in log L time.
This might be the second fastest way if the methods were ordered by time and the fastest N extracted. I also like it because it is very clear code compared to the accepted solution.
The theoretical best and the accepted method (hopefully) runs in O(L) time, not O(log L). This one runs in O(L log L).
D
DaveShaw

For nth highest value,

sort(x, TRUE)[n]

The OP already said in his post that this was a solution he did not want to use: "apart from sorting the whole vector and than picking value x from this vector".
Handy as one can easily grab the three (four, whatever) highest sort(x, TRUE)[1:3]
S
Suresh_Patel

Here you go... kit is the obvious winner!

N = 1e6
x = rnorm(N)

maxN <- function(x, N=2){
  len <- length(x)
  if(N>len){
    warning('N greater than length(x).  Setting N=length(x)')
    N <- length(x)
  }
  sort(x,partial=len-N+1)[len-N+1]
}

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x,5,descending = T),
  maxN = maxN(x,5),
  order = x[order(x, decreasing = T)[5]],
  kit = x[kit::topn(x, 5L,decreasing = T)[5L]]
) 
# Unit: milliseconds
# expr       min        lq     mean    median        uq        max neval
# Rfast 12.311168 12.473771 16.36982 12.702134 16.110779 102.749873   100
# maxN  12.922118 13.124358 17.49628 18.977537 20.053139  28.928694   100
# order 50.443100 50.926975 52.54067 51.270163 52.323116  66.561606   100
# kit    1.177202  1.216371  1.29542  1.240228  1.297286   2.771715   100

Edit: I forgot that kit::topn has hasna option...let's do another run.

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x,5,descending = T),
  maxN = maxN(x,5),
  order = x[order(x, decreasing = T)[5]],
  kit = x[kit::topn(x, 5L,decreasing = T)[5L]],
  kit2 = x[kit::topn(x, 5L,decreasing = T,hasna = F)[5L]],
  unit = "ms"
) 
# Unit: milliseconds
# expr       min        lq       mean     median        uq       max neval
# Rfast 13.194314 13.358787 14.7227116 13.4560340 14.551194 24.524105   100
# maxN   7.378960  7.527661 10.0747803  7.7119715 12.217756 67.409526   100
# order 50.088927 50.488832 52.4714347 50.7415680 52.267003 70.062662   100
# kit    1.180698  1.217237  1.2975441  1.2429790  1.278243  3.263202   100
# kit2   0.842354  0.876329  0.9398055  0.9109095  0.944407  2.135903   100

V
Vin

Here is the simplest way I found,

num <- c(5665,1615,5154,65564,69895646)

num <- sort(num, decreasing = F)

tail(num, 1)                           # Highest number
head(tail(num, 2),1)                   # Second Highest number
head(tail(num, 3),1)                   # Third Highest number
head(tail(num, n),1)                   # Generl equation for finding nth Highest number

J
John Jiang

I found that removing the max element first and then do another max runs in comparable speed:

system.time({a=runif(1000000);m=max(a);i=which.max(a);b=a[-i];max(b)})
   user  system elapsed 
  0.092   0.000   0.659 

system.time({a=runif(1000000);n=length(a);sort(a,partial=n-1)[n-1]})
   user  system elapsed 
  0.096   0.000   0.653 

N
Noale

dplyr has the function nth, where the first argument is the vector and the second is which place you want. This goes for repeating elements as well. For example:

x = c(1,2, 8, 16, 17, 20, 1, 20)

Finding the second largest value:

 nth(unique(x),length(unique(x))-1)

[1] 17

is this fast ... ?
internally this uses x[[order(order_by)[[n]]]] - so it requires sorting the whole vector. So it won't be as fast as the accepted answer.
but it uses sort with the partial= argument (which changes everything)
@BenBolker which implies Paolo's or Rob's answer could be used to improve dplyr::nth()? bench::mark(max(x[-which.max(x)]), x[[order(-x)[[2]]]] ), nth()seems almost 10 times slower, where length(x) is 3 million.
D
Donarus

When I was recently looking for an R function returning indexes of top N max/min numbers in a given vector, I was surprised there is no such a function.

And this is something very similar.

The brute force solution using base::order function seems to be the easiest one.

topMaxUsingFullSort <- function(x, N) {
  sort(x, decreasing = TRUE)[1:min(N, length(x))]
}

But it is not the fastest one in case your N value is relatively small compared to length of the vector x.

On the other side if the N is really small, you can use base::whichMax function iteratively and in each iteration you can replace found value by -Inf

# the input vector 'x' must not contain -Inf value 
topMaxUsingWhichMax <- function(x, N) {
  vals <- c()
  for(i in 1:min(N, length(x))) {
    idx      <- which.max(x)
    vals     <- c(vals, x[idx]) # copy-on-modify (this is not an issue because idxs is relative small vector)
    x[idx]   <- -Inf            # copy-on-modify (this is the issue because data vector could be huge)
  }
  vals
}

I believe you see the problem - the copy-on-modify nature of R. So this will perform better for very very very small N (1,2,3) but it will rapidly slow down for larger N values. And you are iterating over all elements in vector x N times.

I think the best solution in clean R is to use partial base::sort.

topMaxUsingPartialSort <- function(x, N) {
  N <- min(N, length(x))
  x[x >= -sort(-x, partial=N)[N]][1:N]
}

Then you can select the last (Nth) item from the result of functions defiend above.

Note: functions defined above are just examples - if you want to use them, you have to check/sanity inputs (eg. N > length(x)).

I wrote a small article about something very similar (get indexes of top N max/min values of a vector) at http://palusga.cz/?p=18 - you can find here some benchmarks of similar functions I defined above.


R
Robert

head(sort(x),..) or tail(sort(x),...) should work


R
Ralph

This will find the index of the N'th smallest or largest value in the input numeric vector x. Set bottom=TRUE in the arguments if you want the N'th from the bottom, or bottom=FALSE if you want the N'th from the top. N=1 and bottom=TRUE is equivalent to which.min, N=1 and bottom=FALSE is equivalent to which.max.

FindIndicesBottomTopN <- function(x=c(4,-2,5,-77,99),N=1,bottom=FALSE)
{

  k1 <- rank(x)
  if(bottom==TRUE){
    Nindex <- which(k1==N)
    Nindex <- Nindex[1]
  }

  if(bottom==FALSE){
    Nindex <- which(k1==(length(x)+1-N))
    Nindex <- Nindex[1]
  }

  return(Nindex)
}

v
vdc320
topn = function(vector, n){
  maxs=c()
  ind=c()
  for (i in 1:n){
    biggest=match(max(vector), vector)
    ind[i]=biggest
    maxs[i]=max(vector)
    vector=vector[-biggest]
  }
  mat=cbind(maxs, ind)
  return(mat)
}

this function will return a matrix with the top n values and their indices. hope it helps VDevi-Chou


p
perror

You can identify the next higher value with cummax(). If you want the location of the each new higher value for example you can pass your vector of cummax() values to the diff() function to identify locations at which the cummax() value changed. say we have the vector

v <- c(4,6,3,2,-5,6,8,12,16)
cummax(v) will give us the vector
4  6  6  6  6  6  8 12 16

Now, if you want to find the location of a change in cummax() you have many options I tend to use sign(diff(cummax(v))). You have to adjust for the lost first element because of diff(). The complete code for vector v would be:

which(sign(diff(cummax(v)))==1)+1

I think you misunderstand the question. The goal is to find, say, the second highest value. How does this help to get you from v to 12... and for the third highest to 8?
a
alko989

You can use the sort keyword like this:

sort(unique(c))[1:N]

Example:

c <- c(4,2,44,2,1,45,34,2,4,22,244)
sort(unique(c), decreasing = TRUE)[1:5]

will give the first 5 max numbers.