> For the complete documentation index, see [llms.txt](https://evarga.gitbook.io/sh-intro-to-algs/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://evarga.gitbook.io/sh-intro-to-algs/part-ii-sorting-and-order-statistics/9.-medians-and-order-statistics.md).

# 9. Medians and Order Statistics

## Exercises

### 9.1-1

**Available in the latest revision of the IM.**

### 9.1-2

**Available in the latest revision of the IM.**

### 9.1-3

The minimum number of races it takes to determine the fastest three horses out of 25 is 7. Here is the procedure.

{% stepper %}
{% step %}

### Determine the fastest horse in each group

Organize 5 races to know the fastest horse in each group.
{% endstep %}

{% step %}

### Determine the fastest horse overall

Organize the 6th race to see which horse is the fastest overall. This is one of the three we are looking for.
{% endstep %}

{% step %}

### Determine the other two horses

Using the result of the previous race we can select candidates for the final race. The forth and fifth placed horse, together with their groups, can be immediately eliminated. They cannot be candidates for the second and third place. Furthermore, the rest of the horses from a group of the third placed horse can be excluded, too. We need to take 2 horses from the winner's group and 1 horse from the group of the second placed horse. Now, we have 5 horses for the 7th race. The first and second horse from this race are the additional two we are looking for.
{% endstep %}
{% endstepper %}

### ★ 9.1-4

Let $$A$$ represent the set of numbers which are potentially the minimum and let $$B$$ denote the set of numbers which are potentially the maximum. Initially, all $$n$$ numbers are potentially either the minimum or maximum, so $$|A|=|B|=n$$. There are two types of comparisons:

1. Between two numbers that are members of both sets.
2. Between two numbers from the same set.

The first type of comparison reduces the size of both $$A$$ and $$B$$ by one. The reason is that a smaller of the two numbers cannot anymore be a member of $$B$$ and the larger of the two numbers cannot anymore remain in $$A$$. The second type of comparison only reduces the matching set by one without having any impact on the other. So, to attain the lower bound we must leverage the first type of comparison as much as possible.

As with proving the upper bound, we process elements in pairs. We can form $$\lfloor n/2 \rfloor$$ pairs, so after the first stage $$|A|=|B|=\lceil n/2 \rceil$$. Observe that these sets can have at most one element in common, which happens when $$n$$ is odd. At any rate, the minimum must be searched among members of $$A$$ and the maximum among those from $$B$$. We already know the lower bound on the amount of comparisons needed, which is $$m-1$$ for a set of $$m$$ elements. This is where we can only use the second type of comparison (see above).

The total number of comparisons required is

$$
\lfloor n/2 \rfloor+2(\lceil n/2 \rceil-1)=n+\lceil n/2 \rceil-2=\lceil 3n/2 \rceil-2.
$$

### 9.2-1

**Available in the latest revision of the IM.**

### 9.2-2

**Available in the latest revision of the IM.**

### 9.2-3

In each partitioning the maximum of the current subarray should be selected as a pivot. For example, the first iteration should choose 9, the second 8, and so on. The sequence stops when the last subarray becomes \[0].

### 9.2-4

**Available in the latest revision of the IM.**

### 9.3-1

**Available in the latest revision of the IM.**

### 9.3-2

Let $$n=r-p+1$$, thus $$g=\lfloor n/5 \rfloor \le n/5$$. The subarray $$A\[5g+1:r]$$ may contain at most 4 elements, that should be added to both sides of a partition. The recurrence becomes

$$
T(n) \le T(n/5)+T(7n/10 + 4)+\Theta(n).
$$

The driving function satisfies the polynomial-growth condition, so according to the Akra-Bazzi theorem, replacing $$T(7n/10)$$ with $$T(7n/10+4)$$, where $$4=O(n/\lg^{1+\epsilon}n)$$ for some constant $$\epsilon>0$$ and sufficiently large $$n$$, leaves the asymptotic solution unaffected. Therefore, we still have $$\Theta(n)$$ running time.

### 9.3-3

**Available in the latest revision of the IM.**

{% hint style="warning" %}
The solution in the IM is inaccurate, since the ith order statistic of a median is $$i=\lceil (r-p+1)/2 \rceil$$.
{% endhint %}

### ★ 9.3-4

Assume that all elements are distinct, otherwise, augment them with positional information to make them distinct. We apply the decision tree model to prove the statement from the book:

* Each internal node represents a comparison between two elements.&#x20;
* Each leaf corresponds to a conclusion reached by the algorithm after a sequence of comparisons.&#x20;
* For the algorithm to correctly identify the $$i$$th smallest element $$x$$, every input that leads to a leaf where $$x$$ is output must have $$x$$ as the $$i$$th smallest element.

Assume the algorithm reaches a leaf where it outputs $$x$$ as the $$i$$th smallest element. Suppose, for the sake of contradiction, that there exists an element $$y \neq x$$ such that the relation between $$x$$ and $$y$$ is not determined, either directly or via transitive inference, by the comparisons made along the path to that leaf. Then, there exist two inputs sharing the same path toward $$x$$:

* In one $$y\<x$$
* In the other $$y>x$$.

But this is a contradiction. If $$y\<x$$, then the $$i$$th smallest element cannot be $$x$$ given that $$x$$ is also the $$i$$th smallest element when $$y>x$$. Hence, the relation between $$x$$ and every other element $$y$$ must be determined by the comparisons already made. No need for performing any additional comparisons.

### 9.3-5

You can find the solution [here](https://stackoverflow.com/a/481333/4931661).

### 9.3-6

**Available in the latest revision of the IM.**

### 9.3-7

{% hint style="warning" %}
Available in the latest revision of the IM, but it is needlessly overcomplicated.
{% endhint %}

Here is a simple geometric proof that the median of the $$y$$-coordinates minimizes the total spur length. Observe that the $$x$$-coordinates have no significance on the result. Form pairs of wells based on their $$y$$ coordinates in the following manner:

1. Pair the absolute highest well with the absolute lowest well. To minimize their spur lengths, the pipeline must be between them.
2. Pair the second highest with the second lowest. The pipeline must be between them, too.
3. Continue this inward. By doing this, you create a set of nested intervals. To minimize the total distance, the pipeline must be placed in the intersection of all these intervals. The intersection of all these nested pairs naturally collapses precisely onto the median.

Use the linear time algorithm from the book to find the median of $$y$$ coordinates.

### 9.3-8

**Available in the latest revision of the IM.**

{% hint style="danger" %}
The solution in the IM contains errors regarding order statistics, that are fixed below. Notice that we using a generic expression for $$k$$ irrespectively of its parity.

Find the $$\left(\frac{\lfloor k/2 \rfloor}{k}n\right)$$th order statistic using the `SELECT` procedure. Then recursively find the $$\lfloor k/2 \rfloor$$ quantiles of the elements less than and the $$\lceil k/2 \rceil$$ quantiles of the elements greater than the $$\left(\frac{\lfloor k/2 \rfloor}{k}n\right)$$th order statistic.
{% endhint %}

### 9.3-9

**Available in the latest revision of the IM.**

### 9.3-10

**Available in the latest revision of the IM.**

#### More Generic LeetCode Example

Below is the solution in C++ for the [problem](https://leetcode.com/problems/median-of-two-sorted-arrays/) associated with this exercise.

```cpp
class Solution {
public:
    double findMedianSortedArrays(vector<int>& nums1, vector<int>& nums2) {
        // The median must be either in nums1 or nums2.
        double median = findMedianSortedArrays(nums1, nums2, 0, nums1.size() - 1);
        if (isnan(median))
            median = findMedianSortedArrays(nums2, nums1, 0, nums2.size() - 1);
        return median;
    }

private:
    double findMedianSortedArrays(vector<int>& a, vector<int>& b, int low, int high) {
        const int m = a.size();
        const int n = b.size();
        const bool isEven = ((m + n) & 1) == 0;
        const int half = (m + n) >> 1;

        if (n == 0)
            return isEven ? (a[half - 1] + a[half]) / 2.0 : a[half];

        // Use binary search over A to find the median, if possible.
        while (low <= high) {        
            // a[k] is the so-called lower median.
            const int k = (low + high) >> 1;
            // Number of elements in A left of the median.
            int left = k + int(isEven);
            // Index into B immediately left of the median.
            const int q = half - left - 1;
                        
            if (left == half and a[k] <= b.front() or 
                    q >= 0 and q < n and b[q] <= a[k] and (q == n - 1 or a[k] <= b[q + 1]))
                if (isEven) {
                    const double neighborA = (k < m - 1) ? a[k + 1] : INFINITY;
                    const double neighborB = (q < n - 1) ? b[q + 1] : INFINITY;
                    return (a[k] + min(neighborA, neighborB)) / 2.0;
                } else
                    return a[k];
        
            if (q < 0 or q < n - 1 and a[k] > b[q + 1])
                high = k - 1;
            else
                low = k + 1;
        }
        return NAN;
    }
};
```

## Problems

### 9-1 Largest $$i$$ numbers in sorted order

**Available in the latest revision of the IM.**

### 9-2 Variant of randomized selection

**Available in the latest revision of the IM.**

### 9-3 Weighted median

**Available in the latest revision of the IM.**

{% hint style="warning" %}
There are several issues with the solution in the IM, as noted below.

**a.** The "lower" sum of weights are always less than ½. If $$n$$ is even, then we have $$\sum\_{x\_i\<x\_k}w\_i=\frac{(n/2)-1}{n}<1/2$$, since the lower median is at $$k=n/2$$. If $$n$$ is odd, then $$k=\lceil n/2 \rceil$$, so we have $$\lfloor n/2 \rfloor < n/2$$ elements less than the median. Again, the sum is less than ½.

**b.** The reasoning about the stoppage criteria is imprecise. If $$\sum\_{x\_i\<x\_k}w\_i=1/2$$, then we would not have a proper condition for the total (it must be less than ½). Therefore, we must stop just before the accumulated sum becomes equal or exceeds ½, rather than wait until it exceeds ½.

**c.** In the pseudocode, we should have `if`  $$W\_L < 1/2$$ `and` $$W\_G \le 1/2$$ as well as `elseif` $$W\_L \ge 1/2$$. The text preceeding it should be altered accordingly.
{% endhint %}

### 9-4 Small order statistics

**Available in the latest revision of the IM.**

{% hint style="info" %}
The idea in part (a) is essentially simple, as depicted in the following example with array `[5,2,7,6,1,9,3,8,4,0]`, $$n=10$$ and $$i=2$$. The mappings substitute the need to simultaneously apply reorderings in the section containing larger elements from pairwise comparisons (as instructed in the solution from the IM). Such cascading updates are difficult to implement.

```
Level 1: Recursive case
Splitting: 5 2 7 6 1 9 3 8 4 0 -> 9 3 8 6 1 | 5 2 7 4 0
Mappings: (5,9), (2,3), (7,8), (4,6), (0,1)

    Level 2: Recursive case
    Splitting: 5 2 7 4 0 -> 7 4 | 5 2 0
    Mappings: (5,7), (2,4), (0,0)
	
		    Level 3: Base case
	      Select({5,2,0}, i=2) -> {0,2}
	
    Select({0, 2, 4}, i=2) -> {0,2}
	
Select({0,2,3,1}, i=2) -> {0,1}
Return: 1
```

{% endhint %}

### 9-5 Alternative analysis of randomized selection

**Available in the latest revision of the IM.**

### 9-6 Select with groups of 3

#### a.

We follow the template from the book for any odd constant $$k > 3$$. The number $$g$$ of groups is at most $$n/k$$. There are at least $$\lceil k/2 \rceil(\lfloor g/2 \rfloor +1) \ge \lceil k/2 \rceil g/2$$ elements greater than or equal to the pivot, and at least $$\lceil k/2 \rceil \lceil g/2 \rceil \ge \lceil k/2 \rceil g/2$$ elements less than or equal to the pivot. This leaves at most

$$
kg-\lceil k/2 \rceil g/2 \le n-\frac{n\lceil k/2 \rceil}{2k}=n \cdot \frac{2k-\lceil k/2 \rceil}{2k}
$$

elements in the recursive call. The recurrence becomes

$$
T(n) \le T(n/k) + T\left(n \cdot \frac{2k-\lceil k/2 \rceil}{2k}\right)+O(n).
$$

Since $$k$$ is odd, $$\lceil k/2 \rceil=(k+1)/2$$. We have

$$
\begin{align\*}
T(n) &\le T(n/k) + T\left(n \cdot \frac{4k-(k+1)}{4k}\right)+O(n) \\
&= T(n/k) + T\left(n \cdot \frac{3k-1}{4k}\right)+O(n).
\end{align\*}
$$

We prove that $$T(n)\le cn$$ using the substitution method for some constant $$c>0$$ and $$n>1$$.&#x20;

$$
\begin{align\*}
T(n) &\le c(n/k) + cn \cdot \frac{3k-1}{4k}+O(n) \\
&=cn \cdot \frac{4+3k-1}{4k}+O(n) \\
&=cn \cdot \frac{3k+3}{4k}+O(n) \\
&=cn-cn\left(1- \frac{3k+3}{4k}\right)+O(n) && \text{$\left(k>3 \implies \cfrac{3k+3}{4k} < 1\right)$} \\
&\le cn.
\end{align\*}
$$

We should choose $$c$$ to dominate the constant hidden in the $$O(n)$$. Therefore, $$T(n)=\Theta(n)$$.

#### b.

When $$k=3$$ the recurrence becomes $$T(n) \le T(n/3)+T(2n/3)+O(n)$$, which evidently resolves to $$T(n)=O(n\lg n)$$ using the Akra-Bazzi method with $$p=1$$ and a driving function $$f(n)=n$$.

#### c.

Read about the repeated step algorithm [here](https://arxiv.org/abs/1409.3600) (see Section 3 in the paper). It contains a visual representation of what is going on, explanation how it differs from grouping by 9, and contains an analysis of the running time.

{% hint style="info" %}
I highly recommend reading the full paper. It contains many interesting ideas and variants of the `SELECT` algorithm not mentioned in the book.
{% endhint %}

#### d.

See part (c).


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://evarga.gitbook.io/sh-intro-to-algs/part-ii-sorting-and-order-statistics/9.-medians-and-order-statistics.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.