Exploring data percentiles with Amazon redshift’s PERCENT_RANK() function

AWS Redshift @ Freshers.in

Amazon Redshift, a powerful data warehousing solution, offers a variety of analytical functions to help you gain insights into your data. One such function is PERCENT_RANK(). In this article, we will delve into the details of the PERCENT_RANK() function with practical examples. We will also explore where this function can be applied for valuable insights.

DDL Example: Creating a Sample Table

Let’s start by creating a sample table called freshers_in_webstats to work with.

CREATE TABLE freshers_in_webstats (
date DATE,
pageviews INT,
visitors INT
);

In this table, we have three columns: date, pageviews, and visitors, representing the date of the statistics, the number of pageviews, and the number of visitors, respectively.

DML Example: Populating Data

Next, let’s insert some sample data into our table to demonstrate the PERCENT_RANK() function effectively:

INSERT INTO freshers_in_webstats (date, pageviews, visitors)
VALUES
    ('2023-09-01', 1000, 500),
    ('2023-09-02', 1200, 600),
    ('2023-09-03', 1500, 750),
    ('2023-09-04', 800, 400),
    ('2023-09-05', 1300, 650);

Now, we have a dataset with five days of web statistics.

Understanding PERCENT_RANK()

The PERCENT_RANK() function is used to calculate the relative rank of a row within a result set as a percentage. It can provide insights into how a particular row’s value compares to others in the dataset.

The formula for PERCENT_RANK() is as follows:

PERCENT_RANK() OVER (ORDER BY column_name)

Practical Example: Using PERCENT_RANK()

Suppose we want to calculate the PERCENT_RANK() for the number of visitors each day in our freshers_in_webstats table. We can use the following SQL query:

SELECT
    date,
    visitors,
    PERCENT_RANK() OVER (ORDER BY visitors) AS visitors_percent_rank
FROM
    freshers_in_webstats;

In this query:

We select the date and visitors columns.
We apply the PERCENT_RANK() function over the visitors column, ordering the rows in ascending order based on the number of visitors.
We alias the result of PERCENT_RANK() as visitors_percent_rank.

Result Set:

    date       | visitors | visitors_percent_rank
--------------+----------+-----------------------
  2023-09-04  |   400    |        0.00
  2023-09-01  |   500    |        0.25
  2023-09-02  |   600    |        0.5
  2023-09-05  |   650    |        0.75
  2023-09-03  |   750    |        1.00

Read more on Redshift
Read more on Hive
Read more on Snowflake

Author: user

Leave a Reply