New York Taxi Data

The New York taxi data consists of 3+ billion taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009. The dataset can be obtained in a couple of ways:

  • insert the data directly into ClickHouse Cloud from S3 or GCS
  • download prepared partitions

Create the table trips

Start by creating a table for the taxi rides:
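A minimal sketch of such a table is shown below; the exact column list follows the TSV layout, so the names and types here (for example, pickup_ntaname for the pickup neighborhood) are assumptions rather than the canonical schema:

```sql
-- Sketch of the trips table; column names and types are assumptions
-- based on the fields discussed below (dates, coordinates, fares, neighborhoods).
CREATE TABLE trips
(
    trip_id             UInt32,
    pickup_datetime     DateTime,
    dropoff_datetime    DateTime,
    pickup_longitude    Nullable(Float64),
    pickup_latitude     Nullable(Float64),
    dropoff_longitude   Nullable(Float64),
    dropoff_latitude    Nullable(Float64),
    passenger_count     UInt8,
    trip_distance       Float32,
    fare_amount         Float32,
    tip_amount          Float32,
    tolls_amount        Float32,
    total_amount        Float32,
    payment_type        LowCardinality(String),
    pickup_ntaname      LowCardinality(String),
    dropoff_ntaname     LowCardinality(String)
)
ENGINE = MergeTree
PRIMARY KEY (pickup_datetime, dropoff_datetime);
```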

Load the Data directly from Object Storage

Let's grab a small subset of the data to get familiar with it. The data is stored in TSV files in object storage and can easily be streamed into ClickHouse Cloud using the s3 table function.

The same data is stored in both S3 and GCS; the example below reads from GCS, but either source works.

The following command streams three files from a GCS bucket into the trips table (the {0..2} syntax is a wildcard for the values 0, 1, and 2):
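A sketch of that insert follows; the bucket URL and file names are placeholders rather than the canonical location of the files:

```sql
-- Stream three gzipped TSV files from object storage into the trips table.
-- {0..2} expands to trips_0.gz, trips_1.gz and trips_2.gz; the URL is a placeholder.
INSERT INTO trips
SELECT *
FROM s3(
    'https://storage.googleapis.com/clickhouse-public-datasets/nyc-taxi/trips_{0..2}.gz',
    'TabSeparatedWithNames'
);
```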

Sample Queries

Let's see how many rows were inserted:
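A plain count over the trips table created above does it:

```sql
SELECT count() FROM trips;
```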

Each TSV file has about 1M rows, and the three files have 3,000,317 rows. Let's look at a few rows:
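Something along these lines works; Vertical output makes the wide rows easier to read:

```sql
-- Peek at a few rows of the trips table.
SELECT * FROM trips LIMIT 3 FORMAT Vertical;
```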

Notice there are columns for the pickup and dropoff dates, geo coordinates, fare details, New York neighborhoods, and more.

Let's run a few queries. This query shows us the top 10 neighborhoods that have the most frequent pickups:
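A sketch of that query, assuming the pickup neighborhood column is pickup_ntaname as in the table sketch above:

```sql
-- Top 10 pickup neighborhoods by number of trips.
SELECT
    pickup_ntaname,
    count(*) AS pickups
FROM trips
GROUP BY pickup_ntaname
ORDER BY pickups DESC
LIMIT 10;
```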

This query shows the average fare based on the number of passengers:
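A sketch, assuming total_amount holds the total fare for a trip:

```sql
-- Average fare by number of passengers.
SELECT
    passenger_count,
    avg(total_amount) AS average_fare
FROM trips
GROUP BY passenger_count
ORDER BY passenger_count;
```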

This query looks at the relationship between the number of passengers and the distance of the trip:
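One way to sketch this is to bucket trips by rounded distance (column names again assumed from the table sketch above):

```sql
-- Trip counts by passenger count and rounded trip distance.
SELECT
    passenger_count,
    round(trip_distance) AS distance,
    count(*) AS trips_count
FROM trips
GROUP BY passenger_count, distance
ORDER BY passenger_count, distance;
```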

Download of Prepared Partitions

Note

The following steps provide information about the original dataset and a method for loading prepared partitions into a self-managed ClickHouse server.

See https://github.com/toddwschneider/nyc-taxi-data and http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html for a description of the dataset and instructions for downloading it.

Downloading results in about 227 GB of uncompressed data in CSV files. The download takes about an hour over a 1 Gbit connection (parallel downloading from s3.amazonaws.com uses at least half of a 1 Gbit channel). Some of the files might not download fully; check the file sizes and re-download any that look doubtful.

Info

If you run the queries described below, you must use the full table name, datasets.trips_mergetree.
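For example, a quick sanity check after the prepared partitions are attached (assuming they land in the datasets database as stated above):

```sql
SELECT count() FROM datasets.trips_mergetree;
```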

Results on Single Server
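The four benchmark queries are not reproduced on this page; as a rough, assumed sketch (column names such as cab_type are placeholders, not confirmed by the text above), they aggregate along these lines:

```sql
-- Q1: trips per cab type (cab_type is an assumed column).
SELECT cab_type, count(*) FROM datasets.trips_mergetree GROUP BY cab_type;

-- Q2: average total amount per passenger count.
SELECT passenger_count, avg(total_amount)
FROM datasets.trips_mergetree GROUP BY passenger_count;

-- Q3: trips per passenger count and year.
SELECT passenger_count, toYear(pickup_datetime) AS year, count(*)
FROM datasets.trips_mergetree GROUP BY passenger_count, year;

-- Q4: trips per passenger count, year and rounded distance.
SELECT passenger_count, toYear(pickup_datetime) AS year,
       round(trip_distance) AS distance, count(*)
FROM datasets.trips_mergetree
GROUP BY passenger_count, year, distance
ORDER BY year, count(*) DESC;
```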

Q1: 0.490 seconds. Q2: 1.224 seconds. Q3: 2.104 seconds. Q4: 3.593 seconds.

The following server was used:

Two Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 16 physical cores total, 128 GiB RAM, 8x6 TB HD on hardware RAID-5

Execution times are the best of three runs. From the second run onward, queries read data from the file system cache, but no further caching occurs: the data is read and processed on every run.

Creating a table on three servers:

On each server:
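A sketch of the per-server table; in practice the full column list of trips_mergetree is repeated here (abbreviated below for illustration), and the database and table names are assumptions:

```sql
-- Local shard table, created on each of the three servers.
-- Column list abbreviated; in practice it mirrors trips_mergetree.
CREATE TABLE default.trips_mergetree_third
(
    pickup_date       Date,
    pickup_datetime   DateTime,
    dropoff_datetime  DateTime,
    passenger_count   UInt8,
    trip_distance     Float32,
    total_amount      Float32
    -- ... remaining trips_mergetree columns ...
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_datetime, dropoff_datetime);
```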

On the source server:
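And a sketch of the Distributed table on the source server; the cluster name perftest is a placeholder for whatever cluster is defined in remote_servers:

```sql
-- Distributed table that fans reads and writes out over the three shards.
-- Signature: Distributed(cluster, database, table, sharding_key).
CREATE TABLE trips_mergetree_x3 AS default.trips_mergetree_third
ENGINE = Distributed(perftest, default, trips_mergetree_third, rand());
```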

The following query redistributes data:
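The redistribution itself can be sketched as a plain INSERT ... SELECT through the Distributed table:

```sql
-- Reads from the single-server table and shards the rows across the cluster.
INSERT INTO trips_mergetree_x3 SELECT * FROM datasets.trips_mergetree;
```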

This takes 2454 seconds.

On three servers:

Q1: 0.212 seconds. Q2: 0.438 seconds. Q3: 0.733 seconds. Q4: 1.241 seconds.

No surprises here, since the queries scale linearly with the number of servers.

We also have the results from a cluster of 140 servers:

Q1: 0.028 sec. Q2: 0.043 sec. Q3: 0.051 sec. Q4: 0.072 sec.

In this case, query processing time is determined above all by network latency. We ran the queries from a client located in a different datacenter than the cluster, which added about 20 ms of latency.

Summary

| servers | Q1 (s) | Q2 (s) | Q3 (s) | Q4 (s) |
|---|---|---|---|---|
| 1, E5-2650v2 | 0.490 | 1.224 | 2.104 | 3.593 |
| 3, E5-2650v2 | 0.212 | 0.438 | 0.733 | 1.241 |
| 1, AWS c5n.4xlarge | 0.249 | 1.279 | 1.738 | 3.527 |
| 1, AWS c5n.9xlarge | 0.130 | 0.584 | 0.777 | 1.811 |
| 3, AWS c5n.9xlarge | 0.057 | 0.231 | 0.285 | 0.641 |
| 140, E5-2650v2 | 0.028 | 0.043 | 0.051 | 0.072 |