Recipes Dataset
The RecipeNLG dataset is available for download here. It contains 2.2 million recipes. The size is slightly less than 1 GB.
Download and Unpack the Dataset
- Go to the download page https://recipenlg.cs.put.poznan.pl/dataset.
- Accept Terms and Conditions and download zip file.
- Option: Using the
md5sum dataset.zip
to validate the zip file and it should be equal to3a168dfd0912bb034225619b3586ce76
. - Unpack the zip file with
unzip dataset.zip
. You will get thefull_dataset.csv
file in thedataset
directory.
Create a Table
Run clickhouse-client and execute the following CREATE query:
Insert the Data
Run the following command:
This is a showcase how to parse custom CSV, as it requires multiple tunes.
Explanation:
- The dataset is in CSV format, but it requires some preprocessing on insertion; we use table function input to perform preprocessing;
- The structure of CSV file is specified in the argument of the table function
input
; - The field
num
(row number) is unneeded - we parse it from file and ignore; - We use
FORMAT CSVWithNames
but the header in CSV will be ignored (by command line parameter--input_format_with_names_use_header 0
), because the header does not contain the name for the first field; - File is using only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and single quote must not be parsed as the string enclosing - that's why we also add the
--format_csv_allow_single_quote 0
parameter; - Some strings from CSV cannot parse, because they contain
\M/
sequence at the beginning of the value; the only value starting with backslash in CSV can be\N
that is parsed as SQL NULL. We add--input_format_allow_errors_num 10
parameter and up to ten malformed records can be skipped; - There are arrays for ingredients, directions and NER fields; these arrays are represented in unusual form: they are serialized into string as JSON and then placed in CSV - we parse them as String and then use JSONExtract function to transform it to Array.
Validate the Inserted Data
By checking the row count:
Query:
Result:
Example Queries
Top Components by the Number of Recipes:
In this example we learn how to use arrayJoin function to expand an array into a set of rows.
Query:
Result:
The Most Complex Recipes with Strawberry
Result:
In this example, we involve has function to filter by array elements and sort by the number of directions.
There is a wedding cake that requires the whole 126 steps to produce! Show that directions:
Query:
Result:
Online Playground
The dataset is also available in the Online Playground.