Download

Functionality to easily download data to your environment.

0. BaseIO

Downloaders and Submitters share a set of common methods. BaseIO implements this shared functionality and serves as the foundation for two abstract base classes: BaseDownloader and BaseSubmitter (the latter is implemented in the submission section).


source

BaseIO

 BaseIO (directory_path:str)

Basic functionality for IO (downloading and uploading).

:param directory_path: Base folder for IO. Will be created if it does not exist.
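
As a rough illustration of the behaviour described above, here is a minimal sketch of a BaseIO-like class. SketchIO is a hypothetical stand-in, not the library's implementation; it only mirrors the attribute and method names used elsewhere on this page (dir, get_all_files, is_empty, remove_base_directory), and listing only the top level of the folder is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


class SketchIO:
    """Hypothetical stand-in for BaseIO (assumption, not the real class)."""

    def __init__(self, directory_path: str):
        self.dir = Path(directory_path)
        # Create the base folder (and any parents) if it does not exist yet.
        self.dir.mkdir(parents=True, exist_ok=True)

    @property
    def get_all_files(self) -> list:
        """Everything directly inside the base folder (top level only)."""
        return list(self.dir.iterdir())

    @property
    def is_empty(self) -> bool:
        """True when the base folder contains nothing."""
        return not any(self.dir.iterdir())

    def remove_base_directory(self) -> None:
        """Delete the base folder and everything in it."""
        shutil.rmtree(self.dir)


# Exercise the sketch inside a throwaway temporary location.
base = Path(tempfile.mkdtemp()) / "sketch_io_demo"
io = SketchIO(str(base))
created_empty = io.is_empty
(io.dir / "test.txt").write_text("test")
file_names = [p.name for p in io.get_all_files]
io.remove_base_directory()
removed = not base.exists()
```

Exposing get_all_files and is_empty as properties matches how they are accessed in the test cells on this page (without parentheses).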

1. BaseDownloader

BaseDownloader is an object which implements logic common to all downloaders.

To implement a new Downloader, inherit from BaseDownloader and implement at least the .download_training_data and .download_inference_data methods.


source

BaseDownloader

 BaseDownloader (directory_path:str)

Abstract base class for downloaders.

:param directory_path: Base folder to download files to.
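
The contract described above can be sketched with Python's abc module. SketchDownloader is a hypothetical stand-in for BaseDownloader, and declaring both methods as abstract is an assumption made for illustration; the sketch shows that a subclass missing one of the two methods cannot be instantiated.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchDownloader(ABC):
    """Hypothetical stand-in for BaseDownloader (assumption)."""

    def __init__(self, directory_path: str):
        self.dir = Path(directory_path)
        self.dir.mkdir(parents=True, exist_ok=True)

    @abstractmethod
    def download_training_data(self, *args, **kwargs):
        """Download all data needed for training."""

    @abstractmethod
    def download_inference_data(self, *args, **kwargs):
        """Download all data needed for inference."""


class IncompleteDownloader(SketchDownloader):
    # Only one of the two required methods is implemented.
    def download_training_data(self, *args, **kwargs):
        return "training"


try:
    # Instantiation fails before __init__ runs, so no folder is created.
    IncompleteDownloader("never_created")
    instantiation_failed = False
except TypeError:
    instantiation_failed = True
```

Failing at instantiation time surfaces a missing method immediately, rather than at the first download call.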

2. Numerai Classic


source

NumeraiClassicDownloader

 NumeraiClassicDownloader (directory_path:str, *args, **kwargs)

WARNING: Versions 1 and 2 (legacy data) are deprecated. Only version 3 and later are supported.

Downloading from NumerAPI for Numerai Classic data.

:param directory_path: Base folder to download files to.

All *args, **kwargs will be passed to NumerAPI initialization.

test_dir_classic = "test_numclassic_general"
numer_classic_downloader = NumeraiClassicDownloader(test_dir_classic)

# Test building class
assert isinstance(numer_classic_downloader.dir, PosixPath)
assert numer_classic_downloader.dir.is_dir()

# Test is_empty
(numer_classic_downloader.dir / "test.txt").write_text("test")
rich_print(f"Directory contents:\n{numer_classic_downloader.get_all_files}")
assert not numer_classic_downloader.is_empty

# Downloading example data
numer_classic_downloader.download_example_data("test/", version="4.1")

# Features
feature_stats_test = numer_classic_downloader.get_classic_features()
assert isinstance(feature_stats_test, dict)
assert len(feature_stats_test["feature_sets"]["medium"]) == 641

# Remove contents
numer_classic_downloader.remove_base_directory()
assert not os.path.exists(test_dir_classic)
No existing directory found at 'test_numclassic_general'. Creating directory...
Directory contents:
[Path('test_numclassic_general/test.txt')]
📁 Downloading 'v4.1/live_example_preds.parquet' 📁
2023-03-14 17:28:11,220 INFO numerapi.utils: starting download
test_numclassic_general/test/live_example_preds.parquet: 135kB [00:00, 542kB/s]
📁 Downloading 'v4.1/validation_example_preds.parquet' 📁
2023-03-14 17:28:12,139 INFO numerapi.utils: starting download
test_numclassic_general/test/validation_example_preds.parquet: 57.6MB [00:02, 26.4MB/s]
📁 Downloading 'v4.1/features.json' 📁
2023-03-14 17:28:15,216 INFO numerapi.utils: starting download
test_numclassic_general/features.json: 703kB [00:00, 1.69MB/s]
⚠ Deleting directory for 'NumeraiClassicDownloader' ⚠
Path: '/home/clepelaars/numerblox/nbs/test_numclassic_general'

2.1. Example usage

This section will explain how to quickly get started with NumeraiClassicDownloader.

The more advanced use case of working with GCS (Google Cloud Storage) is discussed in edu_nbs/google_cloud_storage.ipynb.

2.1.1. Training data

Training and validation data for Numerai Classic can be downloaded in effectively two lines of code. Feature stats and a feature overview can be retrieved with .get_classic_features().

# Initialization
train_base_directory = "test_numclassic_train"
numer_classic_downloader = NumeraiClassicDownloader(train_base_directory)

# Uncomment line below to download training and validation data
# numer_classic_downloader.download_training_data("train_val", int8=False)

# Get feature overview (dict)
numer_classic_downloader.get_classic_features()

# Remove contents (To clean up environment)
numer_classic_downloader.remove_base_directory()
No existing directory found at 'test_numclassic_train'. Creating directory...
📁 Downloading 'v4.1/features.json' 📁
2023-03-14 17:28:17,711 INFO numerapi.utils: starting download
test_numclassic_train/features.json: 703kB [00:00, 1.65MB/s]                           
⚠ Deleting directory for 'NumeraiClassicDownloader' ⚠
Path: '/home/clepelaars/numerblox/nbs/test_numclassic_train'

For the training example the directory structure will be:

📁 test_numclassic_train (base_directory)
┣━━ 📄 features.json
┗━━ 📁 train_val
    ┣━━ 📄 numerai_training_data.parquet
    ┗━━ 📄 numerai_validation_data.parquet
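
Once features.json is in the base directory, a feature set can be read from it directly. The sketch below assumes only the structure exercised in the test cell earlier on this page (a top-level "feature_sets" key containing a "medium" list); load_feature_set is a hypothetical helper, and the toy file with its "small" set and feature names is made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_feature_set(features_path, set_name="medium"):
    """Return the list of feature names for one feature set."""
    with open(features_path) as f:
        feature_stats = json.load(f)
    return feature_stats["feature_sets"][set_name]


# Toy stand-in for features.json; the real file is far larger.
toy = {
    "feature_sets": {
        "small": ["feature_a"],
        "medium": ["feature_a", "feature_b"],
    }
}
toy_path = Path(tempfile.mkdtemp()) / "features.json"
toy_path.write_text(json.dumps(toy))

medium_features = load_feature_set(toy_path)
```

A list like this can be passed to the columns argument of pd.read_parquet to load only the selected features, which may help keep memory usage down.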

2.1.2. Inference data

Inference data for the most recent round of Numerai Classic can be downloaded in effectively two lines of code. It can also easily be deleted once you are done with inference by calling .remove_base_directory.

# Initialization
inference_base_dir = "test_numclassic_inference"
numer_classic_downloader = NumeraiClassicDownloader(directory_path=inference_base_dir)

# Download tournament (inference) data
numer_classic_downloader.download_inference_data("inference", version="4.1", int8=True)

# Download meta model predictions
numer_classic_downloader.download_meta_model_preds("inference")

# Remove folder when done with inference
numer_classic_downloader.remove_base_directory()
No existing directory found at 'test_numclassic_inference'. Creating directory...
📁 Downloading 'v4.1/live_int8.parquet' 📁
2023-03-14 17:28:19,878 INFO numerapi.utils: starting download
test_numclassic_inference/inference/live_int8.parquet: 4.49MB [00:00, 6.51MB/s]
📁 Downloading 'v4.1/meta_model.parquet' 📁
2023-03-14 17:28:21,155 INFO numerapi.utils: starting download
test_numclassic_inference/inference/meta_model.parquet: 20.0MB [00:01, 17.0MB/s]
⚠ Deleting directory for 'NumeraiClassicDownloader' ⚠
Path: '/home/clepelaars/numerblox/nbs/test_numclassic_inference'

For the inference example the directory structure will be:

📁 test_numclassic_inference (base_directory)
┗━━ 📁 inference
    ┣━━ 📄 meta_model.parquet
    ┗━━ 📄 live_int8.parquet

3. KaggleDownloader (Numerai Signals)

The Numerai community maintains some excellent datasets on Kaggle for Numerai Signals.

For example, Katsu1110 maintains a dataset with yfinance price data on Kaggle that is updated daily. KaggleDownloader allows you to easily pull data through the Kaggle API. We will be using this dataset in an example below.

In this case, download_inference_data and download_training_data have the same functionality as we can’t make the distinction beforehand for an arbitrary dataset on Kaggle.


source

KaggleDownloader

 KaggleDownloader (directory_path:str)

Download awesome financial data from Kaggle.

For authentication, make sure your home directory contains a directory called .kaggle with a kaggle.json file in it. kaggle.json should have the following structure:

{"username": USERNAME, "key": KAGGLE_API_KEY}

More info on authentication: github.com/Kaggle/kaggle-api#api-credentials

More info on the Kaggle Python API: kaggle.com/donkeys/kaggle-python-api

:param directory_path: Base folder to download files to.
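
Before the first download it can be useful to verify that the credentials file described above is in place. The following is a minimal sketch; kaggle_credentials_ok is a hypothetical helper that is not part of this library or of the Kaggle API, and it only checks that the file is valid JSON with the two expected keys.

```python
import json
import tempfile
from pathlib import Path


def kaggle_credentials_ok(path=None):
    """Return True if `path` is a JSON file with 'username' and 'key'."""
    if path is None:
        # Default location described in the docstring above.
        path = Path.home() / ".kaggle" / "kaggle.json"
    path = Path(path)
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(creds, dict) and {"username", "key"}.issubset(creds)


# Demonstrate on throwaway files instead of the real home directory.
demo_dir = Path(tempfile.mkdtemp())
good = demo_dir / "kaggle.json"
good.write_text(json.dumps({"username": "USERNAME", "key": "KAGGLE_API_KEY"}))
bad = demo_dir / "incomplete.json"
bad.write_text(json.dumps({"username": "USERNAME"}))

good_ok = kaggle_credentials_ok(good)
bad_ok = kaggle_credentials_ok(bad)
missing_ok = kaggle_credentials_ok(demo_dir / "does_not_exist.json")
```

The check does not validate the key against Kaggle; it only catches a missing or malformed file early.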

Katsu1110’s yfinance price dataset is available at https://www.kaggle.com/code1110/yfinance-stock-price-data-for-numerai-signals. The slug that follows kaggle.com (code1110/yfinance-stock-price-data-for-numerai-signals) is passed as the argument to .download_training_data. The full Kaggle dataset is then downloaded and unzipped.
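
If you start from a dataset URL rather than a slug, the slug can be derived with the standard library. This is a small sketch; kaggle_slug is a hypothetical helper, and the handling of a leading "datasets" path segment is an assumption about other Kaggle URL forms rather than something this page documents.

```python
from urllib.parse import urlparse


def kaggle_slug(url: str) -> str:
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumption: some Kaggle URLs carry a leading 'datasets' segment.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"No dataset slug found in {url!r}")
    return "/".join(parts[:2])


slug = kaggle_slug(
    "https://www.kaggle.com/code1110/yfinance-stock-price-data-for-numerai-signals"
)
```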

home_directory = "test_kaggle_downloader"
kd = KaggleDownloader(home_directory)
kd.download_training_data("code1110/yfinance-stock-price-data-for-numerai-signals")
No existing directory found at 'test_kaggle_downloader'. Creating directory...

This Kaggle dataset contains one file called "full_data.parquet".

list(kd.dir.iterdir())
[Path('test_kaggle_downloader/full_data.parquet')]
df = pd.read_parquet(f"{home_directory}/full_data.parquet")
df.head(2)
      ticker      date       close    raw_close         high          low         open     volume
0  000060 KS  20020103  534.924377  1248.795166  1248.795166  1248.795166  1248.795166        0.0
1  000060 KS  20020104  566.944519  1323.546997  1363.121460  1213.617798  1275.178223  3937763.0

The folder can be cleaned up when you are done with inference.

kd.remove_base_directory()
⚠ Deleting directory for 'KaggleDownloader' ⚠
Path: '/home/clepelaars/numerblox/nbs/test_kaggle_downloader'

4. EODDownloader

EOD Historical Data is an affordable financial data API that offers a large range of global stock tickers, which makes it very convenient for Numerai Signals modeling. We will use a Python API built on top of EOD Historical Data to download stock ticker data for training and inference.


source

EODDownloader

 EODDownloader (directory_path:str, key:str, tickers:list,
                frequency:str='d')

Download data from EOD historical data.

More info: https://eodhistoricaldata.com/

:param directory_path: Base folder to download files to.

:param key: Valid EOD client key.

:param tickers: List of valid EOD tickers (Bloomberg ticker format).

:param frequency: Choose from [d, w, m].

Daily data by default.

key = BaseDownloader._load_json("test_assets/keys.json")['eod_key'] # YOUR_EOD_KEY_HERE
eodd = EODDownloader(directory_path="eod_test", key=key, tickers=['AAPL.US', 'MSFT.US', 'COIN.US', 'NOT_A_TICKER'])
No existing directory found at 'eod_test'. Creating directory...

If no starting date is passed to download_training_data, this downloader uses the earliest date available. That is why the starting date in the filename is the Unix epoch (January 1, 1970).

eodd.download_inference_data()
eodd.download_training_data()
⚠ WARNING: Date pull failed on ticker: 'NOT_A_TICKER'. ⚠ Exception: 404 Client Error: Not Found for url: https://eodhistoricaldata.com/api/eod/NOT_A_TICKER?period=d&to=2023-03-14&fmt=json&api_token=621661e8653533.21413374&from=2022-03-14
⚠ WARNING: Date pull failed on ticker: 'NOT_A_TICKER'. ⚠ Exception: 404 Client Error: Not Found for url: https://eodhistoricaldata.com/api/eod/NOT_A_TICKER?period=d&to=2023-03-14&fmt=json&api_token=621661e8653533.21413374&from=1970-01-01
today = dt.now().strftime("%Y%m%d")
df = pd.read_parquet(f"eod_test/eod_19700101_{today}.parquet")
df.head(2)
             open   high    low  close  adjusted_close  volume   ticker
date
2021-04-09  381.0  381.0  381.0  381.0           250.0       0  COIN.US
2021-04-12  381.0  381.0  381.0  381.0           250.0       0  COIN.US
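
The filename pattern used in the cell above (eod_<start>_<end>.parquet with dates in YYYYMMDD form) can be sketched as follows. eod_filename is a hypothetical helper written for illustration; the pattern and the epoch default are inferred from the example on this page, not taken from the library source.

```python
from datetime import date, datetime


def eod_filename(start=None, end=None) -> str:
    """Build a filename of the form 'eod_<start>_<end>.parquet'."""
    # Fall back to the Unix epoch when no starting date is given.
    start = start or date(1970, 1, 1)
    # Fall back to today's date when no end date is given.
    end = end or datetime.now().date()
    return f"eod_{start:%Y%m%d}_{end:%Y%m%d}.parquet"


name = eod_filename(end=date(2023, 3, 14))
```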

Live data with a custom starting date can be retrieved as a NumerFrame directly with get_live_data. The starting date can be either in datetime, pd.Timestamp or string format.

live_dataf = eodd.get_live_data(start=pd.Timestamp(year=2021, month=1, day=1))
live_dataf.head(2)
⚠ WARNING: Date pull failed on ticker: 'NOT_A_TICKER'. ⚠ Exception: 404 Client Error: Not Found for url: https://eodhistoricaldata.com/api/eod/NOT_A_TICKER?period=d&to=2023-03-14&fmt=json&api_token=621661e8653533.21413374&from=2021-01-01+00%3A00%3A00
             open   high    low  close  adjusted_close  volume   ticker
date
2021-04-09  381.0  381.0  381.0  381.0           250.0       0  COIN.US
2021-04-12  381.0  381.0  381.0  381.0           250.0       0  COIN.US
live_dataf[live_dataf['ticker'] == "AAPL.US"]['close'].plot(figsize=(15, 6), title="AAPL from January 2021");
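
Because the starting date may arrive as a datetime, a pd.Timestamp or a string, it can help to normalize it to a single date-only form before using it. The sketch below is one possible approach and not the library's own logic; normalize_start is a hypothetical helper. It relies on the fact that pd.Timestamp is a subclass of datetime, so the datetime branch also covers it.

```python
from datetime import date, datetime


def normalize_start(start) -> str:
    """Return `start` as a 'YYYY-MM-DD' string."""
    if isinstance(start, str):
        # Parse ISO-style strings such as '2021-01-01' and drop any time part.
        return datetime.fromisoformat(start).strftime("%Y-%m-%d")
    if isinstance(start, (datetime, date)):
        # Covers datetime, date and datetime subclasses such as pd.Timestamp.
        return start.strftime("%Y-%m-%d")
    raise TypeError(f"Unsupported start type: {type(start).__name__}")


from_string = normalize_start("2021-01-01")
from_datetime = normalize_start(datetime(2021, 1, 1, 0, 0, 0))
```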

eodd.remove_base_directory()
⚠ Deleting directory for 'EODDownloader' ⚠
Path: '/home/clepelaars/numerblox/nbs/eod_test'

5. Custom Downloader

We invite the Numerai Community to implement new downloaders for this project using interesting APIs.

These are especially important for creating innovative Numerai Signals models.

A new Downloader can be created by inheriting from BaseDownloader. You should implement methods for .download_inference_data and .download_training_data so every downloader has a common interface. Below you will find a template for a new downloader.


source

AwesomeCustomDownloader

 AwesomeCustomDownloader (directory_path:str)

TEMPLATE - Download awesome financial data from who knows where.

:param directory_path: Base folder to download files to.
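
To make the template more concrete, here is a self-contained sketch of a custom downloader. Everything in it is hypothetical: SketchDownloader stands in for BaseDownloader, and ToyPriceDownloader reads from an in-memory list instead of a real API. In an actual implementation you would inherit from BaseDownloader and replace the fake data with calls to your data source.

```python
import csv
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class SketchDownloader(ABC):
    """Hypothetical stand-in for BaseDownloader (assumption)."""

    def __init__(self, directory_path):
        self.dir = Path(directory_path)
        self.dir.mkdir(parents=True, exist_ok=True)

    @abstractmethod
    def download_training_data(self, *args, **kwargs):
        """Download all data needed for training."""

    @abstractmethod
    def download_inference_data(self, *args, **kwargs):
        """Download all data needed for inference."""


class ToyPriceDownloader(SketchDownloader):
    """Writes rows from a made-up in-memory 'API' to CSV files."""

    FAKE_API = [
        {"ticker": "AAA", "date": "2023-03-13", "close": 10.0},
        {"ticker": "AAA", "date": "2023-03-14", "close": 10.5},
    ]

    def _write(self, rows, filename):
        # Shared helper so both public methods save data the same way.
        path = self.dir / filename
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["ticker", "date", "close"])
            writer.writeheader()
            writer.writerows(rows)
        return path

    def download_training_data(self, filename="train.csv"):
        # Training uses the full history.
        return self._write(self.FAKE_API, filename)

    def download_inference_data(self, filename="live.csv"):
        # Inference only needs the most recent row.
        return self._write(self.FAKE_API[-1:], filename)


downloader = ToyPriceDownloader(Path(tempfile.mkdtemp()) / "toy_downloader")
train_path = downloader.download_training_data()
live_path = downloader.download_inference_data()
live_lines = live_path.read_text().splitlines()
```

Routing both public methods through one private helper keeps the common interface while avoiding duplicated saving logic.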