Datasets
esnfed.datasets provides synthetic reservoir-computing benchmarks, a bundled real-world series, and loaders to bring your own data.
Synthetic benchmarks
from esnfed import datasets
u, y = datasets.narma10(3000, rng=0) # NARMA-10 (memory + nonlinearity)
u, y = datasets.mackey_glass(3000, seed=0) # Mackey-Glass (mildly chaotic)
u, y = datasets.lorenz(3000, seed=0) # Lorenz x-coordinate (normalised)
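For reference, NARMA-10 is usually defined by the recurrence y(t+1) = 0.3 y(t) + 0.05 y(t) Σ_{i=0..9} y(t-i) + 1.5 u(t-9) u(t) + 0.1, with u drawn uniformly from [0, 0.5]. The sketch below implements that textbook recurrence in NumPy. narma10_sketch is a hypothetical helper for illustration only; datasets.narma10 may differ in details such as washout, scaling, or output shape.

```python
import numpy as np


def narma10_sketch(n_steps, seed=0):
    """Generate (u, y) from the textbook NARMA-10 recurrence (illustrative only)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, size=n_steps)  # input drawn from U[0, 0.5]
    y = np.zeros(n_steps)                    # first 10 outputs stay at zero
    for t in range(9, n_steps - 1):
        window = y[t - 9 : t + 1]            # y[t-9], ..., y[t]
        y[t + 1] = (
            0.3 * y[t]
            + 0.05 * y[t] * window.sum()     # 10-step memory term
            + 1.5 * u[t - 9] * u[t]          # nonlinear input interaction
            + 0.1
        )
    return u, y


u, y = narma10_sketch(3000)
```

The target depends on the last ten outputs and on an input from nine steps back, which is why the task probes both memory and nonlinearity. The plain recurrence can occasionally diverge for unlucky input draws, so it is worth checking the generated series before using it.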
Real data: counterparty risk
A real series ships with the package: the TED spread, the gap between the 3-month interbank (LIBOR) rate and the 3-month Treasury bill rate, a classic gauge of interbank/counterparty credit risk (daily, 1986–2022; source: FRED series TEDRATE).
u, y = datasets.load_ted_spread() # normalised one-step-ahead task
raw = datasets.load_ted_spread(raw=True) # the raw spread, in percentage points
Bring your own data
# any 1-D series -> one-step-ahead (or next-change) forecasting task
u, y = datasets.from_array(my_series, predict="next", normalize=True)
# a column of a CSV file
u, y = datasets.load_csv("prices.csv", column="close")
# any FRED series, downloaded on demand and cached
u, y = datasets.load_fred("BAMLH0A0HYM2") # US high-yield credit spread
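To make the "one-step-ahead" framing concrete, here is a minimal sketch of how a 1-D series can be turned into an input/target pair, assuming z-score normalisation and 1-D outputs. one_step_ahead_sketch and the "change" option name are hypothetical; the actual behaviour and return shapes of datasets.from_array may differ.

```python
import numpy as np


def one_step_ahead_sketch(series, predict="next", normalize=True):
    """Build (u, y) so that y[t] is what follows u[t] (illustrative only)."""
    x = np.asarray(series, dtype=float)
    if normalize:
        std = x.std()
        x = (x - x.mean()) / (std if std > 0 else 1.0)  # z-score, guard constant series
    u = x[:-1]                      # inputs: every value except the last
    if predict == "next":
        y = x[1:]                   # target: the next value
    elif predict == "change":       # hypothetical name for the next-change task
        y = x[1:] - x[:-1]          # target: the next increment
    else:
        raise ValueError(f"unknown predict mode: {predict!r}")
    return u, y


series = np.arange(5.0)
u_next, y_next = one_step_ahead_sketch(series, normalize=False)
u_chg, y_chg = one_step_ahead_sketch(series, predict="change", normalize=False)
```

Either way a series of length T yields T-1 input/target pairs, because the last observation has no successor to predict.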
High-dimensional multivariate (FRED panel)
load_fred_matrix aligns several FRED series on their common dates to form a multivariate task: forecast the next value of a target series from the whole panel. This exercises high-dimensional input scaling (a larger reservoir helps as the input dimension d grows).
# forecast counterparty risk (TED spread) from a 4-series financial panel
u, y = datasets.load_fred_matrix(["TEDRATE", "VIXCLS", "DGS10", "DFF"])
print(u.shape) # (T-1, 4) -> d_in = 4
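The alignment step can be sketched as an intersection of date indices followed by a one-step shift of the target column. This is only an illustration on synthetic data: align_on_common_dates and its explicit target argument are hypothetical, and how load_fred_matrix selects the target or handles missing values is not specified here.

```python
from functools import reduce

import numpy as np


def align_on_common_dates(series_by_name, target):
    """series_by_name maps name -> (dates, values); dates are sorted and unique."""
    names = list(series_by_name)
    # Keep only the dates present in every series.
    common = reduce(np.intersect1d, [series_by_name[n][0] for n in names])
    columns = []
    for n in names:
        dates, values = series_by_name[n]
        idx = np.searchsorted(dates, common)          # row of each common date
        columns.append(np.asarray(values, dtype=float)[idx])
    panel = np.column_stack(columns)                  # shape (T, d)
    u = panel[:-1]                                    # whole panel at time t
    y = panel[1:, names.index(target)]                # target series at time t+1
    return u, y


days = np.arange("2020-01-01", "2020-01-11", dtype="datetime64[D]")  # 10 days
panel_u, panel_y = align_on_common_dates(
    {
        "a": (days, np.arange(10.0)),             # every day
        "b": (days[2:], np.arange(8.0) + 100),    # starts two days late
        "c": (days[::2], np.arange(5.0) - 50),    # every other day
    },
    target="a",
)
print(panel_u.shape)
```

Only four dates are shared by all three toy series, so the aligned panel has T = 4 rows and the task has T-1 = 3 samples with d = 3 inputs each.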
Benchmark classification datasets
Two standard sequence-classification benchmarks, each with a natural client split, are downloaded on demand and cached. Each loader returns a SequenceDataset: X_* are lists of sequences of shape (T_i, n_features), y_* are integer labels, and groups_* give the federation unit of each sequence. See Sequence classification.
jv = datasets.load_japanese_vowels() # UCI 128: 9 speakers, 12-d cepstra
har = datasets.load_har() # UCI 240: 30 subjects, 6 activities, 9 channels
# one client per natural group (speaker / subject)
clients = datasets.group_clients(jv.X_train, jv.y_train, jv.groups_train)
| Loader | Task | Natural clients | Heterogeneity |
|---|---|---|---|
| load_japanese_vowels | speaker ID (9 classes) | 9 speakers (= labels) | extreme label skew |
| load_har | activity (6 classes) | 30 subjects | feature non-i.i.d. |
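Grouping by natural unit amounts to bucketing sequences by their group id. The sketch below shows the idea on toy data and why Japanese Vowels is an extreme case: when the group is the label, every client holds a single class. group_clients_sketch and its return format (a list of (X, y) pairs ordered by group id) are assumptions for illustration; datasets.group_clients may return a different structure.

```python
from collections import defaultdict


def group_clients_sketch(X, y, groups):
    """Return one (sequences, labels) pair per distinct group id (illustrative only)."""
    buckets = defaultdict(lambda: ([], []))
    for seq, label, g in zip(X, y, groups):
        buckets[g][0].append(seq)     # sequences owned by this client
        buckets[g][1].append(label)   # matching labels
    return [buckets[g] for g in sorted(buckets)]


# Toy data: four one-step, one-feature sequences from two "speakers".
X_toy = [[[0.1]], [[0.2]], [[0.3]], [[0.4]]]
y_toy = [0, 1, 0, 1]
groups_toy = [0, 1, 0, 1]   # group id equals label, as in speaker identification

toy_clients = group_clients_sketch(X_toy, y_toy, groups_toy)
classes_per_client = [sorted(set(labels)) for _, labels in toy_clients]
```

Here classes_per_client contains exactly one class per client, which is the label-skew pattern listed in the table for load_japanese_vowels.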
Splitting and partitioning
u_tr, y_tr, u_te, y_te = datasets.split(u, y, train_frac=0.7) # chronological
parts = datasets.partition_iid(u_tr, y_tr, n_clients=10) # contiguous blocks
Because the blocks are contiguous, each client's slice remains a valid time series for state harvesting. See the API reference.
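As a rough illustration of the two operations, a chronological split is a single cut along the time axis and a contiguous partition is a set of consecutive slices. The helpers below are hypothetical NumPy sketches, not the library's implementation; in particular, how datasets.split rounds the cut point and how datasets.partition_iid handles uneven block sizes are assumptions.

```python
import numpy as np


def chronological_split(u, y, train_frac=0.7):
    """Cut once along time: earlier samples train, later samples test."""
    n_train = int(round(len(u) * train_frac))
    return u[:n_train], y[:n_train], u[n_train:], y[n_train:]


def partition_contiguous(u, y, n_clients):
    """Give each client one consecutive block, preserving temporal order."""
    u_blocks = np.array_split(u, n_clients)
    y_blocks = np.array_split(y, n_clients)
    return list(zip(u_blocks, y_blocks))


u_all = np.arange(100.0)
y_all = u_all + 1.0                       # toy one-step-ahead target
u_tr, y_tr, u_te, y_te = chronological_split(u_all, y_all, train_frac=0.7)
parts = partition_contiguous(u_tr, y_tr, n_clients=10)
```

No shuffling happens at either step, so the test set lies strictly after the training set and every client block can be fed through a reservoir in order.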