🗃️Data
The Data: Primary Collection, Analysis, and Processing
On-chain data is the information stored directly on a blockchain. It encompasses many types of information, such as addresses and transaction data, consensus and block data, token distribution, long- and short-term holders, exchange volumes, inflows/outflows, miner and validator activity, and more. On-chain data is transparent, immutable, and decentralized, meaning that once information is recorded on the blockchain it is practically impossible to alter or tamper with. This provides our analysis with a high level of trust and precision.
The CREAN Labs team uses a number of reputable on-chain data providers, well known in the crypto market, to collect data and train the VANGA Crypto model.
These platforms include Etherscan, Glassnode Studio, CryptoQuant, Nansen, Dune Analytics, Arkham Intelligence, IntoTheBlock, TokenTerminal, and a number of others. Glassnode Studio served as the primary source for data collection, as the platform provides the most comprehensive on-chain data and indicators.
Initially, the CREAN Labs team considered and examined more than 480 on-chain metrics and indicators from these resources.
In the course of further research, variable selection was performed and the 20 on-chain metrics showing the strongest correlation with the predicted value (the ETH price) were selected. Spearman's rank correlation coefficient was used as the measure of variable significance.
Spearman's correlation coefficient ρ is a special case of the Pearson correlation coefficient in which the values of the variables X and Y are replaced by their ranks x and y, i.e. their indices in the ordered sequences of values. Spearman correlation is less sensitive to large outliers because it works not with the values themselves but with their ranks.
Unlike the Pearson correlation, the Spearman correlation does not assume that either data set is normally distributed. Like other correlation coefficients, it ranges from -1 to +1, where 0 means no correlation. Correlations of -1 or +1 imply an exact monotonic relationship.
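For illustration, this ranking step can be sketched as follows. It is a minimal sketch only: the DataFrame layout, the `eth_price` target column name, the sign-agnostic scoring, and the top-20 cutoff are our assumptions for the example, not the team's exact pipeline.

```python
import pandas as pd
from scipy.stats import spearmanr

# df is a time-indexed DataFrame holding the candidate on-chain metrics plus the target column.
# For untied data, Spearman's rho equals 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
# where d_i is the difference between the ranks of the paired observations.
def rank_metrics_by_spearman(df: pd.DataFrame, target: str = "eth_price", top_n: int = 20) -> pd.Series:
    scores = {}
    for col in df.columns.drop(target):
        rho, _p_value = spearmanr(df[col], df[target], nan_policy="omit")
        scores[col] = abs(rho)  # rank by strength of the monotonic relationship, ignoring its sign
    return pd.Series(scores).sort_values(ascending=False).head(top_n)
```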
Next, a cross-correlation function (CCF) was employed for an initial comparison of variable importance.
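A sketch of such a lag comparison is shown below, again only as an illustration: the lag range and the use of a simple shifted correlation as the CCF estimate are assumptions, not the team's exact procedure.

```python
import pandas as pd

def ccf_peak(metric: pd.Series, target: pd.Series, max_lag: int = 30) -> tuple[int, float]:
    """Return the lag and correlation at which corr(metric_t, target_{t+lag}) peaks in absolute value."""
    best_lag, best_corr = 0, 0.0
    for lag in range(max_lag + 1):
        corr = metric.corr(target.shift(-lag))  # metric now vs. target `lag` steps ahead
        if pd.notna(corr) and abs(corr) > abs(best_corr):
            best_lag, best_corr = lag, corr
    return best_lag, best_corr
```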
Among the selected metrics are the following:
addresses_active_count
fees_gas_price_mean
fees_volume_mean
transactions_count
transactions_transfers_volume_mean
external/internal_contract_calls
SOPR (Spent Output Profit Ratio)
MVRV (Market Value to Realized Value Ratio)
and some other significant metrics.
In addition to the on-chain metrics, the CREAN Labs team has developed a number of its own in-house custom indicators and metrics with high predictive power.
These indicators were created analytically, based on a combination of raw on-chain metrics, macroeconomic indices (e.g. S&P 500, NASDAQ, DOW30, VIX, Crude Oil), global currency ratios (GBP/USD, JPY/USD, CHF/USD, CNY/USD, EUR/USD, and others), and crypto market ratios (such as the BTC price, Ethereum Layer 2 project token prices, and Ethereum liquid staking provider ratios).
Such complex aggregate indicators help CREAN Labs' researchers capture the bigger picture and find non-obvious interrelations and data patterns. Furthermore, combining on-chain data with real-world assets and economic indices provides a deeper understanding of ongoing processes that cannot be tracked with on-chain metrics alone.
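A rough sketch of how such an aggregate indicator might be assembled is given below. The inputs, column names, and weights are purely illustrative assumptions and do not represent the team's actual proprietary indicators.

```python
import pandas as pd

def zscore(s: pd.Series, window: int = 90) -> pd.Series:
    """Rolling z-score so series with different units become comparable."""
    return (s - s.rolling(window).mean()) / s.rolling(window).std()

def composite_indicator(df: pd.DataFrame) -> pd.Series:
    # df holds aligned daily series: an on-chain metric, an equity index,
    # a volatility index, and an FX rate. The weights are illustrative only.
    return (
        0.4 * zscore(df["addresses_active_count"])
        + 0.3 * zscore(df["sp500_close"])
        - 0.2 * zscore(df["vix_close"])      # risk-off pressure enters with a negative sign
        + 0.1 * zscore(df["eurusd_close"])
    )
```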
The evolution of our approach to data processing has included the following major steps:
Collecting, processing, and normalizing the initial data.
Conducting a series of experiments, each with different hyperparameters, to determine which are most suitable for our case.
Generating and testing hypotheses about metrics. Expanding the metric sets, relying not only on correlation as the main significance criterion and applying both computational and analytical approaches. Finding and testing different methods for determining variable significance and using combinations of such methods.
Interpreting and understanding the use of particular metrics based on domain logic rather than on their formal significance alone. Examining and supplementing certain metrics.
Assessing different metric-generation methods, significantly expanding the variable search space, and selecting useful and significant metrics. Adding new data providers.
Performing deeper preprocessing of metrics and target variables, and treating rates of change of variables as predictive factors (see the first sketch after this list).
Using shorter data time frames, down to 1 second, thereby increasing the amount of data and the learning capacity of the models.
Using different forecasting time frames (see the second sketch after this list).
Using more data: supplementing on-chain data with real-world asset information, macroeconomic indicators, market sentiment, and more.
Designing custom complex metrics with a high level of trust.
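Below is a minimal sketch of the rate-of-change preprocessing referenced above; the window lengths and column handling are illustrative assumptions rather than the team's actual configuration.

```python
import pandas as pd

def add_rate_of_change_features(df: pd.DataFrame, cols: list[str], windows=(1, 7, 30)) -> pd.DataFrame:
    """Append percentage-change columns so a model sees how fast each metric moves, not just its level."""
    out = df.copy()
    for col in cols:
        for w in windows:
            out[f"{col}_roc_{w}"] = df[col].pct_change(periods=w)
    return out
```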
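And a sketch of building targets for several forecasting horizons on a finely resampled price series; the 1-second resampling rule and the chosen horizons are assumptions made only for illustration.

```python
import pandas as pd

def build_multi_horizon_targets(price: pd.Series, rule: str = "1s", horizons=(60, 300, 3600)) -> pd.DataFrame:
    """Resample a DatetimeIndex-ed price to a finer grid and compute forward returns over several horizons (in seconds)."""
    resampled = price.resample(rule).last().ffill()
    targets = {
        f"fwd_return_{h}s": resampled.shift(-h) / resampled - 1.0  # return over the next h seconds
        for h in horizons
    }
    return pd.DataFrame(targets)
```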