Ethereum Transactions Exploratory Data Analysis
My first attempt at analyzing Ethereum transactions. Only scratching the surface!
- Summary
- Next steps
- The details (for the technically minded)
- Analysis: Year-Over-Year (YOY) transactions 1 to 7 July 2021 vs 1 to 7 July 2022
- Analysis: Where did the drop in transaction come from?
Summary
-
Looking at the number of Ethereum transactions from 1 to 7 July 2022, there was a 14% drop on 6 July 2022 compared to 6 July 2021,Year-Over-Year. (See chart 2).
-
Focussing on 6 July 2022, of the top 50 addresses with the most transactions, we can see 3 categories of addresses: miners, crypto exchanges and no label (See chart 4).
"No label" are user(s) or organization(s) where based on their addresses, the name/label unidentified in etherscan. -
Although there was a 14% drop in number of transactions on 6 July 2022, there was an increase of almost 9% in the number of ETH per transaction. (See table)
ETH is the crypto currency used for payment on the Ethereum network (more explanation here).-
This could be due to the lower price of ETH on 6 July 2022.
The closing price of ETH on 6 July 2022 was about USD$1,186 (AUD$1,744) while on 6 July 2021, about USD$2,323 (AUD$3,097).
(Historical prices from Messari.) -
Also, this seems to be the case when we examine a longer period, see "ETH Per Transaction 2022 vs 2021 Last 3 Months", the 3rd chart in my Dune dashboard.
- On 13 June 2022, ETH per transaction was almost 7 compared to 13 June 2021, which was at 1.
The closing price of ETH on 13 June 2022, was at USD$1,208 (AUD$1,726) while on 13 June 2021, was at USD$2,509 (AUD$3,258).
- On 13 June 2022, ETH per transaction was almost 7 compared to 13 June 2021, which was at 1.
-
Next steps
-
There are 2 types of Ethereum accounts:
Externally owned (controlled by anyone with private keys) vs Contract (smart contract). -
What would transactions/value of ETH look like for them as we approach The Merge?
The details (for the technically minded)
Load the libraries
Go to top
import os
import json
import pandas as pd
import numpy as np
import altair as alt
# Set dataframe display behaviours
pd.set_option('display.max_columns', 100, 'display.max_rows', 100, 'display.max_colwidth', 100, 'display.float_format', '{:.4f}'.format)
Read the data files
- Details of the data extraction using the Alchemy API into JSON format are not shown here. (If interested in the steps, please let me know).
- In below file directory, the JSON format files were already formatted into Pandas dataframe and saved as separate parquet files based on block numbers.
Dataset 1: Ethereum transactions
- Due to memory and storage limitation of the Jupyter free instance on the cloud, only raw transactions from 1 July 2022 to 7 July 2022 and 6 July 2021 were extracted from the API of a remote node provider (Alchemy).
PATH = '../data/'
file_dir = os.listdir(PATH)
file_list = [os.path.join(PATH, file) for file in file_dir if file.startswith('df_')]
file_list
In order to work within the memory (RAM) limit, the files are read one by one and appended to a single dataframe.
df = pd.concat([pd.read_parquet(file) for file in file_list])
Format data types and add new column
Convert type to categorical data type
df['type'] = df['type'].astype('category')
Add eth_value column
Convert the value field which by default is in Wei (the smallest unit of Ether) to Ether.
1 ETH = 10^18 Wei
1 ETH = 10^9 Gwei
Reference: https://gwei.io/
df['eth_value'] = df['value'].div(10**18)
Dataset 2: Aggregated transaction counts
- This dataset is a total count of transactions per day rather than raw transactions (a dataset where each row is a transaction like above).
- The data is also extracted using Alchemy's API.
- Due to memory (RAM) limitation on this Jupyter instance, this dataset allows the analysis of transaction counts for selected dates in July 2021 and compare it to July 2022.
block_txn = json.load(open('../data/raw/block_txn_counts.json'))
df_txn_raw = pd.DataFrame(block_txn)
df_txn = df_txn_raw.copy()
Convert block_timestamp to datetime and create block_date field
df_txn['block_timestamp'] = df_txn['block_timestamp'].astype('datetime64[s]')
df_txn['block_date'] = df_txn['block_timestamp'].astype('datetime64[D]')
Dataset 3: Address names (labels)
- The dataset contains corresponding name (label) of the address of a sender.
- The name (label) of addresses were manually sourced from https://etherscan.io/ and saved as a Pandas dataframe.
- The names are not an exhaustive list and selected based on the need of the analysis below.
df_labels = pd.read_parquet(f'{PATH}address_labels.parquet')
Inspect dataset 1: Ethereum transactions
- Due to memory and storage limitation of the Jupyter free instance on the cloud, only raw transactions from 1 July 2022 to 7 July 2022 and 6 July 2021 were extracted from the API of a remote node provider (Alchemy).
Total number of rows and columns in the dataset
- Rows: 9,699,876
- Columns: 18
df.shape
df.info()
Data fields
-
hash- Hash of the transaction -
blockHash- Hash of the block -
blockNumber- Block number -
from- Address of the sender -
gas- Gas provided by the sender -
gasPrice- Gas price provided by the sender in Wei -
input- The data sent along with the transaction. Commonly used as part of contract interaction or as a message sent to the recipient. -
nonce- The number of transactions made by the sender prior to this one -
r(excluded to reduce memory usage) - The Elliptic Curve Digital Signature Algorithm (ECDSA) signature r. The standardised R field of the signature. See: https://openethereum.github.io/JSONRPC -
s(excluded to reduce memory usage) - The Elliptic Curve Digital Signature Algorithm (ECDSA) signature s. The standardised R field of the signature. See: https://openethereum.github.io/JSONRPC -
to- Address of the receiver -
transactionIndex- Integer of the transactions index position in the block -
type- Overtime, Ethereum Improvement Proposals (EIPs) have changed what a valid transaction looks like, whilst maintaining a high level of backwards compatibility with other transaction types. Today, the main types of transactions are legacy transactions and typed transactions (i.e. Type 1, Type 2).
See https://mycelium.xyz/research/the-journey-of-an-ethereum-transaction/.
Also see https://docs.dune.com/data-tables/data-tables/raw-data/ethereum-data#ethereum.transactions.- 0: 'legacy'
- 1: 'accessList/Type 1'
- 2: 'DynamicFee/Type 2'
-
v(excluded to reduce memory usage) - The Elliptic Curve Digital Signature Algorithm (ECDSA) recovery id. The standardised V field of the signature. See: https://openethereum.github.io/JSONRPC -
value- The amount of ether transferred in Wei. 1 ETH = 10^18 Wei. 1 ETH = 10^9 Gwei. -
accessList(excluded to reduce memory usage) - Contains addresses and storage keys that will be accessed. These are fields from legacy transactions, EIP-2930. See: https://openethereum.github.io/JSONRPC -
chainId- Value used in replay-protected transaction signing as introduced by EIP-155 -
maxFeePerGas- The maximum fee per gas the transaction sender is willing to pay total (introduced by EIP1559). For detailed explanation, refer to https://docs.alchemy.com/alchemy/guides/eip-1559/maxpriorityfeepergas-vs-maxfeepergas -
maxPriorityFeePerGas- The maximum fee per gas the transaction sender is willing to pay total (introduced by EIP1559) Refer to https://docs.alchemy.com/alchemy/guides/eip-1559/maxpriorityfeepergas-vs-maxfeepergas -
block_timestamp- Timestamp of the block
References:
https://docs.dune.com/data-tables/data-tables/raw-data/ethereum-data#ethereum.transactions
The Google "bigquery-public-data.crypto_ethereum.transactions" column description
https://ethereum.org/en/developers/docs/apis/json-rpc/
https://docs.alchemy.com/alchemy/apis/ethereum/eth-gettransactionbyhash
The number of transactions in dataset 1 by date
- Raw transactions for 30 June and 8 July 2022 are incomplete due to storage limitation and the focus of this analysis is the first 7 days of July 2022 and 2021.
- The total number of transactions for below dates (except 30 June and 8 July 2022) matches those data from Dune (see "Ethereum Transactions 2022 vs 2021 Last 3 Months" chart in dashboard).
df.groupby(['block_date']).size()
Sample rows from dataset 1
df.sample(3)
Aggregate dataset 2 (the transaction counts data)
df_txn_daily = df_txn.query('block_date.between("2021-07-01","2021-07-07", inclusive="both") or block_date.between("2022-07-01","2022-07-07", inclusive="both")')
df_txn_daily = df_txn_daily.groupby(['block_date']).agg(transactions=('block_transactions','sum'))
df_txn_daily.reset_index(inplace=True)
df_txn_daily['block_year'] = df_txn_daily['block_date'].dt.year
Examine the first 5 rows
df_txn_daily.head()
alt.Chart(df_txn_daily).mark_line().encode(
x=alt.X('monthdate(block_date):O', title='Month-Day'),
y=alt.Y('transactions:Q', title='Transactions'),
color=alt.Color('block_year:N', scale=alt.Scale(range=['saddlebrown','steelblue']), legend=alt.Legend(title='year')),
tooltip=[alt.Tooltip('block_year', title='Year'), alt.Tooltip('transactions', format=',', title='Transactions'), alt.Tooltip('monthdate(block_date)', title='Month-Day')],
).properties(
title='Chart 1 - A decline in Ethereum transactions on 6 and 7 July 2022 compared to 2021',
height=500,
width=700
)
Calculate the Year-Over-Year (YOY)
df_txn_daily_pct = df_txn_daily.set_index('block_date').drop(columns='block_year').pct_change(periods=7).dropna()
df_txn_daily_pct.rename(columns={'transactions':'pct_change'}, inplace=True)
Examine the first 5 rows
df_txn_daily_pct.head()
alt.Chart(df_txn_daily_pct.reset_index()).mark_bar().encode(
x=alt.X('monthdate(block_date):O', title='Month-Day'),
y=alt.Y('pct_change:Q', axis=alt.Axis(format='%', title='Year-Over-Year 2022 vs 2021 % Difference')),
color=alt.condition(
alt.datum.pct_change > 0,
alt.value("steelblue"),
alt.value("orange")),
tooltip=[alt.Tooltip('monthdate(block_date)', title='Month-Day'), alt.Tooltip('pct_change', title='2022 vs 2021 % Difference', format='.2%')]
).properties(
height=500,
width=700,
title='Chart 2 - 14% decline in Ethereum transactions on 6 July 2022'
)
Identify Top 50 sender addresses with the most transactions on 6 July 2022 and 2021
- The name (label) of addresses were manually sourced from etherscan and saved as dataset 3 (see above).
- Note that an organization, for e.g. a crypto exchange can have multiple addresses.
- For e.g. compare this address 0x3cd751e6b0078be393132286c442345e5dc49699 and 0xb5d85cbf7cb3ee0d56b3bb207d5fc4b82f43f511 in etherscan.
Filter dataset 1 (Ethereum transactions) to 6 July 2022 and 6 July 2021 and add address names (labels)
df_agg = df.query('block_date.isin(["2021-07-06","2022-07-06"])').groupby(['from','block_date']).agg(total_transactions=('hash','count'),total_value_eth=('eth_value','sum'))
df_agg.reset_index(inplace=True)
df_agg['year'] = df_agg['block_date'].dt.year
df_agg = df_agg.merge(df_labels, left_on='from', right_on='address', how='left')
df_agg.drop(columns='address', inplace=True)
#Create additional column to group the addresses into "miners", "exchanges" and "no label"
cat_map = {'2Miners: PPLNS':'miner','BeePool':'miner','Binance':'exchange','BlockFi':'exchange','Coinbase':'exchange',
'Crypto.com':'exchange','Ethermine':'miner','F2Pool Old':'miner','FTX Exchange':'exchange','Flexpool.io':'miner',
'Gate.io':'exchange','Gemini':'exchange','Hiveon':'miner','Hiveon Pool':'miner','KuCoin':'exchange',
'Nanopool':'miner','No Label':'no label','Spark Pool: Distributor':'miner'}
#Categorize the addresses into higher level category
df_agg['address_category'] = df_agg['address_label'].map(cat_map).astype('category')
Example Top 10 from addresses sorted by total transactions for 6 July 2021
- Below, we can see that there are multiple
fromaddresses belonging to the same organization/user. - So, we will group them into a single address label by doing another aggregation.
df_agg.query('block_date == "2021-07-06"').sort_values(by='total_transactions', ascending=False).iloc[:10,:]
Aggregate the address labels by category (miners, exchanges, no label) for 6 July 2021 and 6 July 2022
df_agg_category = df_agg.sort_values(by='total_transactions', ascending=False).iloc[:50,:].groupby(['address_category','year']).agg({'total_transactions':'sum','total_value_eth':'sum'})
Create Year-Over-Year calculations and pivot the aggregated category dataset
df_agg_category_pivot = df_agg_category.pivot_table(index='address_category', columns='year')
df_agg_category_pivot.columns = ['total_transactions_2021','total_transactions_2022','total_value_eth_2021','total_value_eth_2022']
#Create additional calculations
df_agg_category_pivot['total_transactions_pct_diff'] = df_agg_category_pivot['total_transactions_2022']/df_agg_category_pivot['total_transactions_2021']-1
df_agg_category_pivot['total_value_eth_pct_diff'] = df_agg_category_pivot['total_value_eth_2022']/df_agg_category_pivot['total_value_eth_2021']-1
df_agg_category_pivot['value_per_transaction_2021'] = df_agg_category_pivot['total_value_eth_2021']/df_agg_category_pivot['total_transactions_2021']
df_agg_category_pivot['value_per_transaction_2022'] = df_agg_category_pivot['total_value_eth_2022']/df_agg_category_pivot['total_transactions_2022']
df_agg_category_pivot['value_per_txn_pct_diff'] = df_agg_category_pivot['value_per_transaction_2022']/df_agg_category_pivot['value_per_transaction_2021']-1
#Replace infinite values with NaN
df_agg_category_pivot.replace(np.inf, np.nan, inplace=True)
alt.Chart(df_agg_category_pivot.reset_index()).mark_bar().encode(
x=alt.X('total_transactions_pct_diff:Q', title='6 July 2022 vs 6 July 2021 Transactions % Difference', axis=alt.Axis(format='%')),
y=alt.Y('address_category', sort='x', title='Address Category'),
size=alt.Size('total_transactions_2021:Q', title='6 July 2021 Transaction Volume', scale=alt.Scale(range=[0,30])),
color=alt.condition(
alt.datum.total_transactions_pct_diff > 0,
alt.value("steelblue"),
alt.value("orange")),
tooltip=[alt.Tooltip('address_category', title='Address Category'),
alt.Tooltip('total_transactions_pct_diff', title='6 July 2022 vs 2021 % Diff', format='.2%'),
alt.Tooltip('total_transactions_2022', format=',', title='6 July 2022 Transactions'),
alt.Tooltip('total_transactions_2021', format=',', title='6 July 2021 Transactions')]
).properties(
height=500,
width=700,
title='Chart 3 - Users with unidentified address labels showed a 73% decrease in transactions for 6 July 2022 vs 2021'
)
Aggregate The top 50 addresses by their address names (labels) for 6 July 2021 and 6 July 2022
- As mentioned above, there can be multiple addresses from the same organization/user. So we will group them into a single address name by doing another aggregation.
df_agg_label = df_agg.sort_values(by='total_transactions', ascending=False).iloc[:50,:].groupby(['address_label','year']).agg({'total_transactions':'sum','total_value_eth':'sum'})
Create Year-Over-Year calculations and pivot the dataset
df_agg_label_pivot = df_agg_label.pivot_table(index='address_label', columns='year')
df_agg_label_pivot.columns = ['total_transactions_2021','total_transactions_2022','total_value_eth_2021','total_value_eth_2022']
df_agg_label_pivot.reset_index(inplace=True)
df_agg_label_pivot.fillna(0, inplace=True)
#Create additional calculations
df_agg_label_pivot['total_transactions_pct_diff'] = df_agg_label_pivot['total_transactions_2022']/df_agg_label_pivot['total_transactions_2021']-1
df_agg_label_pivot['total_value_eth_pct_diff'] = df_agg_label_pivot['total_value_eth_2022']/df_agg_label_pivot['total_value_eth_2021']-1
df_agg_label_pivot['value_per_transaction_2021'] = df_agg_label_pivot['total_value_eth_2021']/df_agg_label_pivot['total_transactions_2021']
df_agg_label_pivot['value_per_transaction_2022'] = df_agg_label_pivot['total_value_eth_2022']/df_agg_label_pivot['total_transactions_2022']
df_agg_label_pivot['value_per_txn_pct_diff'] = df_agg_label_pivot['value_per_transaction_2022']/df_agg_label_pivot['value_per_transaction_2021']-1
#Replace infinite values with NaN
df_agg_label_pivot.replace(np.inf, np.nan, inplace=True)
df_agg_label_pivot.set_index('address_label', inplace=True)
Examine the first 5 rows
df_agg_label_pivot.head()
alt.Chart(df_agg_label_pivot.dropna(subset='total_transactions_pct_diff').reset_index()).mark_bar().encode(
x=alt.X('total_transactions_pct_diff:Q', title='6 July 2022 vs 6 July 2021 Transactions % Difference', axis=alt.Axis(format='%')),
y=alt.Y('address_label', sort='x', title='Addresses'),
size=alt.Size('total_transactions_2021:Q', title='6 July 2021 Transaction Volume', scale=alt.Scale(range=[0,30])),
color=alt.condition(
alt.datum.total_transactions_pct_diff > 0,
alt.value("steelblue"),
alt.value("orange")),
tooltip=[alt.Tooltip('address_label', title='Address Name'),
alt.Tooltip('total_transactions_pct_diff', title='6 July 2022 vs 2021 % Diff', format='.2%'),
alt.Tooltip('total_transactions_2022', format=',', title='6 July 2022 Transactions'),
alt.Tooltip('total_transactions_2021', format=',', title='6 July 2021 Transactions')]
).properties(
height=600,
width=700,
title='Chart 4 - Out of the top 50 addresses only FTX (a crypto exchange), showed a 177% increase in transactions for 6 July 2022 vs 2021'
)
What does the above charts tell us?
- From the chart above, we can see that there are 3 groups: miners, crypto exchanges and No Label.
- Miners: BeePool, Hiveon Pool, Nanopool, Spark Pool, 2Miners, F2Pool and Ethermine.
- Exchanges: BlockFi, Gate.io, Gemini, KuCoin, Binance, Crypto.com, Coinbase and FTX.
- No Label: User(s) or organization(s) where there is no name/label given to their addresses in etherscan.
- The bars in orange, are from the Top 50 addresses contributing to the overall decline of transactions on 6 July 2022.
- For e.g., (if you hover on the bar for Binance), the number of transactions from Binance addresses dropped by 71% in 6 July 2022 compared to same date last year.
- Also, the thickness of the bar tells us that it has the highest volume of transactions on 6 July 2021, around 90K.
- There was a 177% increase in transactions from FTX, though the volume is relatively small compared to the rest.
If we exclude FTX and include the total Ethereum (ETH) transacted in the chart
df_agg_label_pivot_excl = df_agg_label_pivot.dropna(subset='total_transactions_pct_diff').query('total_transactions_pct_diff < 0').reset_index()
bar = alt.Chart(df_agg_label_pivot_excl).mark_bar().encode(
x=alt.X('total_transactions_pct_diff:Q', title='6 July 2022 vs 6 July 2021 Transactions and ETH Value % Difference', axis=alt.Axis(format='%')),
y=alt.Y('address_label', sort='x', title='Addresses'),
size=alt.Size('total_transactions_2021:Q', title='6 July 2021 Transaction Volume', scale=alt.Scale(range=[0,30])),
color=alt.condition(
alt.datum.total_transactions_pct_diff > 0,
alt.value("steelblue"),
alt.value("orange")),
tooltip=[alt.Tooltip('address_label', title='Address Name'),
alt.Tooltip('total_transactions_pct_diff', title='6 July 2022 vs 2021 % Diff', format='.2%'),
alt.Tooltip('total_transactions_2022', format=',', title='6 July 2022 Transactions'),
alt.Tooltip('total_transactions_2021', format=',', title='6 July 2021 Transactions')]
).properties(
height=500,
width=800
)
tick = alt.Chart(df_agg_label_pivot_excl).mark_tick(
color='slategray',
thickness=5,
size=40 * 0.7
).encode(
x=alt.X('total_value_eth_pct_diff', axis=alt.Axis(format='%')),
y=alt.Y('address_label', sort=alt.Sort(field='total_transactions_pct_diff', order='ascending'), title='Addresses'),
tooltip=[alt.Tooltip('address_label', title='Address Name'),
alt.Tooltip('total_value_eth_pct_diff', title='6 July 2022 vs 2021 ETH Transacted % Diff', format='.2%'),
alt.Tooltip('total_value_eth_2022', format=',', title='6 July 2022 Total ETH Transacted'),
alt.Tooltip('total_value_eth_2021', format=',', title='6 July 2021 Total ETH Transacted')]
).properties(
height=600,
width=700,
title='Chart 5 - Despite decrease in transactions, e.g. F2Pool Old, showed a 46% increase in number of ETH transacted on 6 July 2022 vs 2021'
)
bar + tick
What does the above chart tell us?
- (Hover on the grey tick mark for F2Pool Old). We can see that there is almost a 46% increase in the number of ETH transacted on 6 July 2022 for F2Pool Old, despite 42% decline in actual number of transactions.
- This meant that there was more ETH per transaction on 6 July 2022 and could be due to the lower price of ETH.
- The closing price of ETH on 6 July 2021 was about USD$2,323 (AUD$3,097) and on 6 July 2022 was about USD$1,186 (AUD$1,744). See historical prices from Messari.
- So, an equivalent of USD$2.9 mil (AUD$3.9 mil) was transacted on 6 July 2021 and USD$2.1 mil (AUD$3.2 mil) on 6 July 2022 by F2Pool Old.
df_overall_agg = df.query('block_date.isin(["2021-07-06","2022-07-06"])').groupby(['block_date']).agg(total_transactions=('hash','count'),total_value_eth=('eth_value','sum'))
df_overall_agg['eth_per_transaction'] = df_overall_agg['total_value_eth']/df_overall_agg['total_transactions']
df_overall_agg.groupby(['block_date']).agg({'total_transactions':'sum', 'total_value_eth':'sum', 'eth_per_transaction':'sum'})
What does the above table mean?
- From above table, we can see that eventhough there is a 14% drop in number of transactions on 6 July 2022 vs 6 July 2021, there is an increase of almost 9% in ETH per transaction. As mentioned, this could be due to the lower price of ETH on 6 July 2022.
- If we examine a longer period, see the "ETH Per Transaction 2022 vs 2021 Last 3 Months" chart in this Dune dashboard.
- On 13 June 2022, ETH per transaction was almost 7 compared to 13 July 2021, which was at 1.
- The price of ETH on 13 June 2022, was even lower, at USD$1,208 (AUD$1,726) while on 13 June 2021, was at USD$2,509 (AUD$3,258).