A monorepo containing jupyter-blocks and jupyter-tidyblocks. Blockly extension for JupyterLab.
0
fork

Configure Feed

Select the types of activity you want to include in your feed.

Block Reference#

Complete reference for all blocks available in the Tidy Data toolbox. Blocks are organized by category, matching the sidebar in the Blockly editor.

Block names follow dplyr (tidyverse) conventions where an equivalent verb exists. The dplyr equivalent column shows the R function that inspired the block; a dash (—) means there is no direct dplyr analogue.


Data #FEBE4C#

Source blocks start a pipeline. They create a DataFrame stored in _df and have no top connector (nothing chains into them).

Block label Block type dplyr equivalent What it does Python generated
penguins dataset tidyblocks_data_penguins Palmer Penguins dataset loaded via seaborn _df = sns.load_dataset('penguins')
iris dataset tidyblocks_data_iris Fisher iris dataset: sepal/petal measurements for 3 species (via seaborn) _df = sns.load_dataset('iris')
titanic dataset tidyblocks_data_titanic Titanic passenger survival dataset (via seaborn) _df = sns.load_dataset('titanic')
gapminder dataset tidyblocks_data_gapminder Gapminder life expectancy / GDP dataset across countries and years (via plotly.express) _df = px.data.gapminder()
colors dataset tidyblocks_data_colors Built-in table of 11 colors with RGB values _df = pd.DataFrame({...})
earthquakes dataset tidyblocks_data_earthquakes 2016 global earthquake data from gvwilson/tidyblocks _df = pd.read_csv('<url>')
sequence 1 to N as col tidyblocks_data_sequence Integer sequence 1..N in a named column _df = pd.DataFrame({'col': range(1, N+1)})
dataset named name tidyblocks_data_user Reference a DataFrame previously saved with save as _df = name.copy()
read CSV path tidyblocks_data_csv Load a CSV file from a local or remote path _df = pd.read_csv('path')

Transform #76AADB#

Transform blocks read from and write back to _df. They can be chained in any order between a source block and a terminal block.

Row operations#

Block label Block type dplyr equivalent What it does Python generated
filter where cond tidyblocks_transform_filter filter() Keep only rows where the condition is True _df = _df[cond]
arrange by cols ↑↓ tidyblocks_transform_arrange arrange() Sort rows by one or more columns, ascending or descending _df = _df.sort_values(by=[...], ascending=True/False)
distinct by cols tidyblocks_transform_distinct distinct() Remove duplicate rows, keeping one per unique combination of columns _df = _df.drop_duplicates(subset=[...])
slice_head N rows tidyblocks_transform_slice_head slice_head() Keep the first N rows _df = _df.head(N)
slice_tail N rows tidyblocks_transform_slice_tail slice_tail() Keep the last N rows _df = _df.tail(N)
slice_sample N rows tidyblocks_transform_slice_sample slice_sample() Randomly sample N rows _df = _df.sample(n=N)
slice_min N rows by col tidyblocks_transform_slice_min slice_min() Keep the N rows with the smallest values in a column _df = _df.nsmallest(N, 'col')
slice_max N rows by col tidyblocks_transform_slice_max slice_max() Keep the N rows with the largest values in a column _df = _df.nlargest(N, 'col')
drop rows with missing in cols tidyblocks_transform_dropna — (tidyr::drop_na) Remove rows that have missing values in the specified columns _df = _df.dropna(subset=[...])

Column operations#

Block label Block type dplyr equivalent What it does Python generated
select columns cols tidyblocks_transform_select select() Keep only the named columns _df = _df[[...]]
drop columns cols tidyblocks_transform_drop select(-col) Remove the named columns _df = _df.drop(columns=[...])
mutate col = expr tidyblocks_transform_mutate mutate() Add a new column or overwrite an existing one with an expression _df = _df.assign(**{'col': expr})
rename old to new tidyblocks_transform_rename rename() Rename a single column _df = _df.rename(columns={'old': 'new'})
relocate cols before/after anchor tidyblocks_transform_relocate relocate() Move one or more columns to a new position relative to an anchor column reorders _df.columns
fill missing in col with val tidyblocks_transform_fillna — (tidyr::replace_na) Replace missing values in a column with a given value _df = _df.assign(**{'col': _df['col'].fillna(val)})

Grouping & aggregation#

Block label Block type dplyr equivalent What it does Python generated
group by cols tidyblocks_transform_groupby group_by() Group rows by the values in one or more columns for use with summarize or running _df = _df.groupby([...], as_index=False)
ungroup tidyblocks_transform_ungroup ungroup() Remove grouping and reset the row index _df = _df.reset_index(drop=True)
summarize fn of col as result tidyblocks_transform_summarize summarize() Aggregate each group (or the whole DataFrame) to a single row using count / sum / mean / median / min / max / std / var / n distinct / any / all _df = _df.agg(**{'result': ('col', 'fn')}).reset_index()
count by cols tidyblocks_transform_count count() Count rows for each unique combination of the specified columns _df = _df.groupby([...], as_index=False).size().rename(columns={'size': 'n'})
running fn of col as result tidyblocks_transform_running — (window fns) Compute a cumulative operation (cumsum / cummax / cummin / cummean / row index) across rows _df = _df.assign(**{'result': _df['col'].cumsum()}) etc.

Utilities#

Block label Block type dplyr equivalent What it does Python generated
bin col into N buckets as result tidyblocks_transform_bin — (cut()) Discretize a numeric column into N equal-width interval buckets _df = _df.assign(**{'result': pd.cut(_df['col'], bins=N).astype(str)})
save as name tidyblocks_transform_saveas Copy the current DataFrame into a named Python variable for later use with dataset named name = _df.copy()
display table tidyblocks_transform_display Render the current DataFrame as an HTML table in the output cell display(_df)

Combine #808080#

Combine blocks merge the current _df with a second DataFrame that was previously saved using save as.

Mutating joins (add columns from the right table)#

Block label Block type dplyr equivalent What it does Python generated
inner/left/right/outer join other on left col = right col tidyblocks_combine_join inner_join() / left_join() / right_join() / full_join() Join two DataFrames on matching key columns. Choose inner (only matching rows), left (all left rows), right (all right rows), or outer (all rows from both) _df = pd.merge(_df, other, left_on='lk', right_on='rk', how='...')
cross join with other tidyblocks_combine_cross_join cross_join() Cartesian product — every row in _df paired with every row in other _df = _df.merge(other, how='cross')

Filtering joins (keep/remove rows based on a match, no new columns)#

Block label Block type dplyr equivalent What it does Python generated
semi join other on left col = right col tidyblocks_combine_semi_join semi_join() Keep only rows in _df that have a matching key in other. No columns from other are added. _df = _df[_df['lk'].isin(other['rk'])]
anti join other on left col = right col tidyblocks_combine_anti_join anti_join() Keep only rows in _df that have no matching key in other _df = _df[~_df['lk'].isin(other['rk'])]

Binding (stack or glue tables together)#

Block label Block type dplyr equivalent What it does Python generated
bind_rows with other label column src tidyblocks_combine_bind_rows bind_rows() Vertically stack _df on top of other, adding a label column to identify the source of each row _df = pd.concat([_df.assign(src='left'), other.assign(src='right')]).reset_index(drop=True)
bind_cols with other tidyblocks_combine_bind_cols bind_cols() Horizontally bind _df and other by column position. Both tables must have the same number of rows. _df = pd.concat([_df, other], axis=1)

Plot #A4C588#

Plot blocks are terminal — they render a chart and have no bottom connector. All plots use Plotly Express.

Block label Block type dplyr equivalent What it does Python generated
bar chart x col y col tidyblocks_plot_bar Vertical bar chart px.bar(_df, x='col', y='col')
box plot x col y col tidyblocks_plot_box Box-and-whisker plot showing median, IQR, and outliers px.box(_df, x='col', y='col')
dot plot x col tidyblocks_plot_dot Strip/dot plot — one point per row along an axis px.strip(_df, x='col')
histogram of col bins N tidyblocks_plot_histogram Frequency histogram with N bins px.histogram(_df, x='col', nbins=N)
scatter plot x col y col color col trendline ☐ tidyblocks_plot_scatter Scatter plot with optional color grouping and OLS trendline px.scatter(_df, x=..., y=..., color=..., trendline=...)
line chart x col y col color col tidyblocks_plot_line Line chart with optional color grouping px.line(_df, x=..., y=..., color=...)
violin plot x col y col tidyblocks_plot_violin Violin plot showing the distribution shape px.violin(_df, x='col', y='col')
correlation heatmap tidyblocks_plot_heatmap Heatmap of pairwise Pearson correlations between all numeric columns px.imshow(_df.corr())

Stats #BA93DB#

Stats blocks are terminal — they print results and have no bottom connector. All stats use scipy.stats and scikit-learn.

Block label Block type dplyr equivalent What it does Python generated
one-sample t-test column col vs mean μ tidyblocks_stats_ttest_one Two-sided one-sample t-test: tests whether the column mean equals μ. Prints t-statistic and p-value. stats.ttest_1samp(_df['col'], μ)
two-sample t-test groups in group_col values in val_col tidyblocks_stats_ttest_two Two-sided two-sample t-test: splits rows into two groups and tests whether their means differ stats.ttest_ind(group_a, group_b)
k-means x col y col k N label col tidyblocks_stats_kmeans K-means clustering on two columns; adds a cluster label column to _df KMeans(n_clusters=N).fit_predict(...)
silhouette score x col y col labels col score col tidyblocks_stats_silhouette Computes the silhouette coefficient for existing cluster labels; adds a score column silhouette_score(X, labels)
Pearson/Spearman/Kendall correlation of col_a and col_b tidyblocks_stats_correlation Computes pairwise correlation coefficient and p-value between two columns stats.pearsonr / spearmanr / kendalltau
describe tidyblocks_stats_describe Prints DataFrame.describe() — count, mean, std, min, quartiles, max for every numeric column display(_df.describe())

Values #E7553C#

Value blocks are expression blocks — they produce a value and connect into input slots on transform or operation blocks. They do not have statement connectors.

Block label Block type dplyr equivalent What it does Python generated
column col tidyblocks_value_column Reference a DataFrame column by name _df['col']
number tidyblocks_value_number A numeric literal 0, 3.14, etc.
"text" tidyblocks_value_text A string literal 'text'
true / false tidyblocks_value_logical A boolean literal True / False
date YYYY-MM-DD tidyblocks_value_datetime A date/time constant pd.Timestamp('YYYY-MM-DD')
missing tidyblocks_value_missing NA An explicit missing (NA/NaN) value float("nan")
Normal(mean μ std σ) tidyblocks_value_normal Draw a column of values from a Normal distribution np.random.normal(μ, σ, len(_df))
Uniform(low a high b) tidyblocks_value_uniform Draw a column of values from a Uniform distribution np.random.uniform(a, b, len(_df))
Exponential(lambda λ) tidyblocks_value_exponential Draw a column of values from an Exponential distribution np.random.exponential(1/λ, len(_df))

Operations #F9B5B2#

Operation blocks are expression blocks used inside filter, mutate, fill missing, and similar blocks. They take value inputs and return a computed value.

Numeric & comparison#

Block label Block type dplyr equivalent What it does Python generated
a + b, a - b, a × b, a ÷ b, a % b, a ** b tidyblocks_op_arithmetic Standard arithmetic on two values (a + b), (a * b), etc.
a = b, a ≠ b, a < b, a ≤ b, a > b, a ≥ b tidyblocks_op_compare Element-wise comparison, returning a boolean column (a == b), (a < b), etc.
x between left and right tidyblocks_op_between between() Return True for values within the inclusive range [left, right] x.between(left, right)
abs / round / floor / ceil / sqrt / log / exp ( val ) tidyblocks_op_math Apply a standard math function to a column val.abs(), np.sqrt(val), etc.

Logic#

Block label Block type dplyr equivalent What it does Python generated
a AND b / a OR b tidyblocks_op_logic Element-wise logical AND/OR on two boolean columns (a & b), (a | b)
NOT val tidyblocks_op_not Element-wise logical NOT ~(val)
if cond then x else y tidyblocks_op_ifelse if_else() Return x where cond is True, y elsewhere np.where(cond, x, y)
coalesce val with replacement tidyblocks_op_coalesce coalesce() Replace missing values in val with values from replacement val.fillna(replacement)

Type operations#

Block label Block type dplyr equivalent What it does Python generated
val is missing / is number / is text / is date / is boolean tidyblocks_op_typecheck Test whether each element matches a specific type val.isna(), val.apply(isinstance(...)), etc.
convert val to number / text / bool / datetime tidyblocks_op_convert Cast a column to a different type pd.to_numeric(val), val.astype(str), etc.

Date & time#

Block label Block type dplyr equivalent What it does Python generated
year / month / day / weekday / hour / minute / second of val tidyblocks_op_datetime Extract a calendar component from a datetime column val.dt.year, val.dt.month, etc.

Window & ranking#

Block label Block type dplyr equivalent What it does Python generated
shift val by N tidyblocks_op_shift lag() / lead() Shift values forward (positive N = lag) or backward (negative N = lead) val.shift(N)
n_distinct val tidyblocks_op_n_distinct n_distinct() Count the number of distinct (unique) values in a column val.nunique()

String#

Block label Block type dplyr equivalent What it does Python generated
val . upper / lower / strip / length tidyblocks_op_string Apply a string operation to a text column val.str.upper(), val.str.len(), etc.
val contains pattern tidyblocks_op_str_contains stringr::str_detect() Return True where the string column matches a pattern val.str.contains('pattern', na=False)

Pipeline rules#

Block role Has top connector Has bottom connector Examples
Source No Yes all Data blocks
Transform Yes Yes filter, mutate, arrange, …
Terminal Yes No display table, all Plot blocks, all Stats blocks

A valid pipeline must be: one source → zero or more transforms → one terminal.