update to include and harmonize all dplyr verbs · teonbrooks.com/jupyter-blocks@2a3734d

teonbrooks.com / jupyter-blocks


A monorepo containing jupyter-blocks and jupyter-tidyblocks. Blockly extension for JupyterLab.


update to include and harmonize all dplyr verbs

Teon L Brooks 3 months ago 2a3734d1 4ade5911

+618 -37

13 changed files


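.gitignore
CHANGELOG.md
docs/blocks-reference.md
docs/work-summary.md
jupyter_tidyblocks/labextension/package.json
packages/blockly-extension/src/index.ts
packages/tidyblocks/src/blocks/combine.ts
packages/tidyblocks/src/blocks/op.ts
packages/tidyblocks/src/blocks/transform.ts
packages/tidyblocks/src/generators/python/combine.ts
packages/tidyblocks/src/generators/python/op.ts
packages/tidyblocks/src/generators/python/transform.ts
packages/tidyblocks/src/toolbox.ts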

.gitignore


···
  jupyterlab_blockly/_version.py
  /.turbo
  /jupyter_tidyblocks/labextension/static
+ .yarn/install-state.gz

+30 -3

CHANGELOG.md


···

  ### New features

- - **`packages/tidyblocks`**: new monorepo package providing 50+ tidy-data analysis blocks organized into seven categories — Data, Transform, Combine, Plot, Stats, Value, and Op — with Python (pandas / plotly.express / scipy / sklearn) code generators
+ - **`packages/tidyblocks`**: new monorepo package providing 60+ tidy-data analysis blocks organized into seven categories — Data, Transform, Combine, Plot, Stats, Values, and Operations — with Python (pandas / plotly.express / scipy / sklearn) code generators
  - Exports `registerTidyblocks(registry)` for registering all blocks and the Tidy Data toolbox with any `IBlocklyRegistry` instance
+ - Block names aligned with [dplyr (tidyverse)](https://dplyr.tidyverse.org/) conventions; 7 blocks renamed and 10 new blocks added (see below)
+
+ #### dplyr alignment — renames
+
+ | Old | New | dplyr verb |
+ |---|---|---|
+ | create column | mutate | `mutate()` |
+ | sort by | arrange by | `arrange()` |
+ | unique by | distinct by | `distinct()` |
+ | first N rows | slice_head | `slice_head()` |
+ | last N rows | slice_tail | `slice_tail()` |
+ | sample N rows | slice_sample | `slice_sample()` |
+ | glue with | bind_rows with | `bind_rows()` |
+
+ #### dplyr alignment — new blocks
+
+ - **Transform**: `count()`, `relocate()`, `slice_min()`, `slice_max()`
+ - **Combine**: `semi_join()`, `anti_join()`, `bind_cols()`
+ - **Operations**: `between()`, `coalesce()`, `n_distinct()`
+ - **summarize** block: added `n distinct` aggregate function option

  ### Rebrand & metadata

···
  - `packages/tidyblocks/src/index.ts`: `registry.addToolbox` → `registry.registerToolbox` (correct method name on `IBlocklyRegistry`)
  - Root `tsconfig.json`: added `"lib": ["ES2020", "DOM"]` to resolve `Intl.ResolvedRelativeTimeFormatOptions` error from `@jupyterlab/coreutils`

+ ### Package manager
+
+ - Migrated from Yarn 4 to npm; `yarn.lock` / `.yarnrc.yml` / `.yarn/` removed; `"resolutions"` → `"overrides"`; `jlpm` replaced with `npm run` in all scripts
+ - `yarn.lock` added to `.gitignore` (regenerated as a build side-effect by `@jupyterlab/builder`'s bundled jlpm)
+
  ### Docs

- - `docs/jupyterlab-blockly_architecture.md`: full architecture document (refers to the upstream project as `jupyterlab-blockly`)
- - `docs/tidyblocks-features.md`: feature inventory and port plan from gvwilson/tidyblocks
+ - `docs/getting-started.md`: step-by-step guide for installing and testing the extension in JupyterLab
+ - `docs/architecture.md`: full architecture reference (package layout, data-flow, extension points)
+ - `docs/blocks-reference.md`: complete block reference with dplyr mapping, description, and generated Python for every block
+ - `docs/work-summary.md`: narrative summary of all engineering work done in this release
  - `docs/modernization-plan.md`: full modernization plan with phase-by-phase status
  - `README.md`: rewritten; credits Greg Wilson's tidyblocks and QuantStack/jupyterlab-blockly


+223

docs/blocks-reference.md


··· 1 1 + # Block Reference 2 2 + 3 3 + Complete reference for all blocks available in the **Tidy Data** toolbox. 4 4 + Blocks are organized by category, matching the sidebar in the Blockly editor. 5 5 + 6 6 + Block names follow [dplyr (tidyverse)](https://dplyr.tidyverse.org/) conventions 7 7 + where an equivalent verb exists. The **dplyr equivalent** column shows the R 8 8 + function that inspired the block; a dash (—) means there is no direct 9 9 + dplyr analogue. 10 10 + 11 11 + --- 12 12 + 13 13 + ## Data `#FEBE4C` 14 14 + 15 15 + Source blocks start a pipeline. They create a DataFrame stored in `_df` and 16 16 + have no top connector (nothing chains into them). 17 17 + 18 18 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 19 19 + |---|---|---|---|---| 20 20 + | penguins dataset | `tidyblocks_data_penguins` | — | Palmer Penguins dataset loaded via seaborn | `_df = sns.load_dataset('penguins')` | 21 21 + | colors dataset | `tidyblocks_data_colors` | — | Built-in table of 11 colors with RGB values | `_df = pd.DataFrame({...})` | 22 22 + | earthquakes dataset | `tidyblocks_data_earthquakes` | — | 2016 global earthquake data from gvwilson/tidyblocks | `_df = pd.read_csv('<url>')` | 23 23 + | sequence 1 to N as col | `tidyblocks_data_sequence` | — | Integer sequence 1..N in a named column | `_df = pd.DataFrame({'col': range(1, N+1)})` | 24 24 + | dataset named name | `tidyblocks_data_user` | — | Reference a DataFrame previously saved with **save as** | `_df = name.copy()` | 25 25 + | read CSV path | `tidyblocks_data_csv` | — | Load a CSV file from a local or remote path | `_df = pd.read_csv('path')` | 26 26 + 27 27 + --- 28 28 + 29 29 + ## Transform `#76AADB` 30 30 + 31 31 + Transform blocks read from and write back to `_df`. They can be chained in 32 32 + any order between a source block and a terminal block. 
33 33 + 34 34 + ### Row operations 35 35 + 36 36 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 37 37 + |---|---|---|---|---| 38 38 + | filter where cond | `tidyblocks_transform_filter` | `filter()` | Keep only rows where the condition is `True` | `_df = _df[cond]` | 39 39 + | arrange by cols ↑↓ | `tidyblocks_transform_arrange` | `arrange()` | Sort rows by one or more columns, ascending or descending | `_df = _df.sort_values(by=[...], ascending=True/False)` | 40 40 + | distinct by cols | `tidyblocks_transform_distinct` | `distinct()` | Remove duplicate rows, keeping one per unique combination of columns | `_df = _df.drop_duplicates(subset=[...])` | 41 41 + | slice_head N rows | `tidyblocks_transform_slice_head` | `slice_head()` | Keep the first N rows | `_df = _df.head(N)` | 42 42 + | slice_tail N rows | `tidyblocks_transform_slice_tail` | `slice_tail()` | Keep the last N rows | `_df = _df.tail(N)` | 43 43 + | slice_sample N rows | `tidyblocks_transform_slice_sample` | `slice_sample()` | Randomly sample N rows | `_df = _df.sample(n=N)` | 44 44 + | slice_min N rows by col | `tidyblocks_transform_slice_min` | `slice_min()` | Keep the N rows with the smallest values in a column | `_df = _df.nsmallest(N, 'col')` | 45 45 + | slice_max N rows by col | `tidyblocks_transform_slice_max` | `slice_max()` | Keep the N rows with the largest values in a column | `_df = _df.nlargest(N, 'col')` | 46 46 + | drop rows with missing in cols | `tidyblocks_transform_dropna` | — (`tidyr::drop_na`) | Remove rows that have missing values in the specified columns | `_df = _df.dropna(subset=[...])` | 47 47 + 48 48 + ### Column operations 49 49 + 50 50 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 51 51 + |---|---|---|---|---| 52 52 + | select columns cols | `tidyblocks_transform_select` | `select()` | Keep only the named columns | `_df = _df[[...]]` | 53 53 + | drop columns cols | `tidyblocks_transform_drop` | 
`select(-col)` | Remove the named columns | `_df = _df.drop(columns=[...])` | 54 54 + | mutate col = expr | `tidyblocks_transform_mutate` | `mutate()` | Add a new column or overwrite an existing one with an expression | `_df = _df.assign(**{'col': expr})` | 55 55 + | rename old to new | `tidyblocks_transform_rename` | `rename()` | Rename a single column | `_df = _df.rename(columns={'old': 'new'})` | 56 56 + | relocate cols before/after anchor | `tidyblocks_transform_relocate` | `relocate()` | Move one or more columns to a new position relative to an anchor column | reorders `_df.columns` | 57 57 + | fill missing in col with val | `tidyblocks_transform_fillna` | — (`tidyr::replace_na`) | Replace missing values in a column with a given value | `_df = _df.assign(**{'col': _df['col'].fillna(val)})` | 58 58 + 59 59 + ### Grouping & aggregation 60 60 + 61 61 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 62 62 + |---|---|---|---|---| 63 63 + | group by cols | `tidyblocks_transform_groupby` | `group_by()` | Group rows by the values in one or more columns for use with summarize or running | `_df = _df.groupby([...], as_index=False)` | 64 64 + | ungroup | `tidyblocks_transform_ungroup` | `ungroup()` | Remove grouping and reset the row index | `_df = _df.reset_index(drop=True)` | 65 65 + | summarize fn of col as result | `tidyblocks_transform_summarize` | `summarize()` | Aggregate each group (or the whole DataFrame) to a single row using count / sum / mean / median / min / max / std / var / n distinct / any / all | `_df = _df.agg(**{'result': ('col', 'fn')}).reset_index()` | 66 66 + | count by cols | `tidyblocks_transform_count` | `count()` | Count rows for each unique combination of the specified columns | `_df = _df.groupby([...], as_index=False).size().rename(columns={'size': 'n'})` | 67 67 + | running fn of col as result | `tidyblocks_transform_running` | — (window fns) | Compute a cumulative operation (cumsum / cummax / cummin / 
cummean / row index) across rows | `_df = _df.assign(**{'result': _df['col'].cumsum()})` etc. | 68 68 + 69 69 + ### Utilities 70 70 + 71 71 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 72 72 + |---|---|---|---|---| 73 73 + | bin col into N buckets as result | `tidyblocks_transform_bin` | — (`cut()`) | Discretize a numeric column into N equal-width interval buckets | `_df = _df.assign(**{'result': pd.cut(_df['col'], bins=N).astype(str)})` | 74 74 + | save as name | `tidyblocks_transform_saveas` | — | Copy the current DataFrame into a named Python variable for later use with **dataset named** | `name = _df.copy()` | 75 75 + | display table | `tidyblocks_transform_display` | — | Render the current DataFrame as an HTML table in the output cell | `display(_df)` | 76 76 + 77 77 + --- 78 78 + 79 79 + ## Combine `#808080` 80 80 + 81 81 + Combine blocks merge the current `_df` with a second DataFrame that was 82 82 + previously saved using **save as**. 83 83 + 84 84 + ### Mutating joins (add columns from the right table) 85 85 + 86 86 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 87 87 + |---|---|---|---|---| 88 88 + | inner/left/right/outer join other on left col = right col | `tidyblocks_combine_join` | `inner_join()` / `left_join()` / `right_join()` / `full_join()` | Join two DataFrames on matching key columns. 
Choose inner (only matching rows), left (all left rows), right (all right rows), or outer (all rows from both) | `_df = pd.merge(_df, other, left_on='lk', right_on='rk', how='...')` | 89 89 + | cross join with other | `tidyblocks_combine_cross_join` | `cross_join()` | Cartesian product — every row in `_df` paired with every row in `other` | `_df = _df.merge(other, how='cross')` | 90 90 + 91 91 + ### Filtering joins (keep/remove rows based on a match, no new columns) 92 92 + 93 93 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 94 94 + |---|---|---|---|---| 95 95 + | semi join other on left col = right col | `tidyblocks_combine_semi_join` | `semi_join()` | Keep only rows in `_df` that have a matching key in `other`. No columns from `other` are added. | `_df = _df[_df['lk'].isin(other['rk'])]` | 96 96 + | anti join other on left col = right col | `tidyblocks_combine_anti_join` | `anti_join()` | Keep only rows in `_df` that have **no** matching key in `other` | `_df = _df[~_df['lk'].isin(other['rk'])]` | 97 97 + 98 98 + ### Binding (stack or glue tables together) 99 99 + 100 100 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 101 101 + |---|---|---|---|---| 102 102 + | bind_rows with other label column src | `tidyblocks_combine_bind_rows` | `bind_rows()` | Vertically stack `_df` on top of `other`, adding a label column to identify the source of each row | `_df = pd.concat([_df.assign(src='left'), other.assign(src='right')]).reset_index(drop=True)` | 103 103 + | bind_cols with other | `tidyblocks_combine_bind_cols` | `bind_cols()` | Horizontally bind `_df` and `other` by column position. Both tables must have the same number of rows. | `_df = pd.concat([_df, other], axis=1)` | 104 104 + 105 105 + --- 106 106 + 107 107 + ## Plot `#A4C588` 108 108 + 109 109 + Plot blocks are terminal — they render a chart and have no bottom connector. 
110 110 + All plots use [Plotly Express](https://plotly.com/python/plotly-express/). 111 111 + 112 112 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 113 113 + |---|---|---|---|---| 114 114 + | bar chart x col y col | `tidyblocks_plot_bar` | — | Vertical bar chart | `px.bar(_df, x='col', y='col')` | 115 115 + | box plot x col y col | `tidyblocks_plot_box` | — | Box-and-whisker plot showing median, IQR, and outliers | `px.box(_df, x='col', y='col')` | 116 116 + | dot plot x col | `tidyblocks_plot_dot` | — | Strip/dot plot — one point per row along an axis | `px.strip(_df, x='col')` | 117 117 + | histogram of col bins N | `tidyblocks_plot_histogram` | — | Frequency histogram with N bins | `px.histogram(_df, x='col', nbins=N)` | 118 118 + | scatter plot x col y col color col trendline ☐ | `tidyblocks_plot_scatter` | — | Scatter plot with optional color grouping and OLS trendline | `px.scatter(_df, x=..., y=..., color=..., trendline=...)` | 119 119 + | line chart x col y col color col | `tidyblocks_plot_line` | — | Line chart with optional color grouping | `px.line(_df, x=..., y=..., color=...)` | 120 120 + | violin plot x col y col | `tidyblocks_plot_violin` | — | Violin plot showing the distribution shape | `px.violin(_df, x='col', y='col')` | 121 121 + | correlation heatmap | `tidyblocks_plot_heatmap` | — | Heatmap of pairwise Pearson correlations between all numeric columns | `px.imshow(_df.corr())` | 122 122 + 123 123 + --- 124 124 + 125 125 + ## Stats `#BA93DB` 126 126 + 127 127 + Stats blocks are terminal — they print results and have no bottom connector. 128 128 + All stats use [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html) 129 129 + and [scikit-learn](https://scikit-learn.org/). 
130 130 + 131 131 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 132 132 + |---|---|---|---|---| 133 133 + | one-sample t-test column col vs mean μ | `tidyblocks_stats_ttest_one` | — | Two-sided one-sample t-test: tests whether the column mean equals μ. Prints t-statistic and p-value. | `stats.ttest_1samp(_df['col'], μ)` | 134 134 + | two-sample t-test groups in group_col values in val_col | `tidyblocks_stats_ttest_two` | — | Two-sided two-sample t-test: splits rows into two groups and tests whether their means differ | `stats.ttest_ind(group_a, group_b)` | 135 135 + | k-means x col y col k N label col | `tidyblocks_stats_kmeans` | — | K-means clustering on two columns; adds a cluster label column to `_df` | `KMeans(n_clusters=N).fit_predict(...)` | 136 136 + | silhouette score x col y col labels col score col | `tidyblocks_stats_silhouette` | — | Computes the silhouette coefficient for existing cluster labels; adds a score column | `silhouette_score(X, labels)` | 137 137 + | Pearson/Spearman/Kendall correlation of col_a and col_b | `tidyblocks_stats_correlation` | — | Computes pairwise correlation coefficient and p-value between two columns | `stats.pearsonr / spearmanr / kendalltau` | 138 138 + | describe | `tidyblocks_stats_describe` | — | Prints `DataFrame.describe()` — count, mean, std, min, quartiles, max for every numeric column | `display(_df.describe())` | 139 139 + 140 140 + --- 141 141 + 142 142 + ## Values `#E7553C` 143 143 + 144 144 + Value blocks are expression blocks — they produce a value and connect into 145 145 + input slots on transform or operation blocks. They do not have statement 146 146 + connectors. 
147 147 + 148 148 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 149 149 + |---|---|---|---|---| 150 150 + | column col | `tidyblocks_value_column` | — | Reference a DataFrame column by name | `_df['col']` | 151 151 + | number | `tidyblocks_value_number` | — | A numeric literal | `0`, `3.14`, etc. | 152 152 + | "text" | `tidyblocks_value_text` | — | A string literal | `'text'` | 153 153 + | true / false | `tidyblocks_value_logical` | — | A boolean literal | `True` / `False` | 154 154 + | date YYYY-MM-DD | `tidyblocks_value_datetime` | — | A date/time constant | `pd.Timestamp('YYYY-MM-DD')` | 155 155 + | missing | `tidyblocks_value_missing` | `NA` | An explicit missing (NA/NaN) value | `float("nan")` | 156 156 + | Normal(mean μ std σ) | `tidyblocks_value_normal` | — | Draw a column of values from a Normal distribution | `np.random.normal(μ, σ, len(_df))` | 157 157 + | Uniform(low a high b) | `tidyblocks_value_uniform` | — | Draw a column of values from a Uniform distribution | `np.random.uniform(a, b, len(_df))` | 158 158 + | Exponential(lambda λ) | `tidyblocks_value_exponential` | — | Draw a column of values from an Exponential distribution | `np.random.exponential(1/λ, len(_df))` | 159 159 + 160 160 + --- 161 161 + 162 162 + ## Operations `#F9B5B2` 163 163 + 164 164 + Operation blocks are expression blocks used inside **filter**, **mutate**, 165 165 + **fill missing**, and similar blocks. They take value inputs and return a 166 166 + computed value. 167 167 + 168 168 + ### Numeric & comparison 169 169 + 170 170 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 171 171 + |---|---|---|---|---| 172 172 + | a + b, a - b, a × b, a ÷ b, a % b, a ** b | `tidyblocks_op_arithmetic` | — | Standard arithmetic on two values | `(a + b)`, `(a * b)`, etc. 
| 173 173 + | a = b, a ≠ b, a < b, a ≤ b, a > b, a ≥ b | `tidyblocks_op_compare` | — | Element-wise comparison, returning a boolean column | `(a == b)`, `(a < b)`, etc. | 174 174 + | x between left and right | `tidyblocks_op_between` | `between()` | Return `True` for values within the inclusive range `[left, right]` | `x.between(left, right)` | 175 175 + | abs / round / floor / ceil / sqrt / log / exp ( val ) | `tidyblocks_op_math` | — | Apply a standard math function to a column | `val.abs()`, `np.sqrt(val)`, etc. | 176 176 + 177 177 + ### Logic 178 178 + 179 179 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 180 180 + |---|---|---|---|---| 181 181 + | a AND b / a OR b | `tidyblocks_op_logic` | — | Element-wise logical AND/OR on two boolean columns | `(a & b)`, `(a \| b)` | 182 182 + | NOT val | `tidyblocks_op_not` | — | Element-wise logical NOT | `~(val)` | 183 183 + | if cond then x else y | `tidyblocks_op_ifelse` | `if_else()` | Return `x` where `cond` is `True`, `y` elsewhere | `np.where(cond, x, y)` | 184 184 + | coalesce val with replacement | `tidyblocks_op_coalesce` | `coalesce()` | Replace missing values in `val` with values from `replacement` | `val.fillna(replacement)` | 185 185 + 186 186 + ### Type operations 187 187 + 188 188 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 189 189 + |---|---|---|---|---| 190 190 + | val is missing / is number / is text / is date / is boolean | `tidyblocks_op_typecheck` | — | Test whether each element matches a specific type | `val.isna()`, `val.apply(isinstance(...))`, etc. | 191 191 + | convert val to number / text / bool / datetime | `tidyblocks_op_convert` | — | Cast a column to a different type | `pd.to_numeric(val)`, `val.astype(str)`, etc. 
| 192 192 + 193 193 + ### Date & time 194 194 + 195 195 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 196 196 + |---|---|---|---|---| 197 197 + | year / month / day / weekday / hour / minute / second of val | `tidyblocks_op_datetime` | — | Extract a calendar component from a datetime column | `val.dt.year`, `val.dt.month`, etc. | 198 198 + 199 199 + ### Window & ranking 200 200 + 201 201 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 202 202 + |---|---|---|---|---| 203 203 + | shift val by N | `tidyblocks_op_shift` | `lag()` / `lead()` | Shift values forward (positive N = lag) or backward (negative N = lead) | `val.shift(N)` | 204 204 + | n_distinct val | `tidyblocks_op_n_distinct` | `n_distinct()` | Count the number of distinct (unique) values in a column | `val.nunique()` | 205 205 + 206 206 + ### String 207 207 + 208 208 + | Block label | Block type | dplyr equivalent | What it does | Python generated | 209 209 + |---|---|---|---|---| 210 210 + | val . upper / lower / strip / length | `tidyblocks_op_string` | — | Apply a string operation to a text column | `val.str.upper()`, `val.str.len()`, etc. | 211 211 + | val contains pattern | `tidyblocks_op_str_contains` | `stringr::str_detect()` | Return `True` where the string column matches a pattern | `val.str.contains('pattern', na=False)` | 212 212 + 213 213 + --- 214 214 + 215 215 + ## Pipeline rules 216 216 + 217 217 + | Block role | Has top connector | Has bottom connector | Examples | 218 218 + |---|---|---|---| 219 219 + | **Source** | No | Yes | all Data blocks | 220 220 + | **Transform** | Yes | Yes | filter, mutate, arrange, … | 221 221 + | **Terminal** | Yes | No | display table, all Plot blocks, all Stats blocks | 222 222 + 223 223 + A valid pipeline must be: **one source → zero or more transforms → one terminal**.
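The pipeline rule stated at the end of the block reference (one source, zero or more transforms, one terminal) can be sketched as a small check. This is only an illustration: the role strings and the `is_valid_pipeline` helper are hypothetical names, not part of the extension, which enforces the rule through Blockly connectors rather than a function like this.

```python
from typing import Sequence

# Role names mirror the "Pipeline rules" table; the strings themselves are assumptions.
SOURCE, TRANSFORM, TERMINAL = "source", "transform", "terminal"


def is_valid_pipeline(roles: Sequence[str]) -> bool:
    """Return True for: one source, then any number of transforms, then one terminal."""
    if len(roles) < 2:
        # A pipeline needs at least a source and a terminal.
        return False
    first, *middle, last = roles
    return (
        first == SOURCE
        and last == TERMINAL
        and all(role == TRANSFORM for role in middle)
    )


# e.g. penguins dataset -> filter -> mutate -> display table
print(is_valid_pipeline([SOURCE, TRANSFORM, TRANSFORM, TERMINAL]))
# zero transforms is allowed: a source chained straight into a plot
print(is_valid_pipeline([SOURCE, TERMINAL]))
# a transform cannot start a pipeline
print(is_valid_pipeline([TRANSFORM, TERMINAL]))
```

Under this reading, a chain that ends in a transform such as **save as** would not count as a complete pipeline; whether the editor treats it that way is not stated in the reference.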

+76 -2

docs/work-summary.md


··· 211 211 212 212 --- 213 213 214 214 - ## 11. Documentation 214 214 + ## 12. npm migration 215 215 + 216 216 + **Problem:** The project used Yarn 4 (`packageManager: "yarn@4.6.0"`) but a 217 217 + stray `package-lock.json` and npm-installed `node_modules` had accumulated 218 218 + alongside it, creating an inconsistent state. 219 219 + 220 220 + **What was done:** 221 221 + - Removed `packageManager: "yarn@4.6.0"` and replaced with `"npm@11.1.0"`. 222 222 + - Converted `"resolutions"` → `"overrides"` (npm 8.3+ equivalent). 223 223 + - Converted `"workspaces"` from Yarn's `{ "packages": [...] }` object form to 224 224 + npm's array form `["packages/*"]`. 225 225 + - Replaced all `jlpm` references in `packages/blockly-extension/package.json` 226 226 + scripts with `npm run`. 227 227 + - Replaced `jlpm` references in root `lint` / `prettier` scripts with 228 228 + `npm run`. 229 229 + - Deleted `.yarnrc.yml`, `.yarn/` cache directory, and `yarn.lock`. 230 230 + - Updated `.gitignore` to track `node_modules/`, `package-lock.json`, and 231 231 + `yarn.lock` (the last is a build side-effect from `@jupyterlab/builder`'s 232 232 + bundled `jlpm`, which cannot be avoided). 233 233 + 234 234 + **Note:** `jlpm` (a yarn shim bundled inside the `jupyterlab` Python package) 235 235 + is called internally by `jupyter labextension build` and will always 236 236 + regenerate `yarn.lock` during a build. This is an implementation detail of 237 237 + `@jupyterlab/builder` that cannot be configured away; the file is gitignored. 238 238 + 239 239 + --- 240 240 + 241 241 + ## 13. dplyr alignment and new blocks 242 242 + 243 243 + **Motivation:** dplyr (R tidyverse) is the reference vocabulary for tidy-data 244 244 + analysis. Aligning block names to dplyr verbs makes the extension more 245 245 + intuitive for data scientists familiar with either R or the tidy-data 246 246 + paradigm. 
247 247 + 248 248 + ### Renames (7 blocks) 249 249 + 250 250 + | Old block label | New block label | dplyr verb | 251 251 + |---|---|---| 252 252 + | `create column` | `mutate` | `mutate()` | 253 253 + | `sort by` | `arrange by` | `arrange()` | 254 254 + | `unique by` | `distinct by` | `distinct()` | 255 255 + | `first N rows` | `slice_head N rows` | `slice_head()` | 256 256 + | `last N rows` | `slice_tail N rows` | `slice_tail()` | 257 257 + | `sample N rows` | `slice_sample N rows` | `slice_sample()` | 258 258 + | `glue with` | `bind_rows with` | `bind_rows()` | 259 259 + 260 260 + Internal block type names were updated to match 261 261 + (e.g. `tidyblocks_transform_create` → `tidyblocks_transform_mutate`). 262 262 + 263 263 + ### New blocks (10 blocks) 264 264 + 265 265 + **Transform** 266 266 + - `count by cols` — `count()`: count rows per combination of columns 267 267 + - `relocate cols before/after anchor` — `relocate()`: move columns to a new position 268 268 + - `slice_min N rows by col` — `slice_min()`: keep N rows with smallest values 269 269 + - `slice_max N rows by col` — `slice_max()`: keep N rows with largest values 270 270 + 271 271 + **Combine** 272 272 + - `semi join` — `semi_join()`: filtering join, keep matched rows (no new columns) 273 273 + - `anti join` — `anti_join()`: filtering join, keep unmatched rows 274 274 + - `bind_cols with` — `bind_cols()`: horizontally bind two DataFrames by column position 275 275 + 276 276 + **Operations** 277 277 + - `between left and right` — `between()`: inclusive range check 278 278 + - `coalesce val with replacement` — `coalesce()`: first non-missing value 279 279 + - `n_distinct val` — `n_distinct()`: count unique values 280 280 + 281 281 + Also added `n distinct` as an option to the **summarize** block's function dropdown. 282 282 + 283 283 + --- 284 284 + 285 285 + ## 14. 
Documentation 286 286 + 287 287 + (was §11) 215 288 216 289 | Document | Description | 217 290 |---|---| 218 291 | `docs/getting-started.md` | Step-by-step guide: install, launch JupyterLab, create a `.jpblockly` file, build a penguins pipeline, run it, and see output. | 219 292 | `docs/modernization-plan.md` | Updated to reflect completed phases, corrected version numbers (JupyterLab 4.5 not 4.6), and added a new Phase 6 documenting all the fixes in this work. | 220 220 - | `docs/architecture.md` *(this work)* | Full architecture reference: package layout, data-flow diagram, object relationships, block pipeline conventions, code generation pattern, and extension points. | 293 293 + | `docs/architecture.md` | Full architecture reference: package layout, data-flow diagram, object relationships, block pipeline conventions, code generation pattern, and extension points. | 294 294 + | `docs/blocks-reference.md` | Complete block reference: every block organized by category, with its block type name, dplyr equivalent, description, and generated Python. | 221 295 | `CHANGELOG.md` | Rewrote the `0.1.0` entry with accurate versions and full sections covering all new features, the rebrand, dependency upgrades, build fixes, bug fixes, and docs. |
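The new Transform blocks listed in section 13 map onto a handful of pandas calls. The sketch below runs the calls that the block reference documents for `count`, `slice_min`, and `slice_max` on a made-up table. The `relocate` lines are an assumption, since the reference only says the block "reorders `_df.columns`" and does not show the generated code.

```python
import pandas as pd

# Toy data; column names and values are invented for illustration.
_df = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b", "b"],
        "island": ["x", "x", "x", "y", "y"],
        "mass": [10, 30, 20, 50, 40],
    }
)

# count(): one row per combination of the grouping columns, tally in column "n".
counts = (
    _df.groupby(["species", "island"], as_index=False)
    .size()
    .rename(columns={"size": "n"})
)

# slice_min() / slice_max(): the N rows with the smallest / largest values.
lightest = _df.nsmallest(2, "mass")
heaviest = _df.nlargest(2, "mass")

# relocate(): ASSUMED implementation, moving "mass" before the anchor "species".
cols = [c for c in _df.columns if c != "mass"]
anchor = cols.index("species")
relocated = _df[cols[:anchor] + ["mass"] + cols[anchor:]]

print(counts)
print(lightest["mass"].tolist(), heaviest["mass"].tolist())
print(list(relocated.columns))
```

One difference worth knowing: dplyr's `slice_min()` and `slice_max()` keep tied rows by default (`with_ties = TRUE`), while `nsmallest` and `nlargest` return exactly N rows unless `keep='all'` is passed.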

+1 -1

jupyter_tidyblocks/labextension/package.json


···
      }
    },
    "_build": {
-     "load": "static/remoteEntry.898fe8e70b536ba95137.js",
+     "load": "static/remoteEntry.a2e6fa6b678931659f31.js",
      "extension": "./extension",
      "style": "./style"
    }

+3 -3

packages/blockly-extension/src/index.ts


···
    // Handle state restoration.
    if (restorer) {
      // When restoring the app, if the document was open, reopen it
-     restorer.restore(tracker, {
+     restorer.restore(tracker as any, {
        command: 'docmanager:open',
-       args: widget => ({ path: widget.context.path, factory: FACTORY }),
-       name: widget => widget.context.path
+       args: (widget: any) => ({ path: widget.context.path, factory: FACTORY }),
+       name: (widget: any) => widget.context.path
      });
    }


+47 -2

packages/tidyblocks/src/blocks/combine.ts


···

  Blockly.defineBlocksWithJsonArray([
    {
+     // dplyr: inner_join / left_join / right_join / full_join
      type: 'tidyblocks_combine_join',
      message0: '%1 join %2 on left %3 = right %4',
      args0: [
···
        'Join the current DataFrame with a named DataFrame on matching columns.'
    },
    {
-     type: 'tidyblocks_combine_glue',
-     message0: 'glue with %1 label column %2',
+     // dplyr: bind_rows() — vertically stack two DataFrames
+     type: 'tidyblocks_combine_bind_rows',
+     message0: 'bind_rows with %1 label column %2',
      args0: [
        { type: 'field_input', name: 'OTHER_DF', text: 'other_df' },
        { type: 'field_input', name: 'LABEL_COL', text: 'source' }
···
        'Vertically stack the current DataFrame with another, adding a label column.'
    },
    {
+     // dplyr: cross_join() — Cartesian product
      type: 'tidyblocks_combine_cross_join',
      message0: 'cross join with %1',
      args0: [{ type: 'field_input', name: 'RIGHT_DF', text: 'other_df' }],
···
      nextStatement: null,
      colour: '#808080',
      tooltip: 'Cartesian product of the current DataFrame with another.'
+   },
+   {
+     // dplyr: semi_join() — keep rows in _df that have a match in other_df
+     // (no columns from other_df are added)
+     type: 'tidyblocks_combine_semi_join',
+     message0: 'semi join %1 on left %2 = right %3',
+     args0: [
+       { type: 'field_input', name: 'RIGHT_DF', text: 'other_df' },
+       { type: 'field_input', name: 'LEFT_ON', text: 'id' },
+       { type: 'field_input', name: 'RIGHT_ON', text: 'id' }
+     ],
+     previousStatement: null,
+     nextStatement: null,
+     colour: '#808080',
+     tooltip:
+       'Keep only rows from the current DataFrame that have a match in the other DataFrame. No columns from the other DataFrame are added.'
+   },
+   {
+     // dplyr: anti_join() — keep rows in _df that have NO match in other_df
+     type: 'tidyblocks_combine_anti_join',
+     message0: 'anti join %1 on left %2 = right %3',
+     args0: [
+       { type: 'field_input', name: 'RIGHT_DF', text: 'other_df' },
+       { type: 'field_input', name: 'LEFT_ON', text: 'id' },
+       { type: 'field_input', name: 'RIGHT_ON', text: 'id' }
+     ],
+     previousStatement: null,
+     nextStatement: null,
+     colour: '#808080',
+     tooltip:
+       'Keep only rows from the current DataFrame that have no match in the other DataFrame.'
+   },
+   {
+     // dplyr: bind_cols() — horizontally bind two DataFrames by column position
+     type: 'tidyblocks_combine_bind_cols',
+     message0: 'bind_cols with %1',
+     args0: [{ type: 'field_input', name: 'OTHER_DF', text: 'other_df' }],
+     previousStatement: null,
+     nextStatement: null,
+     colour: '#808080',
+     tooltip:
+       'Horizontally bind the current DataFrame with another by column position. Both must have the same number of rows.'
    }
  ]);
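To make the difference between the two filtering joins concrete, here is a minimal sketch of the pandas expressions that `docs/blocks-reference.md` lists for the semi join and anti join blocks. The table contents are invented, and `other_df` and `id` are simply the default field values from the block definitions above.

```python
import pandas as pd

# Toy tables; values are made up. Note that id 4 appears twice in other_df.
_df = pd.DataFrame({"id": [1, 2, 3, 4], "species": ["a", "b", "c", "d"]})
other_df = pd.DataFrame({"id": [2, 4, 4, 5], "island": ["x", "y", "z", "w"]})

# semi join: keep rows of _df whose key appears in other_df; no columns are added.
semi = _df[_df["id"].isin(other_df["id"])]

# anti join: keep rows of _df whose key does NOT appear in other_df.
anti = _df[~_df["id"].isin(other_df["id"])]

# For contrast, an inner merge repeats a left row once per matching right row.
inner = pd.merge(_df, other_df, left_on="id", right_on="id", how="inner")

print(semi["id"].tolist())
print(anti["id"].tolist())
print(len(semi), len(inner))
```

Because `isin` only tests membership, duplicate keys in the right table never multiply rows on the left. That is the property that separates a filtering join from a mutating one.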

+37

packages/tidyblocks/src/blocks/op.ts


···
      colour: '#F9B5B2',
      inputsInline: true,
      tooltip: 'Check whether a string column contains a pattern.'
+   },
+   // dplyr: between() — test whether values fall within an inclusive range
+   {
+     type: 'tidyblocks_op_between',
+     message0: '%1 between %2 and %3',
+     args0: [
+       { type: 'input_value', name: 'VALUE' },
+       { type: 'field_number', name: 'LEFT', value: 0 },
+       { type: 'field_number', name: 'RIGHT', value: 1 }
+     ],
+     output: 'Boolean',
+     colour: '#F9B5B2',
+     inputsInline: true,
+     tooltip: 'Return True for values within the inclusive range [left, right].'
+   },
+   // dplyr: coalesce() — return the first non-missing value across columns
+   {
+     type: 'tidyblocks_op_coalesce',
+     message0: 'coalesce %1 with %2',
+     args0: [
+       { type: 'input_value', name: 'VALUE' },
+       { type: 'input_value', name: 'REPLACEMENT' }
+     ],
+     output: null,
+     colour: '#F9B5B2',
+     inputsInline: true,
+     tooltip: 'Replace missing values in a column with values from another column or expression.'
+   },
+   // dplyr: n_distinct() — count of unique values in a column
+   {
+     type: 'tidyblocks_op_n_distinct',
+     message0: 'n_distinct %1',
+     args0: [{ type: 'input_value', name: 'VALUE' }],
+     output: 'Number',
+     colour: '#F9B5B2',
+     inputsInline: true,
+     tooltip: 'Count the number of distinct (unique) values in a column.'
    }
  ]);
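The three new operation blocks correspond to the pandas expressions given in the block reference: `x.between(left, right)`, `val.fillna(replacement)`, and `val.nunique()`. The sketch below exercises them on a small series with one missing value, mainly to show how each treats NaN. The data is invented.

```python
import pandas as pd

val = pd.Series([0.0, 0.5, 1.0, None, 2.0])  # None is stored as NaN
replacement = pd.Series([9.0, 9.0, 9.0, 9.0, 9.0])

# between(): inclusive on both ends by default; a missing value yields False.
in_range = val.between(0, 1)

# coalesce(): two-argument form, taking the replacement where val is missing.
filled = val.fillna(replacement)

# n_distinct(): nunique() skips NaN unless dropna=False is passed.
distinct = val.nunique()

print(in_range.tolist())
print(filled.tolist())
print(distinct, val.nunique(dropna=False))
```

The last line points at a small semantic gap: dplyr's `n_distinct()` counts `NA` as a value unless `na.rm = TRUE`, whereas `nunique()` drops missing values by default, so the two can differ by one on columns that contain missing data.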

+81 -12

packages/tidyblocks/src/blocks/transform.ts

···
   ['max', 'max'],
   ['std dev', 'std'],
   ['variance', 'var'],
+  ['n distinct', 'nunique'],
   ['any', 'any'],
   ['all', 'all']
 ];
···
     tooltip: 'Keep only rows matching a condition.'
   },
   {
+    // dplyr: select() — keep named columns
     type: 'tidyblocks_transform_select',
     message0: 'select columns %1',
     args0: [{ type: 'field_input', name: 'COLUMNS', text: 'col1, col2' }],
···
     tooltip: 'Remove the specified columns (comma-separated).'
   },
   {
-    type: 'tidyblocks_transform_create',
-    message0: 'create column %1 as %2',
+    // dplyr: mutate() — create or overwrite a column
+    type: 'tidyblocks_transform_mutate',
+    message0: 'mutate %1 = %2',
     args0: [
       { type: 'field_input', name: 'COLUMN', text: 'new_col' },
       { type: 'input_value', name: 'EXPRESSION' }
···
     tooltip: 'Add or replace a column using an expression.'
   },
   {
+    // dplyr: rename() — rename a single column
     type: 'tidyblocks_transform_rename',
     message0: 'rename %1 to %2',
     args0: [
···
     tooltip: 'Rename a column.'
   },
   {
-    type: 'tidyblocks_transform_sort',
-    message0: 'sort by %1 %2',
+    // dplyr: arrange() — order rows by column values
+    type: 'tidyblocks_transform_arrange',
+    message0: 'arrange by %1 %2',
     args0: [
       { type: 'field_input', name: 'COLUMNS', text: 'col1' },
       {
···
     tooltip: 'Sort rows by one or more columns (comma-separated).'
   },
   {
-    type: 'tidyblocks_transform_unique',
-    message0: 'unique by %1',
+    // dplyr: distinct() — keep unique rows
+    type: 'tidyblocks_transform_distinct',
+    message0: 'distinct by %1',
     args0: [{ type: 'field_input', name: 'COLUMNS', text: 'col1' }],
     previousStatement: null,
     nextStatement: null,
···
     tooltip: 'Keep only rows with distinct values in the specified columns.'
   },
   {
+    // dplyr: group_by() — group rows by column values
     type: 'tidyblocks_transform_groupby',
     message0: 'group by %1',
     args0: [{ type: 'field_input', name: 'COLUMNS', text: 'col1' }],
···
     tooltip: 'Group rows by column values for subsequent summarize or running.'
   },
   {
+    // dplyr: ungroup() — remove grouping
     type: 'tidyblocks_transform_ungroup',
     message0: 'ungroup',
     previousStatement: null,
···
     tooltip: 'Remove grouping and reset the index.'
   },
   {
+    // dplyr: summarize() — aggregate groups to one row each
     type: 'tidyblocks_transform_summarize',
     message0: 'summarize %1 of %2 as %3',
     args0: [
···
     tooltip: 'Drop rows that have missing values in the specified columns.'
   },
   {
-    type: 'tidyblocks_transform_sample',
-    message0: 'sample %1 rows',
+    // dplyr: slice_sample() — random sample of N rows
+    type: 'tidyblocks_transform_slice_sample',
+    message0: 'slice_sample %1 rows',
     args0: [
       { type: 'field_number', name: 'N', value: 10, min: 1, precision: 1 }
     ],
···
     tooltip: 'Randomly sample N rows from the DataFrame.'
   },
   {
-    type: 'tidyblocks_transform_head',
-    message0: 'first %1 rows',
+    // dplyr: slice_head() — first N rows
+    type: 'tidyblocks_transform_slice_head',
+    message0: 'slice_head %1 rows',
     args0: [
       { type: 'field_number', name: 'N', value: 10, min: 1, precision: 1 }
     ],
···
     tooltip: 'Keep only the first N rows.'
   },
   {
-    type: 'tidyblocks_transform_tail',
-    message0: 'last %1 rows',
+    // dplyr: slice_tail() — last N rows
+    type: 'tidyblocks_transform_slice_tail',
+    message0: 'slice_tail %1 rows',
     args0: [
       { type: 'field_number', name: 'N', value: 10, min: 1, precision: 1 }
     ],
···
     nextStatement: null,
     colour: '#76AADB',
     tooltip: 'Keep only the last N rows.'
+  },
+  {
+    // dplyr: slice_min() — N rows with smallest values in a column
+    type: 'tidyblocks_transform_slice_min',
+    message0: 'slice_min %1 rows by %2',
+    args0: [
+      { type: 'field_number', name: 'N', value: 5, min: 1, precision: 1 },
+      { type: 'field_input', name: 'COLUMN', text: 'col1' }
+    ],
+    previousStatement: null,
+    nextStatement: null,
+    colour: '#76AADB',
+    tooltip: 'Keep the N rows with the smallest values in a column.'
+  },
+  {
+    // dplyr: slice_max() — N rows with largest values in a column
+    type: 'tidyblocks_transform_slice_max',
+    message0: 'slice_max %1 rows by %2',
+    args0: [
+      { type: 'field_number', name: 'N', value: 5, min: 1, precision: 1 },
+      { type: 'field_input', name: 'COLUMN', text: 'col1' }
+    ],
+    previousStatement: null,
+    nextStatement: null,
+    colour: '#76AADB',
+    tooltip: 'Keep the N rows with the largest values in a column.'
+  },
+  {
+    // dplyr: count() — count rows per group (or total if ungrouped)
+    type: 'tidyblocks_transform_count',
+    message0: 'count by %1',
+    args0: [{ type: 'field_input', name: 'COLUMNS', text: 'col1' }],
+    previousStatement: null,
+    nextStatement: null,
+    colour: '#76AADB',
+    tooltip: 'Count rows for each combination of the specified columns.'
+  },
+  {
+    // dplyr: relocate() — move column(s) to before or after another column
+    type: 'tidyblocks_transform_relocate',
+    message0: 'relocate %1 %2 %3',
+    args0: [
+      { type: 'field_input', name: 'COLUMNS', text: 'col1' },
+      {
+        type: 'field_dropdown',
+        name: 'POSITION',
+        options: [
+          ['before', 'before'],
+          ['after', 'after']
+        ]
+      },
+      { type: 'field_input', name: 'ANCHOR', text: 'col2' }
+    ],
+    previousStatement: null,
+    nextStatement: null,
+    colour: '#76AADB',
+    tooltip: 'Move one or more columns to before or after a reference column.'
   },
   {
     type: 'tidyblocks_transform_display',
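
The new `['n distinct', 'nunique']` dropdown option only has an effect if the summarize generator (not shown in this hunk) forwards the dropdown value to pandas as an aggregation name. A minimal sketch under that assumption; the column names `g` and `x` and the output name `n_x` are hypothetical:

```python
import pandas as pd

# Hypothetical input; `_df` mirrors the variable name used by the generated code.
_df = pd.DataFrame({"g": ["a", "a", "b", "b", "b"], "x": [1, 1, 2, 3, 3]})

# Named aggregation: one output column `n_x` holding the distinct count of `x` per group.
summary = _df.groupby("g", as_index=False).agg(n_x=("x", "nunique"))

print(summary)
```

pandas' `nunique` skips missing values by default, so a group containing `NaN` may report a smaller count than dplyr's `n_distinct()`, which counts `NA` as a value unless `na.rm = TRUE`.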

+31 -1

packages/tidyblocks/src/generators/python/combine.ts

···
 import { pythonGenerator, Order } from 'blockly/python';

+// dplyr: inner_join / left_join / right_join / full_join
 pythonGenerator.forBlock['tidyblocks_combine_join'] = block => {
   const how = block.getFieldValue('HOW');
   const rightDf = block.getFieldValue('RIGHT_DF');
···
   );
 };

-pythonGenerator.forBlock['tidyblocks_combine_glue'] = block => {
+// dplyr: bind_rows() — vertically stack with a label column
+pythonGenerator.forBlock['tidyblocks_combine_bind_rows'] = block => {
   const otherDf = block.getFieldValue('OTHER_DF');
   const labelCol = block.getFieldValue('LABEL_COL');
   return (
···
   );
 };

+// dplyr: cross_join() — Cartesian product
 pythonGenerator.forBlock['tidyblocks_combine_cross_join'] = block => {
   const rightDf = block.getFieldValue('RIGHT_DF');
   return `_df = _df.merge(${rightDf}, how='cross')\n`;
+};
+
+// dplyr: semi_join() — filtering join; keep rows that have a match
+// pandas has no native semi_join, so we use merge + filtering
+pythonGenerator.forBlock['tidyblocks_combine_semi_join'] = block => {
+  const rightDf = block.getFieldValue('RIGHT_DF');
+  const leftOn = block.getFieldValue('LEFT_ON');
+  const rightOn = block.getFieldValue('RIGHT_ON');
+  return (
+    `_df = _df[_df['${leftOn}'].isin(${rightDf}['${rightOn}'])]\n`
+  );
+};
+
+// dplyr: anti_join() — filtering join; keep rows that have no match
+pythonGenerator.forBlock['tidyblocks_combine_anti_join'] = block => {
+  const rightDf = block.getFieldValue('RIGHT_DF');
+  const leftOn = block.getFieldValue('LEFT_ON');
+  const rightOn = block.getFieldValue('RIGHT_ON');
+  return (
+    `_df = _df[~_df['${leftOn}'].isin(${rightDf}['${rightOn}'])]\n`
+  );
+};
+
+// dplyr: bind_cols() — horizontally bind by column position
+pythonGenerator.forBlock['tidyblocks_combine_bind_cols'] = block => {
+  const otherDf = block.getFieldValue('OTHER_DF');
+  return `_df = pd.concat([_df.reset_index(drop=True), ${otherDf}.reset_index(drop=True)], axis=1)\n`;
 };

 export { Order };
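
For reference, this is roughly the Python that the three new combine generators emit, run against small hypothetical frames. Note that the semi and anti joins are implemented with `isin` on a single key column rather than with `merge`, despite what the code comment says. The frame contents and the names `semi`, `anti`, `left` and `bound` are assumptions for illustration; the generated code assigns back to `_df` instead.

```python
import pandas as pd

# Hypothetical sample frames; `_df` and `other_df` mirror the block defaults.
_df = pd.DataFrame({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
other_df = pd.DataFrame({"id": [2, 4, 4, 5], "score": [10, 20, 30, 40]})

# semi_join: keep rows of _df whose key appears in other_df.
# Duplicate keys on the right do not duplicate rows on the left.
semi = _df[_df["id"].isin(other_df["id"])]

# anti_join: keep rows of _df whose key does not appear in other_df.
anti = _df[~_df["id"].isin(other_df["id"])]

# bind_cols: resetting both indexes makes concat align rows by position,
# even when the left frame carries a scrambled index.
left = _df.iloc[[3, 2, 1, 0]]
bound = pd.concat(
    [left.reset_index(drop=True), other_df.reset_index(drop=True)], axis=1
)

print(semi["id"].tolist())  # [2, 4]
print(anti["id"].tolist())  # [1, 3]
print(bound.shape)          # (4, 4)
```

Two differences from dplyr are worth keeping in mind when reviewing: `pd.concat` keeps duplicate column labels (here two `id` columns) where `bind_cols()` would repair the names, and it pads with `NaN` when the row counts differ instead of raising an error, so the "same number of rows" requirement in the tooltip is not enforced by the generated code.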

+21

packages/tidyblocks/src/generators/python/op.ts

···
   return [`(${val}).str.contains('${pattern}', na=False)`, Order.FUNCTION_CALL];
 };

+// dplyr: between(x, left, right) — inclusive range check
+pythonGenerator.forBlock['tidyblocks_op_between'] = (block, generator) => {
+  const val = generator.valueToCode(block, 'VALUE', Order.NONE) || '_df.iloc[:, 0]';
+  const left = block.getFieldValue('LEFT');
+  const right = block.getFieldValue('RIGHT');
+  return [`(${val}).between(${left}, ${right})`, Order.FUNCTION_CALL];
+};
+
+// dplyr: coalesce(x, y) — first non-missing value
+pythonGenerator.forBlock['tidyblocks_op_coalesce'] = (block, generator) => {
+  const val = generator.valueToCode(block, 'VALUE', Order.NONE) || '_df.iloc[:, 0]';
+  const replacement = generator.valueToCode(block, 'REPLACEMENT', Order.NONE) || 'None';
+  return [`(${val}).fillna(${replacement})`, Order.FUNCTION_CALL];
+};
+
+// dplyr: n_distinct(x) — count unique values
+pythonGenerator.forBlock['tidyblocks_op_n_distinct'] = (block, generator) => {
+  const val = generator.valueToCode(block, 'VALUE', Order.NONE) || '_df.iloc[:, 0]';
+  return [`(${val}).nunique()`, Order.FUNCTION_CALL];
+};
+
 export { Order };
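
A small sketch of the expressions these three generators produce, evaluated on a hypothetical frame with columns `x` and `y` (both names are assumptions; in the blocks, the operands come from whatever value blocks are plugged into `VALUE` and `REPLACEMENT`):

```python
import numpy as np
import pandas as pd

# Hypothetical input with one missing value in `x`.
_df = pd.DataFrame({"x": [0.5, 2.0, np.nan, 1.0], "y": [9.0, 9.0, 7.0, 9.0]})

# between: inclusive on both ends by default, matching the block tooltip.
in_range = (_df["x"]).between(0, 1)

# coalesce: take `x` where present, otherwise fall back to `y`.
filled = (_df["x"]).fillna(_df["y"])

# n_distinct: number of unique non-missing values.
n_unique = (_df["y"]).nunique()

print(in_range.tolist())  # [True, False, False, True]
print(filled.tolist())    # [0.5, 2.0, 7.0, 1.0]
print(n_unique)           # 2
```

Missing values are where the pandas output diverges from dplyr: `Series.between` returns `False` for `NaN`, whereas `between()` in R returns `NA`. The generator's `'None'` fallback for an unconnected `REPLACEMENT` input would emit `fillna(None)`, which pandas is expected to reject with a `ValueError`, so the sketch deliberately avoids that case.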

+47 -6

packages/tidyblocks/src/generators/python/transform.ts

···
   return `_df = _df.drop(columns=${cols})\n`;
 };

-pythonGenerator.forBlock['tidyblocks_transform_create'] = (
+// dplyr: mutate() — create or overwrite a column
+pythonGenerator.forBlock['tidyblocks_transform_mutate'] = (
   block,
   generator
 ) => {
···
   return `_df = _df.assign(**{'${col}': ${expr}})\n`;
 };

+// dplyr: rename() — rename new_name = old_name
 pythonGenerator.forBlock['tidyblocks_transform_rename'] = block => {
   const oldName = block.getFieldValue('OLD_NAME');
   const newName = block.getFieldValue('NEW_NAME');
   return `_df = _df.rename(columns={'${oldName}': '${newName}'})\n`;
 };

-pythonGenerator.forBlock['tidyblocks_transform_sort'] = block => {
+// dplyr: arrange() — order rows by column values
+pythonGenerator.forBlock['tidyblocks_transform_arrange'] = block => {
   const cols = toCols(block.getFieldValue('COLUMNS'));
   const asc = block.getFieldValue('ORDER');
   return `_df = _df.sort_values(by=${cols}, ascending=${asc})\n`;
 };

-pythonGenerator.forBlock['tidyblocks_transform_unique'] = block => {
+// dplyr: distinct() — keep unique rows
+pythonGenerator.forBlock['tidyblocks_transform_distinct'] = block => {
   const cols = toCols(block.getFieldValue('COLUMNS'));
   return `_df = _df.drop_duplicates(subset=${cols})\n`;
 };
···
   return `_df = _df.dropna(subset=${cols})\n`;
 };

-pythonGenerator.forBlock['tidyblocks_transform_sample'] = block => {
+// dplyr: slice_sample() — random N rows
+pythonGenerator.forBlock['tidyblocks_transform_slice_sample'] = block => {
   const n = block.getFieldValue('N');
   return `_df = _df.sample(n=${n})\n`;
 };

-pythonGenerator.forBlock['tidyblocks_transform_head'] = block => {
+// dplyr: slice_head() — first N rows
+pythonGenerator.forBlock['tidyblocks_transform_slice_head'] = block => {
   const n = block.getFieldValue('N');
   return `_df = _df.head(${n})\n`;
 };

-pythonGenerator.forBlock['tidyblocks_transform_tail'] = block => {
+// dplyr: slice_tail() — last N rows
+pythonGenerator.forBlock['tidyblocks_transform_slice_tail'] = block => {
   const n = block.getFieldValue('N');
   return `_df = _df.tail(${n})\n`;
+};
+
+// dplyr: slice_min() — N rows with smallest values in a column
+pythonGenerator.forBlock['tidyblocks_transform_slice_min'] = block => {
+  const n = block.getFieldValue('N');
+  const col = block.getFieldValue('COLUMN');
+  return `_df = _df.nsmallest(${n}, '${col}')\n`;
+};
+
+// dplyr: slice_max() — N rows with largest values in a column
+pythonGenerator.forBlock['tidyblocks_transform_slice_max'] = block => {
+  const n = block.getFieldValue('N');
+  const col = block.getFieldValue('COLUMN');
+  return `_df = _df.nlargest(${n}, '${col}')\n`;
+};
+
+// dplyr: count() — count rows per combination of columns
+pythonGenerator.forBlock['tidyblocks_transform_count'] = block => {
+  const cols = toCols(block.getFieldValue('COLUMNS'));
+  return `_df = _df.groupby(${cols}, as_index=False).size().rename(columns={'size': 'n'})\n`;
+};
+
+// dplyr: relocate() — move columns before or after a reference column
+pythonGenerator.forBlock['tidyblocks_transform_relocate'] = block => {
+  const cols = block.getFieldValue('COLUMNS').split(',').map((c: string) => c.trim());
+  const position = block.getFieldValue('POSITION');
+  const anchor = block.getFieldValue('ANCHOR');
+  // Build the new column order by inserting cols before/after anchor.
+  return (
+    `_tmp_cols = [c for c in _df.columns if c not in ${JSON.stringify(cols)}]\n` +
+    `_anchor_idx = _tmp_cols.index('${anchor}')\n` +
+    `_insert_at = _anchor_idx + ${position === 'after' ? 1 : 0}\n` +
+    `_df = _df[_tmp_cols[:_insert_at] + ${JSON.stringify(cols)} + _tmp_cols[_insert_at:]]\n`
+  );
 };

 pythonGenerator.forBlock['tidyblocks_transform_display'] = _block => {
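
To make the generated Python for the new transform blocks easier to review, here is a sketch of what `slice_min`, `slice_max`, `count` and `relocate` would emit for a hypothetical frame. The column names `g`, `x`, `z` are made up, results are bound to separate names rather than back to `_df`, and `["g"]` stands in for the output of the `toCols` helper, which is assumed (not shown here) to produce a Python list literal.

```python
import pandas as pd

# Hypothetical input frame.
_df = pd.DataFrame(
    {"g": ["a", "b", "a", "c", "b", "a"], "x": [5, 3, 9, 1, 7, 2], "z": list("uvwxyz")}
)

# slice_min / slice_max: N rows with the smallest / largest values of `x`.
smallest = _df.nsmallest(2, "x")
largest = _df.nlargest(2, "x")

# count: one row per group with the row count in column `n`.
counts = _df.groupby(["g"], as_index=False).size().rename(columns={"size": "n"})

# relocate `z` before `g`, following the same four steps as the generated code.
cols = ["z"]
_tmp_cols = [c for c in _df.columns if c not in cols]  # remaining columns, original order
_anchor_idx = _tmp_cols.index("g")                     # position of the anchor column
_insert_at = _anchor_idx + 0                           # +1 would mean "after"
relocated = _df[_tmp_cols[:_insert_at] + cols + _tmp_cols[_insert_at:]]

print(smallest["x"].tolist())      # [1, 2]
print(largest["x"].tolist())       # [9, 7]
print(counts["n"].tolist())        # [3, 2, 1]
print(relocated.columns.tolist())  # ['z', 'g', 'x']
```

`nsmallest` and `nlargest` default to `keep='first'` and so return exactly N rows, while dplyr's `slice_min()` and `slice_max()` keep ties by default and may return more. In the relocate code, an anchor that is also listed in `COLUMNS` is filtered out of `_tmp_cols`, so the `.index()` call would raise a `ValueError`.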

+20 -7

packages/tidyblocks/src/toolbox.ts

···
 /**
  * Blockly toolbox definition for all jupyter-tidyblocks tidy-data blocks.
  * Color palette matches the original tidyblocks project by Greg Wilson.
+ *
+ * Block naming follows dplyr (tidyverse) conventions where an equivalent
+ * dplyr verb exists.
  */
 export const TIDYBLOCKS_TOOLBOX = {
   kind: 'categoryToolbox',
···
         { kind: 'block', type: 'tidyblocks_transform_filter' },
         { kind: 'block', type: 'tidyblocks_transform_select' },
         { kind: 'block', type: 'tidyblocks_transform_drop' },
-        { kind: 'block', type: 'tidyblocks_transform_create' },
+        { kind: 'block', type: 'tidyblocks_transform_mutate' }, // was: create
         { kind: 'block', type: 'tidyblocks_transform_rename' },
-        { kind: 'block', type: 'tidyblocks_transform_sort' },
-        { kind: 'block', type: 'tidyblocks_transform_unique' },
+        { kind: 'block', type: 'tidyblocks_transform_relocate' }, // new: dplyr relocate()
+        { kind: 'block', type: 'tidyblocks_transform_arrange' }, // was: sort
+        { kind: 'block', type: 'tidyblocks_transform_distinct' }, // was: unique
         { kind: 'block', type: 'tidyblocks_transform_groupby' },
         { kind: 'block', type: 'tidyblocks_transform_ungroup' },
         { kind: 'block', type: 'tidyblocks_transform_summarize' },
+        { kind: 'block', type: 'tidyblocks_transform_count' }, // new: dplyr count()
         { kind: 'block', type: 'tidyblocks_transform_running' },
         { kind: 'block', type: 'tidyblocks_transform_bin' },
         { kind: 'block', type: 'tidyblocks_transform_saveas' },
         { kind: 'block', type: 'tidyblocks_transform_fillna' },
         { kind: 'block', type: 'tidyblocks_transform_dropna' },
-        { kind: 'block', type: 'tidyblocks_transform_sample' },
-        { kind: 'block', type: 'tidyblocks_transform_head' },
-        { kind: 'block', type: 'tidyblocks_transform_tail' },
+        { kind: 'block', type: 'tidyblocks_transform_slice_sample' }, // was: sample
+        { kind: 'block', type: 'tidyblocks_transform_slice_head' }, // was: head
+        { kind: 'block', type: 'tidyblocks_transform_slice_tail' }, // was: tail
+        { kind: 'block', type: 'tidyblocks_transform_slice_min' }, // new: dplyr slice_min()
+        { kind: 'block', type: 'tidyblocks_transform_slice_max' }, // new: dplyr slice_max()
         { kind: 'block', type: 'tidyblocks_transform_display' }
       ]
     },
···
       colour: '#808080',
       contents: [
         { kind: 'block', type: 'tidyblocks_combine_join' },
-        { kind: 'block', type: 'tidyblocks_combine_glue' },
+        { kind: 'block', type: 'tidyblocks_combine_semi_join' }, // new: dplyr semi_join()
+        { kind: 'block', type: 'tidyblocks_combine_anti_join' }, // new: dplyr anti_join()
+        { kind: 'block', type: 'tidyblocks_combine_bind_rows' }, // was: glue
+        { kind: 'block', type: 'tidyblocks_combine_bind_cols' }, // new: dplyr bind_cols()
         { kind: 'block', type: 'tidyblocks_combine_cross_join' }
       ]
     },
···
       contents: [
         { kind: 'block', type: 'tidyblocks_op_arithmetic' },
         { kind: 'block', type: 'tidyblocks_op_compare' },
+        { kind: 'block', type: 'tidyblocks_op_between' }, // new: dplyr between()
         { kind: 'block', type: 'tidyblocks_op_logic' },
         { kind: 'block', type: 'tidyblocks_op_not' },
         { kind: 'block', type: 'tidyblocks_op_ifelse' },
+        { kind: 'block', type: 'tidyblocks_op_coalesce' }, // new: dplyr coalesce()
+        { kind: 'block', type: 'tidyblocks_op_n_distinct' }, // new: dplyr n_distinct()
         { kind: 'block', type: 'tidyblocks_op_typecheck' },
         { kind: 'block', type: 'tidyblocks_op_convert' },
         { kind: 'block', type: 'tidyblocks_op_datetime' },