···991010### New features
11111212-- **`packages/tidyblocks`**: new monorepo package providing 50+ tidy-data analysis blocks organized into seven categories — Data, Transform, Combine, Plot, Stats, Value, and Op — with Python (pandas / plotly.express / scipy / sklearn) code generators
1212+- **`packages/tidyblocks`**: new monorepo package providing 60+ tidy-data analysis blocks organized into seven categories — Data, Transform, Combine, Plot, Stats, Values, and Operations — with Python (pandas / plotly.express / scipy / sklearn) code generators
1313- Exports `registerTidyblocks(registry)` for registering all blocks and the Tidy Data toolbox with any `IBlocklyRegistry` instance
1414+- Block names aligned with [dplyr (tidyverse)](https://dplyr.tidyverse.org/) conventions; 7 blocks renamed and 10 new blocks added (see below)
1515+1616+#### dplyr alignment — renames
1717+1818+| Old | New | dplyr verb |
1919+|---|---|---|
2020+| create column | mutate | `mutate()` |
2121+| sort by | arrange by | `arrange()` |
2222+| unique by | distinct by | `distinct()` |
2323+| first N rows | slice_head | `slice_head()` |
2424+| last N rows | slice_tail | `slice_tail()` |
2525+| sample N rows | slice_sample | `slice_sample()` |
2626+| glue with | bind_rows with | `bind_rows()` |
2727+2828+#### dplyr alignment — new blocks
2929+3030+- **Transform**: `count()`, `relocate()`, `slice_min()`, `slice_max()`
3131+- **Combine**: `semi_join()`, `anti_join()`, `bind_cols()`
3232+- **Operations**: `between()`, `coalesce()`, `n_distinct()`
3333+- **summarize** block: added `n distinct` aggregate function option
14341535### Rebrand & metadata
1636···4969- `packages/tidyblocks/src/index.ts`: `registry.addToolbox` → `registry.registerToolbox` (correct method name on `IBlocklyRegistry`)
5070- Root `tsconfig.json`: added `"lib": ["ES2020", "DOM"]` to resolve `Intl.ResolvedRelativeTimeFormatOptions` error from `@jupyterlab/coreutils`
51717272+### Package manager
7373+7474+- Migrated from Yarn 4 to npm; `yarn.lock` / `.yarnrc.yml` / `.yarn/` removed; `"resolutions"` → `"overrides"`; `jlpm` replaced with `npm run` in all scripts
7575+- `yarn.lock` added to `.gitignore` (regenerated as a build side-effect by `@jupyterlab/builder`'s bundled jlpm)
7676+5277### Docs
53785454-- `docs/jupyterlab-blockly_architecture.md`: full architecture document (refers to the upstream project as `jupyterlab-blockly`)
5555-- `docs/tidyblocks-features.md`: feature inventory and port plan from gvwilson/tidyblocks
7979+- `docs/getting-started.md`: step-by-step guide for installing and testing the extension in JupyterLab
8080+- `docs/architecture.md`: full architecture reference (package layout, data-flow, extension points)
8181+- `docs/blocks-reference.md`: complete block reference with dplyr mapping, description, and generated Python for every block
8282+- `docs/work-summary.md`: narrative summary of all engineering work done in this release
5683- `docs/modernization-plan.md`: full modernization plan with phase-by-phase status
5784- `README.md`: rewritten; credits Greg Wilson's tidyblocks and QuantStack/jupyterlab-blockly
5885
+223
docs/blocks-reference.md
···11+# Block Reference
22+33+Complete reference for all blocks available in the **Tidy Data** toolbox.
44+Blocks are organized by category, matching the sidebar in the Blockly editor.
55+66+Block names follow [dplyr (tidyverse)](https://dplyr.tidyverse.org/) conventions
77+where an equivalent verb exists. The **dplyr equivalent** column shows the R
88+function that inspired the block; a dash (—) means there is no direct
99+dplyr analogue.
1010+1111+---
1212+1313+## Data `#FEBE4C`
1414+1515+Source blocks start a pipeline. They create a DataFrame stored in `_df` and
1616+have no top connector (nothing chains into them).
1717+1818+| Block label | Block type | dplyr equivalent | What it does | Python generated |
1919+|---|---|---|---|---|
2020+| penguins dataset | `tidyblocks_data_penguins` | — | Palmer Penguins dataset loaded via seaborn | `_df = sns.load_dataset('penguins')` |
2121+| colors dataset | `tidyblocks_data_colors` | — | Built-in table of 11 colors with RGB values | `_df = pd.DataFrame({...})` |
2222+| earthquakes dataset | `tidyblocks_data_earthquakes` | — | 2016 global earthquake data from gvwilson/tidyblocks | `_df = pd.read_csv('<url>')` |
2323+| sequence 1 to N as col | `tidyblocks_data_sequence` | — | Integer sequence 1..N in a named column | `_df = pd.DataFrame({'col': range(1, N+1)})` |
2424+| dataset named name | `tidyblocks_data_user` | — | Reference a DataFrame previously saved with **save as** | `_df = name.copy()` |
2525+| read CSV path | `tidyblocks_data_csv` | — | Load a CSV file from a local or remote path | `_df = pd.read_csv('path')` |
2626+2727+---
2828+2929+## Transform `#76AADB`
3030+3131+Transform blocks read from and write back to `_df`. They can be chained in
3232+any order between a source block and a terminal block.
3333+3434+### Row operations
3535+3636+| Block label | Block type | dplyr equivalent | What it does | Python generated |
3737+|---|---|---|---|---|
3838+| filter where cond | `tidyblocks_transform_filter` | `filter()` | Keep only rows where the condition is `True` | `_df = _df[cond]` |
3939+| arrange by cols ↑↓ | `tidyblocks_transform_arrange` | `arrange()` | Sort rows by one or more columns, ascending or descending | `_df = _df.sort_values(by=[...], ascending=True/False)` |
4040+| distinct by cols | `tidyblocks_transform_distinct` | `distinct()` | Remove duplicate rows, keeping one per unique combination of columns | `_df = _df.drop_duplicates(subset=[...])` |
4141+| slice_head N rows | `tidyblocks_transform_slice_head` | `slice_head()` | Keep the first N rows | `_df = _df.head(N)` |
4242+| slice_tail N rows | `tidyblocks_transform_slice_tail` | `slice_tail()` | Keep the last N rows | `_df = _df.tail(N)` |
4343+| slice_sample N rows | `tidyblocks_transform_slice_sample` | `slice_sample()` | Randomly sample N rows | `_df = _df.sample(n=N)` |
4444+| slice_min N rows by col | `tidyblocks_transform_slice_min` | `slice_min()` | Keep the N rows with the smallest values in a column | `_df = _df.nsmallest(N, 'col')` |
4545+| slice_max N rows by col | `tidyblocks_transform_slice_max` | `slice_max()` | Keep the N rows with the largest values in a column | `_df = _df.nlargest(N, 'col')` |
4646+| drop rows with missing in cols | `tidyblocks_transform_dropna` | — (`tidyr::drop_na`) | Remove rows that have missing values in the specified columns | `_df = _df.dropna(subset=[...])` |
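
A minimal sketch of how the slicing calls in the table above behave, assuming pandas is installed. The sample data and column names are hypothetical. One difference to keep in mind: dplyr's `slice_min()` / `slice_max()` keep ties by default, while `DataFrame.nsmallest` / `nlargest` return exactly N rows unless `keep='all'` is passed.

```python
import pandas as pd

# Hypothetical sample data; "name" and "mass" are illustrative column names.
_df = pd.DataFrame({"name": ["a", "b", "c", "d"], "mass": [3, 1, 1, 2]})

# The N rows with the smallest values (ties broken by first occurrence).
smallest = _df.nsmallest(1, "mass")

# keep="all" retains every tied row, which is closer to dplyr's default.
smallest_with_ties = _df.nsmallest(1, "mass", keep="all")

# The N rows with the largest values.
largest = _df.nlargest(2, "mass")
```

Whether the generator should pass `keep='all'` is a design choice this sketch does not settle; it only shows the two behaviours side by side.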
4747+4848+### Column operations
4949+5050+| Block label | Block type | dplyr equivalent | What it does | Python generated |
5151+|---|---|---|---|---|
5252+| select columns cols | `tidyblocks_transform_select` | `select()` | Keep only the named columns | `_df = _df[[...]]` |
5353+| drop columns cols | `tidyblocks_transform_drop` | `select(-col)` | Remove the named columns | `_df = _df.drop(columns=[...])` |
5454+| mutate col = expr | `tidyblocks_transform_mutate` | `mutate()` | Add a new column or overwrite an existing one with an expression | `_df = _df.assign(**{'col': expr})` |
5555+| rename old to new | `tidyblocks_transform_rename` | `rename()` | Rename a single column | `_df = _df.rename(columns={'old': 'new'})` |
5656+| relocate cols before/after anchor | `tidyblocks_transform_relocate` | `relocate()` | Move one or more columns to a new position relative to an anchor column | reorders `_df.columns` |
5757+| fill missing in col with val | `tidyblocks_transform_fillna` | — (`tidyr::replace_na`) | Replace missing values in a column with a given value | `_df = _df.assign(**{'col': _df['col'].fillna(val)})` |
5858+5959+### Grouping & aggregation
6060+6161+| Block label | Block type | dplyr equivalent | What it does | Python generated |
6262+|---|---|---|---|---|
6363+| group by cols | `tidyblocks_transform_groupby` | `group_by()` | Group rows by the values in one or more columns for use with summarize or running | `_df = _df.groupby([...], as_index=False)` |
6464+| ungroup | `tidyblocks_transform_ungroup` | `ungroup()` | Remove grouping and reset the row index | `_df = _df.reset_index(drop=True)` |
6565+| summarize fn of col as result | `tidyblocks_transform_summarize` | `summarize()` | Aggregate each group (or the whole DataFrame) to a single row using count / sum / mean / median / min / max / std / var / n distinct / any / all | `_df = _df.agg(**{'result': ('col', 'fn')}).reset_index()` |
6666+| count by cols | `tidyblocks_transform_count` | `count()` | Count rows for each unique combination of the specified columns | `_df = _df.groupby([...], as_index=False).size().rename(columns={'size': 'n'})` |
6767+| running fn of col as result | `tidyblocks_transform_running` | — (window fns) | Compute a cumulative operation (cumsum / cummax / cummin / cummean / row index) across rows | `_df = _df.assign(**{'result': _df['col'].cumsum()})` etc. |
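
The **count** pattern and the `n distinct` aggregate can be tried outside the editor with a sketch like the one below. It assumes pandas is installed; the data, column names, and the result name `n_islands` are hypothetical, and the exact code the generator emits may differ.

```python
import pandas as pd

# Hypothetical sample data.
_df = pd.DataFrame(
    {
        "species": ["adelie", "adelie", "gentoo"],
        "island": ["x", "y", "x"],
    }
)

# count by species: one row per group, with the row count in column "n".
counts = (
    _df.groupby(["species"], as_index=False)
    .size()
    .rename(columns={"size": "n"})
)

# summarize "n distinct" of island per species, using named aggregation.
distinct = _df.groupby(["species"], as_index=False).agg(
    **{"n_islands": ("island", "nunique")}
)
```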
6868+6969+### Utilities
7070+7171+| Block label | Block type | dplyr equivalent | What it does | Python generated |
7272+|---|---|---|---|---|
7373+| bin col into N buckets as result | `tidyblocks_transform_bin` | — (`cut()`) | Discretize a numeric column into N equal-width interval buckets | `_df = _df.assign(**{'result': pd.cut(_df['col'], bins=N).astype(str)})` |
7474+| save as name | `tidyblocks_transform_saveas` | — | Copy the current DataFrame into a named Python variable for later use with **dataset named** | `name = _df.copy()` |
7575+| display table | `tidyblocks_transform_display` | — | Render the current DataFrame as an HTML table in the output cell | `display(_df)` |
7676+7777+---
7878+7979+## Combine `#808080`
8080+8181+Combine blocks merge the current `_df` with a second DataFrame that was
8282+previously saved using **save as**.
8383+8484+### Mutating joins (add columns from the right table)
8585+8686+| Block label | Block type | dplyr equivalent | What it does | Python generated |
8787+|---|---|---|---|---|
8888+| inner/left/right/outer join other on left col = right col | `tidyblocks_combine_join` | `inner_join()` / `left_join()` / `right_join()` / `full_join()` | Join two DataFrames on matching key columns. Choose inner (only matching rows), left (all left rows), right (all right rows), or outer (all rows from both) | `_df = pd.merge(_df, other, left_on='lk', right_on='rk', how='...')` |
8989+| cross join with other | `tidyblocks_combine_cross_join` | `cross_join()` | Cartesian product — every row in `_df` paired with every row in `other` | `_df = _df.merge(other, how='cross')` |
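
To see how the four `how` options and the cross join differ, here is a small sketch using the `pd.merge` calls from the table. It assumes pandas 1.2 or later (for `how='cross'`); the two DataFrames and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; keys 2 and 3 appear in both tables.
_df = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
other = pd.DataFrame({"key": [2, 3, 4], "y": [20, 30, 40]})

# inner: only rows whose keys match on both sides.
inner = pd.merge(_df, other, left_on="id", right_on="key", how="inner")

# left: every row of _df, with missing values where other has no match.
left = pd.merge(_df, other, left_on="id", right_on="key", how="left")

# outer: every row from both tables.
outer = pd.merge(_df, other, left_on="id", right_on="key", how="outer")

# cross: every row of _df paired with every row of other.
cross = _df.merge(other, how="cross")
```

Note that when the key columns have different names, pandas keeps both of them in the result.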
9090+9191+### Filtering joins (keep/remove rows based on a match, no new columns)
9292+9393+| Block label | Block type | dplyr equivalent | What it does | Python generated |
9494+|---|---|---|---|---|
9595+| semi join other on left col = right col | `tidyblocks_combine_semi_join` | `semi_join()` | Keep only rows in `_df` that have a matching key in `other`. No columns from `other` are added. | `_df = _df[_df['lk'].isin(other['rk'])]` |
9696+| anti join other on left col = right col | `tidyblocks_combine_anti_join` | `anti_join()` | Keep only rows in `_df` that have **no** matching key in `other` | `_df = _df[~_df['lk'].isin(other['rk'])]` |
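
The two filtering joins can be checked with a short sketch that reuses the `isin` expressions from the table. It assumes pandas is installed; the sample data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; note the duplicate key 4 in `other`.
_df = pd.DataFrame({"id": [1, 2, 3, 4], "species": ["a", "b", "c", "d"]})
other = pd.DataFrame({"id": [2, 4, 4, 5]})

# semi join: keep rows of _df whose key appears in other. No columns are
# added, and duplicate keys in other do not duplicate rows of _df.
semi = _df[_df["id"].isin(other["id"])]

# anti join: keep rows of _df whose key does NOT appear in other.
anti = _df[~_df["id"].isin(other["id"])]
```

Using `isin` rather than `merge` is what keeps the row count from growing when the right table has repeated keys.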
9797+9898+### Binding (stack or glue tables together)
9999+100100+| Block label | Block type | dplyr equivalent | What it does | Python generated |
101101+|---|---|---|---|---|
102102+| bind_rows with other label column src | `tidyblocks_combine_bind_rows` | `bind_rows()` | Vertically stack `_df` on top of `other`, adding a label column to identify the source of each row | `_df = pd.concat([_df.assign(src='left'), other.assign(src='right')]).reset_index(drop=True)` |
103103+| bind_cols with other | `tidyblocks_combine_bind_cols` | `bind_cols()` | Horizontally bind `_df` and `other` by column position. Both tables must have the same number of rows. | `_df = pd.concat([_df, other], axis=1)` |
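
A sketch of both binding patterns, assuming pandas is installed and using hypothetical data. It also shows one caveat: `pd.concat(..., axis=1)` aligns rows by index label rather than by position, so two tables whose indexes differ (for example after a filter) do not line up unless the indexes are reset first. The `reset_index` calls in the last step are an addition in this sketch, not something the table above says the generator emits.

```python
import pandas as pd

# bind_rows: stack two tables and label the source of each row.
top = pd.DataFrame({"x": [1, 2]})
bottom = pd.DataFrame({"x": [3]})
stacked = pd.concat(
    [top.assign(src="left"), bottom.assign(src="right")]
).reset_index(drop=True)

# bind_cols: two tables with the same number of rows but different indexes.
a = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"y": [10, 20]}, index=[1, 2])

# Aligned by index label: the union of labels gives three rows with gaps.
by_label = pd.concat([a, b], axis=1)

# Resetting both indexes first gives a positional bind.
by_position = pd.concat(
    [a.reset_index(drop=True), b.reset_index(drop=True)], axis=1
)
```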
104104+105105+---
106106+107107+## Plot `#A4C588`
108108+109109+Plot blocks are terminal — they render a chart and have no bottom connector.
110110+All plots use [Plotly Express](https://plotly.com/python/plotly-express/).
111111+112112+| Block label | Block type | dplyr equivalent | What it does | Python generated |
113113+|---|---|---|---|---|
114114+| bar chart x col y col | `tidyblocks_plot_bar` | — | Vertical bar chart | `px.bar(_df, x='col', y='col')` |
115115+| box plot x col y col | `tidyblocks_plot_box` | — | Box-and-whisker plot showing median, IQR, and outliers | `px.box(_df, x='col', y='col')` |
116116+| dot plot x col | `tidyblocks_plot_dot` | — | Strip/dot plot — one point per row along an axis | `px.strip(_df, x='col')` |
117117+| histogram of col bins N | `tidyblocks_plot_histogram` | — | Frequency histogram with N bins | `px.histogram(_df, x='col', nbins=N)` |
118118+| scatter plot x col y col color col trendline ☐ | `tidyblocks_plot_scatter` | — | Scatter plot with optional color grouping and OLS trendline | `px.scatter(_df, x=..., y=..., color=..., trendline=...)` |
119119+| line chart x col y col color col | `tidyblocks_plot_line` | — | Line chart with optional color grouping | `px.line(_df, x=..., y=..., color=...)` |
120120+| violin plot x col y col | `tidyblocks_plot_violin` | — | Violin plot showing the distribution shape | `px.violin(_df, x='col', y='col')` |
121121+| correlation heatmap | `tidyblocks_plot_heatmap` | — | Heatmap of pairwise Pearson correlations between all numeric columns | `px.imshow(_df.corr())` |
122122+123123+---
124124+125125+## Stats `#BA93DB`
126126+127127+Stats blocks are terminal — they print results and have no bottom connector.
128128+All stats use [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html)
129129+and [scikit-learn](https://scikit-learn.org/).
130130+131131+| Block label | Block type | dplyr equivalent | What it does | Python generated |
132132+|---|---|---|---|---|
133133+| one-sample t-test column col vs mean μ | `tidyblocks_stats_ttest_one` | — | Two-sided one-sample t-test: tests whether the column mean equals μ. Prints t-statistic and p-value. | `stats.ttest_1samp(_df['col'], μ)` |
134134+| two-sample t-test groups in group_col values in val_col | `tidyblocks_stats_ttest_two` | — | Two-sided two-sample t-test: splits rows into two groups and tests whether their means differ | `stats.ttest_ind(group_a, group_b)` |
135135+| k-means x col y col k N label col | `tidyblocks_stats_kmeans` | — | K-means clustering on two columns; adds a cluster label column to `_df` | `KMeans(n_clusters=N).fit_predict(...)` |
136136+| silhouette score x col y col labels col score col | `tidyblocks_stats_silhouette` | — | Computes the silhouette coefficient for existing cluster labels; adds a score column | `silhouette_score(X, labels)` |
137137+| Pearson/Spearman/Kendall correlation of col_a and col_b | `tidyblocks_stats_correlation` | — | Computes pairwise correlation coefficient and p-value between two columns | `stats.pearsonr / spearmanr / kendalltau` |
138138+| describe | `tidyblocks_stats_describe` | — | Prints `DataFrame.describe()` — count, mean, std, min, quartiles, max for every numeric column | `display(_df.describe())` |
139139+140140+---
141141+142142+## Values `#E7553C`
143143+144144+Value blocks are expression blocks — they produce a value and connect into
145145+input slots on transform or operation blocks. They do not have statement
146146+connectors.
147147+148148+| Block label | Block type | dplyr equivalent | What it does | Python generated |
149149+|---|---|---|---|---|
150150+| column col | `tidyblocks_value_column` | — | Reference a DataFrame column by name | `_df['col']` |
151151+| number | `tidyblocks_value_number` | — | A numeric literal | `0`, `3.14`, etc. |
152152+| "text" | `tidyblocks_value_text` | — | A string literal | `'text'` |
153153+| true / false | `tidyblocks_value_logical` | — | A boolean literal | `True` / `False` |
154154+| date YYYY-MM-DD | `tidyblocks_value_datetime` | — | A date/time constant | `pd.Timestamp('YYYY-MM-DD')` |
155155+| missing | `tidyblocks_value_missing` | `NA` | An explicit missing (NA/NaN) value | `float("nan")` |
156156+| Normal(mean μ std σ) | `tidyblocks_value_normal` | — | Draw a column of values from a Normal distribution | `np.random.normal(μ, σ, len(_df))` |
157157+| Uniform(low a high b) | `tidyblocks_value_uniform` | — | Draw a column of values from a Uniform distribution | `np.random.uniform(a, b, len(_df))` |
158158+| Exponential(lambda λ) | `tidyblocks_value_exponential` | — | Draw a column of values from an Exponential distribution | `np.random.exponential(1/λ, len(_df))` |
159159+160160+---
161161+162162+## Operations `#F9B5B2`
163163+164164+Operation blocks are expression blocks used inside **filter**, **mutate**,
165165+**fill missing**, and similar blocks. They take value inputs and return a
166166+computed value.
167167+168168+### Numeric & comparison
169169+170170+| Block label | Block type | dplyr equivalent | What it does | Python generated |
171171+|---|---|---|---|---|
172172+| a + b, a - b, a × b, a ÷ b, a % b, a ** b | `tidyblocks_op_arithmetic` | — | Standard arithmetic on two values | `(a + b)`, `(a * b)`, etc. |
173173+| a = b, a ≠ b, a < b, a ≤ b, a > b, a ≥ b | `tidyblocks_op_compare` | — | Element-wise comparison, returning a boolean column | `(a == b)`, `(a < b)`, etc. |
174174+| x between left and right | `tidyblocks_op_between` | `between()` | Return `True` for values within the inclusive range `[left, right]` | `x.between(left, right)` |
175175+| abs / round / floor / ceil / sqrt / log / exp ( val ) | `tidyblocks_op_math` | — | Apply a standard math function to a column | `val.abs()`, `np.sqrt(val)`, etc. |
176176+177177+### Logic
178178+179179+| Block label | Block type | dplyr equivalent | What it does | Python generated |
180180+|---|---|---|---|---|
181181+| a AND b / a OR b | `tidyblocks_op_logic` | — | Element-wise logical AND/OR on two boolean columns | `(a & b)`, `(a \| b)` |
182182+| NOT val | `tidyblocks_op_not` | — | Element-wise logical NOT | `~(val)` |
183183+| if cond then x else y | `tidyblocks_op_ifelse` | `if_else()` | Return `x` where `cond` is `True`, `y` elsewhere | `np.where(cond, x, y)` |
184184+| coalesce val with replacement | `tidyblocks_op_coalesce` | `coalesce()` | Replace missing values in `val` with values from `replacement` | `val.fillna(replacement)` |
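
The **if … then … else** and **coalesce** expressions from the table can be sketched as follows, assuming numpy and pandas are installed. The series and the labels `"missing"` / `"ok"` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, and a replacement column.
val = pd.Series([1.0, np.nan, 3.0])
replacement = pd.Series([9.0, 8.0, 7.0])

# coalesce: take val where present, otherwise the aligned replacement value.
coalesced = val.fillna(replacement)

# if_else: choose element-wise between two values based on a condition.
label = np.where(val.isna(), "missing", "ok")
```

`np.where` returns a NumPy array rather than a Series; assigning it to a column with `assign` works as long as its length matches the DataFrame.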
185185+186186+### Type operations
187187+188188+| Block label | Block type | dplyr equivalent | What it does | Python generated |
189189+|---|---|---|---|---|
190190+| val is missing / is number / is text / is date / is boolean | `tidyblocks_op_typecheck` | — | Test whether each element matches a specific type | `val.isna()`, `val.apply(lambda v: isinstance(v, ...))`, etc. |
191191+| convert val to number / text / bool / datetime | `tidyblocks_op_convert` | — | Cast a column to a different type | `pd.to_numeric(val)`, `val.astype(str)`, etc. |
192192+193193+### Date & time
194194+195195+| Block label | Block type | dplyr equivalent | What it does | Python generated |
196196+|---|---|---|---|---|
197197+| year / month / day / weekday / hour / minute / second of val | `tidyblocks_op_datetime` | — | Extract a calendar component from a datetime column | `val.dt.year`, `val.dt.month`, etc. |
198198+199199+### Window & ranking
200200+201201+| Block label | Block type | dplyr equivalent | What it does | Python generated |
202202+|---|---|---|---|---|
203203+| shift val by N | `tidyblocks_op_shift` | `lag()` / `lead()` | Shift values forward (positive N = lag) or backward (negative N = lead) | `val.shift(N)` |
204204+| n_distinct val | `tidyblocks_op_n_distinct` | `n_distinct()` | Count the number of distinct (unique) values in a column | `val.nunique()` |
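
A short sketch of the two expressions above, assuming pandas is installed and using a hypothetical series. It also illustrates a difference from dplyr: `n_distinct()` counts a missing value as a distinct value by default, whereas `Series.nunique()` ignores missing values unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical column with a repeated value and one missing value.
val = pd.Series([10, 20, 20, None])

# Positive N shifts values down (lag); the first row becomes missing.
lagged = val.shift(1)

# Negative N shifts values up (lead); the last row becomes missing.
led = val.shift(-1)

# Distinct counts, excluding and including the missing value.
distinct_default = val.nunique()
distinct_with_na = val.nunique(dropna=False)
```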
205205+206206+### String
207207+208208+| Block label | Block type | dplyr equivalent | What it does | Python generated |
209209+|---|---|---|---|---|
210210+| val . upper / lower / strip / length | `tidyblocks_op_string` | — | Apply a string operation to a text column | `val.str.upper()`, `val.str.len()`, etc. |
211211+| val contains pattern | `tidyblocks_op_str_contains` | `stringr::str_detect()` | Return `True` where the string column matches a pattern | `val.str.contains('pattern', na=False)` |
212212+213213+---
214214+215215+## Pipeline rules
216216+217217+| Block role | Has top connector | Has bottom connector | Examples |
218218+|---|---|---|---|
219219+| **Source** | No | Yes | all Data blocks |
220220+| **Transform** | Yes | Yes | filter, mutate, arrange, … |
221221+| **Terminal** | Yes | No | display table, all Plot blocks, all Stats blocks |
222222+223223+A valid pipeline must be: **one source → zero or more transforms → one terminal**.
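
The rule above can be expressed as a small check. This is only a sketch: the function name `is_valid_pipeline` and the role strings are hypothetical and are not part of the package's API.

```python
from typing import Sequence


def is_valid_pipeline(roles: Sequence[str]) -> bool:
    """Return True for: one source, zero or more transforms, one terminal.

    `roles` lists the role of each block from top to bottom, using the
    hypothetical labels "source", "transform", and "terminal".
    """
    if len(roles) < 2:
        return False  # a source and a terminal are both required
    return (
        roles[0] == "source"
        and roles[-1] == "terminal"
        and all(role == "transform" for role in roles[1:-1])
    )


# Example: penguins dataset -> filter -> bar chart.
ok = is_valid_pipeline(["source", "transform", "terminal"])
```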
+76-2
docs/work-summary.md
···211211212212---
213213214214-## 11. Documentation
214214+## 12. npm migration
215215+216216+**Problem:** The project used Yarn 4 (`packageManager: "yarn@4.6.0"`) but a
217217+stray `package-lock.json` and npm-installed `node_modules` had accumulated
218218+alongside it, creating an inconsistent state.
219219+220220+**What was done:**
221221+- Replaced `packageManager: "yarn@4.6.0"` with `"npm@11.1.0"`.
222222+- Converted `"resolutions"` → `"overrides"` (npm 8.3+ equivalent).
223223+- Converted `"workspaces"` from Yarn's `{ "packages": [...] }` object form to
224224+ npm's array form `["packages/*"]`.
225225+- Replaced all `jlpm` references in `packages/blockly-extension/package.json`
226226+ scripts with `npm run`.
227227+- Replaced `jlpm` references in root `lint` / `prettier` scripts with
228228+ `npm run`.
229229+- Deleted `.yarnrc.yml`, `.yarn/` cache directory, and `yarn.lock`.
230230+- Updated the `.gitignore` entries for `node_modules/`, `package-lock.json`,
231231+ and `yarn.lock`; the last is ignored because it is a build side-effect of
232232+ `@jupyterlab/builder`'s bundled `jlpm` and cannot be avoided.
233233+234234+**Note:** `jlpm` (a yarn shim bundled inside the `jupyterlab` Python package)
235235+is called internally by `jupyter labextension build` and will always
236236+regenerate `yarn.lock` during a build. This is an implementation detail of
237237+`@jupyterlab/builder` that cannot be configured away; the file is gitignored.
238238+239239+---
240240+241241+## 13. dplyr alignment and new blocks
242242+243243+**Motivation:** dplyr (R tidyverse) is the reference vocabulary for tidy-data
244244+analysis. Aligning block names to dplyr verbs makes the extension more
245245+intuitive for data scientists familiar with either R or the tidy-data
246246+paradigm.
247247+248248+### Renames (7 blocks)
249249+250250+| Old block label | New block label | dplyr verb |
251251+|---|---|---|
252252+| `create column` | `mutate` | `mutate()` |
253253+| `sort by` | `arrange by` | `arrange()` |
254254+| `unique by` | `distinct by` | `distinct()` |
255255+| `first N rows` | `slice_head N rows` | `slice_head()` |
256256+| `last N rows` | `slice_tail N rows` | `slice_tail()` |
257257+| `sample N rows` | `slice_sample N rows` | `slice_sample()` |
258258+| `glue with` | `bind_rows with` | `bind_rows()` |
259259+260260+Internal block type names were updated to match
261261+(e.g. `tidyblocks_transform_create` → `tidyblocks_transform_mutate`).
262262+263263+### New blocks (10 blocks)
264264+265265+**Transform**
266266+- `count by cols` — `count()`: count rows per combination of columns
267267+- `relocate cols before/after anchor` — `relocate()`: move columns to a new position
268268+- `slice_min N rows by col` — `slice_min()`: keep N rows with smallest values
269269+- `slice_max N rows by col` — `slice_max()`: keep N rows with largest values
270270+271271+**Combine**
272272+- `semi join` — `semi_join()`: filtering join, keep matched rows (no new columns)
273273+- `anti join` — `anti_join()`: filtering join, keep unmatched rows
274274+- `bind_cols with` — `bind_cols()`: horizontally bind two DataFrames by column position
275275+276276+**Operations**
277277+- `between left and right` — `between()`: inclusive range check
278278+- `coalesce val with replacement` — `coalesce()`: first non-missing value
279279+- `n_distinct val` — `n_distinct()`: count unique values
280280+281281+Also added `n distinct` as an option to the **summarize** block's function dropdown.
282282+283283+---
284284+285285+## 14. Documentation
286286+287287+(was §11)
215288216289| Document | Description |
217290|---|---|
218291| `docs/getting-started.md` | Step-by-step guide: install, launch JupyterLab, create a `.jpblockly` file, build a penguins pipeline, run it, and see output. |
219292| `docs/modernization-plan.md` | Updated to reflect completed phases, corrected version numbers (JupyterLab 4.5 not 4.6), and added a new Phase 6 documenting all the fixes in this work. |
220220-| `docs/architecture.md` *(this work)* | Full architecture reference: package layout, data-flow diagram, object relationships, block pipeline conventions, code generation pattern, and extension points. |
293293+| `docs/architecture.md` | Full architecture reference: package layout, data-flow diagram, object relationships, block pipeline conventions, code generation pattern, and extension points. |
294294+| `docs/blocks-reference.md` | Complete block reference: every block organized by category, with its block type name, dplyr equivalent, description, and generated Python. |
221295| `CHANGELOG.md` | Rewrote the `0.1.0` entry with accurate versions and full sections covering all new features, the rebrand, dependency upgrades, build fixes, bug fixes, and docs. |
···111111 // Handle state restoration.
112112 if (restorer) {
113113 // When restoring the app, if the document was open, reopen it
114114- restorer.restore(tracker, {
114114+ restorer.restore(tracker as any, {
115115 command: 'docmanager:open',
116116- args: widget => ({ path: widget.context.path, factory: FACTORY }),
117117- name: widget => widget.context.path
116116+ args: (widget: any) => ({ path: widget.context.path, factory: FACTORY }),
117117+ name: (widget: any) => widget.context.path
118118 });
119119 }
120120
+47-2
packages/tidyblocks/src/blocks/combine.ts
···991010Blockly.defineBlocksWithJsonArray([
1111 {
1212+ // dplyr: inner_join / left_join / right_join / full_join
1213 type: 'tidyblocks_combine_join',
1314 message0: '%1 join %2 on left %3 = right %4',
1415 args0: [
···2829 'Join the current DataFrame with a named DataFrame on matching columns.'
2930 },
3031 {
3131- type: 'tidyblocks_combine_glue',
3232- message0: 'glue with %1 label column %2',
3232+ // dplyr: bind_rows() — vertically stack two DataFrames
3333+ type: 'tidyblocks_combine_bind_rows',
3434+ message0: 'bind_rows with %1 label column %2',
3335 args0: [
3436 { type: 'field_input', name: 'OTHER_DF', text: 'other_df' },
3537 { type: 'field_input', name: 'LABEL_COL', text: 'source' }
···4143 'Vertically stack the current DataFrame with another, adding a label column.'
4244 },
4345 {
4646+ // dplyr: cross_join() — Cartesian product
4447 type: 'tidyblocks_combine_cross_join',
4548 message0: 'cross join with %1',
4649 args0: [{ type: 'field_input', name: 'RIGHT_DF', text: 'other_df' }],
···4851 nextStatement: null,
4952 colour: '#808080',
5053 tooltip: 'Cartesian product of the current DataFrame with another.'
5454+ },
5555+ {
5656+ // dplyr: semi_join() — keep rows in _df that have a match in other_df
5757+ // (no columns from other_df are added)
5858+ type: 'tidyblocks_combine_semi_join',
5959+ message0: 'semi join %1 on left %2 = right %3',
6060+ args0: [
6161+ { type: 'field_input', name: 'RIGHT_DF', text: 'other_df' },
6262+ { type: 'field_input', name: 'LEFT_ON', text: 'id' },
6363+ { type: 'field_input', name: 'RIGHT_ON', text: 'id' }
6464+ ],
6565+ previousStatement: null,
6666+ nextStatement: null,
6767+ colour: '#808080',
6868+ tooltip:
6969+ 'Keep only rows from the current DataFrame that have a match in the other DataFrame. No columns from the other DataFrame are added.'
7070+ },
7171+ {
7272+ // dplyr: anti_join() — keep rows in _df that have NO match in other_df
7373+ type: 'tidyblocks_combine_anti_join',
7474+ message0: 'anti join %1 on left %2 = right %3',
7575+ args0: [
7676+ { type: 'field_input', name: 'RIGHT_DF', text: 'other_df' },
7777+ { type: 'field_input', name: 'LEFT_ON', text: 'id' },
7878+ { type: 'field_input', name: 'RIGHT_ON', text: 'id' }
7979+ ],
8080+ previousStatement: null,
8181+ nextStatement: null,
8282+ colour: '#808080',
8383+ tooltip:
8484+ 'Keep only rows from the current DataFrame that have no match in the other DataFrame.'
8585+ },
8686+ {
8787+ // dplyr: bind_cols() — horizontally bind two DataFrames by column position
8888+ type: 'tidyblocks_combine_bind_cols',
8989+ message0: 'bind_cols with %1',
9090+ args0: [{ type: 'field_input', name: 'OTHER_DF', text: 'other_df' }],
9191+ previousStatement: null,
9292+ nextStatement: null,
9393+ colour: '#808080',
9494+ tooltip:
9595+ 'Horizontally bind the current DataFrame with another by column position. Both must have the same number of rows.'
5196 }
5297]);
+37
packages/tidyblocks/src/blocks/op.ts
···223223 colour: '#F9B5B2',
224224 inputsInline: true,
225225 tooltip: 'Check whether a string column contains a pattern.'
226226+ },
227227+ // dplyr: between() — test whether values fall within an inclusive range
228228+ {
229229+ type: 'tidyblocks_op_between',
230230+ message0: '%1 between %2 and %3',
231231+ args0: [
232232+ { type: 'input_value', name: 'VALUE' },
233233+ { type: 'field_number', name: 'LEFT', value: 0 },
234234+ { type: 'field_number', name: 'RIGHT', value: 1 }
235235+ ],
236236+ output: 'Boolean',
237237+ colour: '#F9B5B2',
238238+ inputsInline: true,
239239+ tooltip: 'Return True for values within the inclusive range [left, right].'
240240+ },
241241+ // dplyr: coalesce() — return the first non-missing value across columns
242242+ {
243243+ type: 'tidyblocks_op_coalesce',
244244+ message0: 'coalesce %1 with %2',
245245+ args0: [
246246+ { type: 'input_value', name: 'VALUE' },
247247+ { type: 'input_value', name: 'REPLACEMENT' }
248248+ ],
249249+ output: null,
250250+ colour: '#F9B5B2',
251251+ inputsInline: true,
252252+ tooltip: 'Replace missing values in a column with values from another column or expression.'
253253+ },
254254+ // dplyr: n_distinct() — count of unique values in a column
255255+ {
256256+ type: 'tidyblocks_op_n_distinct',
257257+ message0: 'n_distinct %1',
258258+ args0: [{ type: 'input_value', name: 'VALUE' }],
259259+ output: 'Number',
260260+ colour: '#F9B5B2',
261261+ inputsInline: true,
262262+ tooltip: 'Count the number of distinct (unique) values in a column.'
226263 }
227264]);
+81-12
packages/tidyblocks/src/blocks/transform.ts
···99 ['max', 'max'],
1010 ['std dev', 'std'],
1111 ['variance', 'var'],
1212+ ['n distinct', 'nunique'],
1213 ['any', 'any'],
1314 ['all', 'all']
1415];
···3233 tooltip: 'Keep only rows matching a condition.'
3334 },
3435 {
3636+ // dplyr: select() — keep named columns
3537 type: 'tidyblocks_transform_select',
3638 message0: 'select columns %1',
3739 args0: [{ type: 'field_input', name: 'COLUMNS', text: 'col1, col2' }],
···5052 tooltip: 'Remove the specified columns (comma-separated).'
5153 },
5254 {
5353- type: 'tidyblocks_transform_create',
5454- message0: 'create column %1 as %2',
5555+ // dplyr: mutate() — create or overwrite a column
5656+ type: 'tidyblocks_transform_mutate',
5757+ message0: 'mutate %1 = %2',
5558 args0: [
5659 { type: 'field_input', name: 'COLUMN', text: 'new_col' },
5760 { type: 'input_value', name: 'EXPRESSION' }
···6265 tooltip: 'Add or replace a column using an expression.'
6366 },
6467 {
6868+ // dplyr: rename() — rename a single column
6569 type: 'tidyblocks_transform_rename',
6670 message0: 'rename %1 to %2',
6771 args0: [
···7478 tooltip: 'Rename a column.'
7579 },
7680 {
7777- type: 'tidyblocks_transform_sort',
7878- message0: 'sort by %1 %2',
8181+ // dplyr: arrange() — order rows by column values
8282+ type: 'tidyblocks_transform_arrange',
8383+ message0: 'arrange by %1 %2',
7984 args0: [
8085 { type: 'field_input', name: 'COLUMNS', text: 'col1' },
8186 {
···9398 tooltip: 'Sort rows by one or more columns (comma-separated).'
9499 },
95100 {
9696- type: 'tidyblocks_transform_unique',
9797- message0: 'unique by %1',
101101+ // dplyr: distinct() — keep unique rows
102102+ type: 'tidyblocks_transform_distinct',
103103+ message0: 'distinct by %1',
98104 args0: [{ type: 'field_input', name: 'COLUMNS', text: 'col1' }],
99105 previousStatement: null,
100106 nextStatement: null,
···102108 tooltip: 'Keep only rows with distinct values in the specified columns.'
103109 },
104110 {
111111+ // dplyr: group_by() — group rows by column values
105112 type: 'tidyblocks_transform_groupby',
106113 message0: 'group by %1',
107114 args0: [{ type: 'field_input', name: 'COLUMNS', text: 'col1' }],
···111118 tooltip: 'Group rows by column values for subsequent summarize or running.'
112119 },
113120 {
121121+ // dplyr: ungroup() — remove grouping
114122 type: 'tidyblocks_transform_ungroup',
115123 message0: 'ungroup',
116124 previousStatement: null,
···119127 tooltip: 'Remove grouping and reset the index.'
120128 },
121129 {
130130+ // dplyr: summarize() — aggregate groups to one row each
122131 type: 'tidyblocks_transform_summarize',
123132 message0: 'summarize %1 of %2 as %3',
124133 args0: [
···196205 tooltip: 'Drop rows that have missing values in the specified columns.'
197206 },
198207 {
199199- type: 'tidyblocks_transform_sample',
200200- message0: 'sample %1 rows',
208208+ // dplyr: slice_sample() — random sample of N rows
209209+ type: 'tidyblocks_transform_slice_sample',
210210+ message0: 'slice_sample %1 rows',
201211 args0: [
202212 { type: 'field_number', name: 'N', value: 10, min: 1, precision: 1 }
203213 ],
···207217 tooltip: 'Randomly sample N rows from the DataFrame.'
208218 },
209219 {
210210- type: 'tidyblocks_transform_head',
211211- message0: 'first %1 rows',
220220+ // dplyr: slice_head() — first N rows
221221+ type: 'tidyblocks_transform_slice_head',
222222+ message0: 'slice_head %1 rows',
212223 args0: [
213224 { type: 'field_number', name: 'N', value: 10, min: 1, precision: 1 }
214225 ],
···218229 tooltip: 'Keep only the first N rows.'
219230 },
220231 {
221221- type: 'tidyblocks_transform_tail',
222222- message0: 'last %1 rows',
232232+ // dplyr: slice_tail() — last N rows
233233+ type: 'tidyblocks_transform_slice_tail',
234234+ message0: 'slice_tail %1 rows',
223235 args0: [
224236 { type: 'field_number', name: 'N', value: 10, min: 1, precision: 1 }
225237 ],
···227239 nextStatement: null,
228240 colour: '#76AADB',
229241 tooltip: 'Keep only the last N rows.'
242242+ },
243243+ {
244244+ // dplyr: slice_min() — N rows with smallest values in a column
245245+ type: 'tidyblocks_transform_slice_min',
246246+ message0: 'slice_min %1 rows by %2',
247247+ args0: [
248248+ { type: 'field_number', name: 'N', value: 5, min: 1, precision: 1 },
249249+ { type: 'field_input', name: 'COLUMN', text: 'col1' }
250250+ ],
251251+ previousStatement: null,
252252+ nextStatement: null,
253253+ colour: '#76AADB',
254254+ tooltip: 'Keep the N rows with the smallest values in a column.'
255255+ },
256256+ {
257257+ // dplyr: slice_max() — N rows with largest values in a column
258258+ type: 'tidyblocks_transform_slice_max',
259259+ message0: 'slice_max %1 rows by %2',
260260+ args0: [
261261+ { type: 'field_number', name: 'N', value: 5, min: 1, precision: 1 },
262262+ { type: 'field_input', name: 'COLUMN', text: 'col1' }
263263+ ],
264264+ previousStatement: null,
265265+ nextStatement: null,
266266+ colour: '#76AADB',
267267+ tooltip: 'Keep the N rows with the largest values in a column.'
268268+ },
269269+ {
270270+ // dplyr: count() — count rows per group (or total if ungrouped)
271271+ type: 'tidyblocks_transform_count',
272272+ message0: 'count by %1',
273273+ args0: [{ type: 'field_input', name: 'COLUMNS', text: 'col1' }],
274274+ previousStatement: null,
275275+ nextStatement: null,
276276+ colour: '#76AADB',
277277+ tooltip: 'Count rows for each combination of the specified columns.'
278278+ },
279279+ {
280280+ // dplyr: relocate() — move column(s) to before or after another column
281281+ type: 'tidyblocks_transform_relocate',
282282+ message0: 'relocate %1 %2 %3',
283283+ args0: [
284284+ { type: 'field_input', name: 'COLUMNS', text: 'col1' },
285285+ {
286286+ type: 'field_dropdown',
287287+ name: 'POSITION',
288288+ options: [
289289+ ['before', 'before'],
290290+ ['after', 'after']
291291+ ]
292292+ },
293293+ { type: 'field_input', name: 'ANCHOR', text: 'col2' }
294294+ ],
295295+ previousStatement: null,
296296+ nextStatement: null,
297297+ colour: '#76AADB',
298298+ tooltip: 'Move one or more columns to before or after a reference column.'
230299 },
231300 {
232301 type: 'tidyblocks_transform_display',