## About Me

I am a data analyst with experience across the non-profit and public sectors. The most important part of my work, especially in my public sector roles, is using data to build a deep understanding of the domains I work in, which both informs my own analysis and helps the audience develop meaningful narratives.
{
  "hash": "72ff8a5760dd49f80c2a9695ad4b4eb8",
  "result": {
    "engine": "knitr",
    "markdown": "---\ntitle: \"R, DuckDB and Me\"\nauthor: Rory Lawless\ndate: 2025-03-30\nlastmod: 2025-03-31\nformat: html\ndraft: true\n---\n\nOver the past year, [DuckDB](https://duckdb.org/docs/stable/clients/r) has gradually become an important part of my data science workflow - at first clumsily, then seamlessly. I don’t typically work with large datasets; however, integrating DuckDB has addressed some of my frustrations, especially when dealing with hardware limitations and moderately sized but inefficiently stored data. With this in mind, here are two major benefits I’ve found since integrating DuckDB into my workflow.\n\n## Handling larger-than-memory data\n\nAs noted, I don't work with very large data often but I still run into annoying issues caused by repeated reloading of data after making mistakes - a habit I call Read-Error-Reread (RERe? Let’s make it happen!). Now, this is not an issue for a .csv file containing a few hundred rows and, for larger files or those stored in legacy formats, I could add a \"backup\" step to my code, like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- read.csv(\"some-data-file.csv\")\ndata_backup <- data\n\n# Do some work on data\n\n# Ahh! I made a mistake, let's try again\n\ndata <- data_backup\n```\n:::\n\n\nThis works fine, but it is a bit of an anti-pattern and ought to, in my opinion, be avoided. 
Instead of adding this extra step - possibly increasing the memory used in the R session - you can use DuckDB to directly query files stored on disk, without having to load them into memory first.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(duckdb)\n\ncon <- dbConnect(duckdb::duckdb())\n\ndata <- dbGetQuery(\n\tcon,\n\t\"SELECT col_1, col_2, col_4, col_10\n\tFROM 'some-data-file.csv'\n\tWHERE col_10 = 'some_value'\"\n)\n```\n:::\n\n\nThis may seem more complicated at first, and does require some knowledge of SQL, but it is a very efficient way of working with larger datasets, especially in the early stages when you're still exploring the data and working out what you're going to do with it.\n\n## {duckplyr}\n\nA game-changer for me, which really accelerated my adoption of DuckDB as a backend for processing data, was the [{duckplyr}](https://duckplyr.tidyverse.org) package. Those familiar with [{dbplyr}](https://dbplyr.tidyverse.org) will understand the theory behind this package; it allows queries to be built using the standard set of [{dplyr}](https://dplyr.tidyverse.org) functions, which are then converted to SQL behind the scenes. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(duckdb)\nlibrary(duckplyr)\n\ncon <- dbConnect(duckdb::duckdb())\n\npath_to_some_data_file <- \"some-data-file.csv\"\n\ndata <- tbl_file(con, path_to_some_data_file) |>\n\tas_duckdb_tibble() |>\n\tselect(col_1, col_2, col_4, col_10) |>\n\tfilter(col_10 == \"some_value\")\n```\n:::\n\n\nAside from the `tbl_file()`, and `as_duckdb_tibble()` functions, the rest of the code will be familiar to anyone who has used {dplyr} before. The main advantage of using {duckplyr} over writing SQL and using the [{DBI}](https://dbi.r-dbi.org) package is readability - using common {dplyr} functions makes it accessible to a wider range of users. 
This is a big benefit for teams where not everyone is comfortable reading or writing SQL.\n\nAdditionally, should the original author fall off the face of the earth, the code is still maintainable by others and readily adapted to eliminate the dependency on DuckDB.\n\n## Final thoughts\n\nDuckDB and R are a great combination, allowing me to overcome some of my (self-inflicted?) frustrations in my day-to-day data work. With {duckplyr}, querying data directly from files has smoothed out some of the rough edges in my workflow.\n\n### Update\n\nThe code and text were updated to add the `as_duckdb_tibble()` call that was erroneously omitted from the original post.",
    "supporting": [],
    "filters": [
      "rmarkdown/pagebreak.lua"
    ],
    "includes": {},
    "engineDependencies": {},
    "preserve": {},
    "postProcess": true
  }
}
{
  "hash": "93302147db836b82be4b7d349d84b1a9",
  "result": {
    "engine": "knitr",
    "markdown": "---\ntitle: \"The basics of DuckDB in R\"\nauthor: Rory Lawless\ndate: 2025-03-30\nlastmod: 2026-01-01\nformat: html\naliases:\n  - r-duckdb-and-me.html\n---\n\nOver the past year, [DuckDB](https://duckdb.org/docs/stable/clients/r) has gradually become an important part of my data science workflow - at first clumsily, then seamlessly. I don’t typically work with large datasets; however, integrating DuckDB has addressed some of my frustrations, especially when dealing with hardware limitations and moderately sized but inefficiently stored data. With this in mind, here are two major benefits I’ve found since integrating DuckDB into my workflow.\n\n## Handling larger-than-memory data\n\nAs noted, I don't often work with very large data, but I still run into annoying issues caused by repeatedly reloading data after making mistakes. This is not an issue for a .csv file containing a few hundred rows. For larger files, or those stored in legacy formats, I could add a \"backup\" step to my code, like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata <- read.csv(\"some-data-file.csv\")\ndata_backup <- data\n\n# Do some work on data, maybe make a mistake...\n\ndata <- data_backup\n```\n:::\n\n\nThe backup step works fine, but I consider it an anti-pattern that is best avoided. Instead of adding this extra step - and likely increasing the memory used in the R session - you can use DuckDB to query files stored on disk directly, without having to load them into memory first.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(duckdb)\n\n# Create a DuckDB connection\ncon <- dbConnect(duckdb::duckdb())\n\n# Write a SQL query to read data directly from the CSV file\ndata <- dbGetQuery(\n\tcon,\n\t\"SELECT col_1, col_2, col_4, col_10\n\tFROM 'some-data-file.csv'\n\tWHERE col_10 = 'some_value'\"\n)\n```\n:::\n\n\nThis may seem more complicated at first, and it does require some knowledge of SQL, but it is an efficient way of working with larger datasets because only the selected columns and matching rows are returned to R. That is especially useful in the early stages, when you're still exploring the data and working out what you're going to do with it.\n\n## {duckplyr}\n\nA game-changer for me, which really accelerated my adoption of DuckDB as a backend for processing data, was the [{duckplyr}](https://duckplyr.tidyverse.org) package. Those familiar with [{dbplyr}](https://dbplyr.tidyverse.org) will find the idea familiar: queries are built using the standard set of [{dplyr}](https://dplyr.tidyverse.org) functions and executed by the database engine behind the scenes. Where {dbplyr} does this by generating SQL, {duckplyr} translates the verbs directly into DuckDB operations and falls back to {dplyr} itself for anything DuckDB cannot handle.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(duckplyr)\n\n# Read CSV using DuckDB behind the scenes\ndata <- read_csv_duckdb(\"some-data-file.csv\")\n\n# Perform data manipulation using dplyr syntax\ndata <- data |>\n\tselect(col_1, col_2, col_4, col_10) |>\n\tfilter(col_10 == \"some_value\")\n```\n:::\n\n\nAside from the `read_csv_duckdb()` function, the rest of the code will be familiar to anyone who has used {dplyr} before. The main advantage of using {duckplyr} over writing SQL and running it through the [{DBI}](https://dbi.r-dbi.org) package is readability: common {dplyr} functions make the code accessible to a wider range of users. This is a big benefit for teams where not everyone is comfortable reading or writing SQL.\n\nAdditionally, should the original author fall off the face of the earth, the code remains maintainable by others and can readily be adapted to remove the dependency on DuckDB.\n\n## Final thoughts\n\nDuckDB and R are a great combination, allowing me to overcome some of my (self-inflicted?) frustrations in my day-to-day data work. With {duckplyr}, querying data directly from files has smoothed out some of the rough edges in my workflow.\n",
    "supporting": [],
    "filters": [
      "rmarkdown/pagebreak.lua"
    ],
    "includes": {},
    "engineDependencies": {},
    "preserve": {},
    "postProcess": true
  }
}
{
  "hash": "061c38d96c7f4c560b958a82d0e2a18f",
  "result": {
    "engine": "knitr",
    "markdown": "---\ntitle: \"Using 1Password Secret References in R\"\nauthor: Rory Lawless\ndate: 2025-11-30\nformat: html\ndraft: true\n---\n\nIf you're like me, you store your API keys in your .Renviron file and forget about them. Not only is it risky to store them in plain text, it is also a nightmarish way to manage and rotate keys when needed.\n\n1Password offers a great solution for managing your sensitive data; it even has a dedicated API Credentials item type for people super into taxonomy. The real magic, however, happens when you introduce yourself to [1Password CLI and its handy secret references](https://developer.1password.com/docs/cli/secret-reference-syntax) feature.\n\nI won't go into detail on installation and configuration; [the documentation does a better job than I would](https://developer.1password.com/docs/cli/get-started). Once you have migrated an API key to 1Password, you can refer to the secret using its URI instead of including it in plain text in your .Renviron or R script.\n\nThe URI of a credential comes in this format: `op://<vault-name>/<item-name>/[section-name/]<field-name>`. For example, an Anthropic API key saved as an item named \"ClaudeAPI\" inside your 'Private' vault, with the value in a field called 'credential', can be referred to as `op://Private/ClaudeAPI/credential` (note that the `[section-name/]` element is only required if the field is under a named section within the item).\n\nThe URI of your credential can then replace the plain text API key anywhere you were storing it. Using our ClaudeAPI item, the key=value pair in .Renviron would look like:\n\n```\nANTHROPIC_API_KEY=\"op://Private/ClaudeAPI/credential\"\n```\n\nOur key is stored inside our 1Password vault, safe from prying eyes and accidental exposure. To access this environment variable in a script, we would typically write something like `Sys.getenv(\"ANTHROPIC_API_KEY\")`. While this will technically run, the API call will fail because you will be passing the literal secret reference rather than the key itself. We need to alter this code in two ways. The first is to use one of [the methods 1Password CLI provides](https://developer.1password.com/docs/cli/secrets-scripts) for loading secrets into code; the one we will use in our R script is `op read`.\n\nThe second change is to call `system2()` from base R, which lets us run `op read` (or any system command) from our R script.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsystem2(\n\t\"op\",\n\targs = c(\"read\", shQuote(Sys.getenv(\"ANTHROPIC_API_KEY\"))),\n\tstdout = TRUE\n)\n```\n:::\n",
    "supporting": [
      "index_files"
    ],
    "filters": [
      "rmarkdown/pagebreak.lua"
    ],
    "includes": {},
    "engineDependencies": {},
    "preserve": {},
    "postProcess": true
  }
}
---
about:
  template: solana
---

{{< include _about-short.qmd >}}

## Résumé

### Work
**Data Analyst, Office of the Deputy Mayor for Education (DME)**

*Washington, DC.* 2022–Present

- Contributed analysis for key publications, including the 2023 Master Facilities Plan.
- Improved school enrollment projection processes, leading cross-agency collaboration.
- Served as information officer and racial justice & equity team member.

**Associate Data Analyst, Financial Conduct Authority (FCA)**

*London, UK.* 2020–2022

- Produced insightful and impactful analysis for internal and external audiences in support of high-profile, politically sensitive publications.
- Collected and extracted data using SQL and web scraping techniques.
- Developed best practice for data management and replicable analysis through training and ad hoc advice and support.

**Research Assistant - Data, Child Outcomes Research Consortium (CORC)**

*London, UK.* 2018–2020

- Supported services in submitting data to the organization, including through one-to-one support and by organizing and hosting webinars.
- Redesigned data collection tools to improve the quality of submissions and automated internal data validation processes.

### Education

**MSc Democracy and Comparative Politics, Distinction**

*University College London. London, UK.* 2016

**BA (Hons.) Politics, Upper Second Class**

*Royal Holloway, University of London. Surrey, UK.* 2013
# options specified here will apply to all posts in this folder

# freeze computational output
# (see https://quarto.org/docs/projects/code-execution.html#freeze)
freeze: auto
date-format: iso
posts/the-basics-of-duckdb-in-r/index.qmd
---
title: "The basics of DuckDB in R"
author: Rory Lawless
date: 2025-03-30
lastmod: 2026-01-01
format: html
aliases:
  - r-duckdb-and-me.html
---

Over the past year, [DuckDB](https://duckdb.org/docs/stable/clients/r) has gradually become an important part of my data science workflow - at first clumsily, then seamlessly. I don’t typically work with large datasets; however, integrating DuckDB has addressed some of my frustrations, especially when dealing with hardware limitations and moderately sized but inefficiently stored data. With this in mind, here are two major benefits I’ve found since integrating DuckDB into my workflow.

## Handling larger-than-memory data

As noted, I don't often work with very large data, but I still run into annoying issues caused by repeatedly reloading data after making mistakes. This is not an issue for a .csv file containing a few hundred rows. For larger files, or those stored in legacy formats, I could add a "backup" step to my code, like so:

```{r}
#| eval: false

data <- read.csv("some-data-file.csv")
data_backup <- data

# Do some work on data, maybe make a mistake...

data <- data_backup

```
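
A small aside on the backup step above: R copies objects lazily, so `data_backup <- data` costs almost nothing until `data` is modified, at which point the session holds two versions of whatever changed. The sketch below illustrates this on a made-up data frame. It assumes the {lobstr} package is installed, which the original workflow does not use, and the object names are purely illustrative.

```{r}
#| eval: false

library(lobstr) # assumption: not part of the original workflow

# Made-up data standing in for the contents of some-data-file.csv
data <- data.frame(x = runif(1e5), y = runif(1e5))
data_backup <- data

# obj_size() counts memory shared between its arguments only once
size_before <- obj_size(data, data_backup)

# Modifying a column forces R to allocate a new copy of that column
data$y <- data$y * 2
size_after <- obj_size(data, data_backup)

# The pair now occupies more memory than it did before the change
size_after > size_before
```

The growth is proportional to what was modified, so a backup of a large table that is then heavily reworked can approach double the original footprint.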

The backup step works fine, but I consider it an anti-pattern that is best avoided. Instead of adding this extra step - and likely increasing the memory used in the R session - you can use DuckDB to query files stored on disk directly, without having to load them into memory first.

```{r}
#| eval: false

library(tidyverse)
library(duckdb)

# Create a DuckDB connection
con <- dbConnect(duckdb::duckdb())

# Write a SQL query to read data directly from the CSV file
data <- dbGetQuery(
  con,
  "SELECT col_1, col_2, col_4, col_10
  FROM 'some-data-file.csv'
  WHERE col_10 = 'some_value'"
)
```

This may seem more complicated at first, and it does require some knowledge of SQL, but it is an efficient way of working with larger datasets because only the selected columns and matching rows are returned to R. That is especially useful in the early stages, when you're still exploring the data and working out what you're going to do with it.
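
As a rough sketch of that exploratory stage, the queries below inspect a CSV file without importing it into R. The file is a temporary copy of the built-in `mtcars` data, used only so the example is self-contained; the variable names are illustrative and not taken from the post.

```{r}
#| eval: false

library(duckdb)

# Stand-in file so the example is self-contained (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)
csv_path <- normalizePath(csv_path, winslash = "/")

con <- dbConnect(duckdb::duckdb())

# Column names and the types DuckDB infers for them
schema <- dbGetQuery(con, sprintf("DESCRIBE SELECT * FROM '%s'", csv_path))

# Row count computed by DuckDB; only a single number is returned to R
row_count <- dbGetQuery(con, sprintf("SELECT count(*) AS n FROM '%s'", csv_path))

# A small aggregate: again, only the summary rows come back
avg_mpg <- dbGetQuery(
  con,
  sprintf("SELECT cyl, avg(mpg) AS avg_mpg FROM '%s' GROUP BY cyl ORDER BY cyl", csv_path)
)

dbDisconnect(con, shutdown = TRUE)
```

Each result is an ordinary data frame, so it can be printed or plotted as usual while the full file stays on disk.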

## {duckplyr}

A game-changer for me, which really accelerated my adoption of DuckDB as a backend for processing data, was the [{duckplyr}](https://duckplyr.tidyverse.org) package. Those familiar with [{dbplyr}](https://dbplyr.tidyverse.org) will find the idea familiar: queries are built using the standard set of [{dplyr}](https://dplyr.tidyverse.org) functions and executed by the database engine behind the scenes. Where {dbplyr} does this by generating SQL, {duckplyr} translates the verbs directly into DuckDB operations and falls back to {dplyr} itself for anything DuckDB cannot handle.

```{r}
#| eval: false

library(tidyverse)
library(duckplyr)

# Read CSV using DuckDB behind the scenes
data <- read_csv_duckdb("some-data-file.csv")

# Perform data manipulation using dplyr syntax
data <- data |>
  select(col_1, col_2, col_4, col_10) |>
  filter(col_10 == "some_value")
```

Aside from the `read_csv_duckdb()` function, the rest of the code will be familiar to anyone who has used {dplyr} before. The main advantage of using {duckplyr} over writing SQL and running it through the [{DBI}](https://dbi.r-dbi.org) package is readability: common {dplyr} functions make the code accessible to a wider range of users. This is a big benefit for teams where not everyone is comfortable reading or writing SQL.
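
For a version of this pipeline that runs end to end, the sketch below first writes a small made-up CSV to a temporary file. The column names mirror the placeholders used above, and `collect()` is added to pull the result into an ordinary tibble. It assumes a recent {duckplyr} release that provides `read_csv_duckdb()`.

```{r}
#| eval: false

library(dplyr)
library(duckplyr)

# Made-up data written to a temporary file (stand-in for some-data-file.csv)
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    col_1 = 1:4,
    col_2 = c("a", "b", "c", "d"),
    col_10 = c("some_value", "other", "some_value", "other")
  ),
  csv_path,
  row.names = FALSE
)

# Familiar dplyr verbs on a DuckDB-backed frame; collect() materialises the result in R
result <- read_csv_duckdb(csv_path) |>
  select(col_1, col_2, col_10) |>
  filter(col_10 == "some_value") |>
  collect()
```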

Additionally, should the original author fall off the face of the earth, the code remains maintainable by others and can readily be adapted to remove the dependency on DuckDB.
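
To illustrate that last point, here is one way the pipeline might look with DuckDB removed. Only the reading function changes, here swapped for `read_csv()` from {readr}, which is an assumption for this sketch rather than anything the post prescribes. The temporary file and its contents are made up.

```{r}
#| eval: false

library(dplyr)
library(readr)

# Made-up data written to a temporary file (stand-in for some-data-file.csv)
csv_path <- tempfile(fileext = ".csv")
write_csv(
  tibble(
    col_1 = 1:4,
    col_2 = c("a", "b", "c", "d"),
    col_10 = c("some_value", "other", "some_value", "other")
  ),
  csv_path
)

# Same dplyr verbs as before; only the reader differs
data <- read_csv(csv_path, show_col_types = FALSE) |>
  select(col_1, col_2, col_10) |>
  filter(col_10 == "some_value")
```

The trade-off is that the whole file is now read into memory before the filter is applied.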

## Final thoughts

DuckDB and R are a great combination, allowing me to overcome some of my (self-inflicted?) frustrations in my day-to-day data work. With {duckplyr}, querying data directly from files has smoothed out some of the rough edges in my workflow.