···11Package: mintR
22-Title: What the Package Does (One Line, Title Case)
22+Title: mintR: function to mine text
33Version: 0.0.0.9000
44Authors@R:
55- person("First", "Last", , "first.last@example.com", role = c("aut", "cre"))
66-Description: What the package does (one paragraph).
77-License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
88- license
55+ person("Aurélien", "Goutsmedt", , "agoutsmedt@hotmail.fr", role = c("aut", "cre"))
66+Description: Various functions for text mining and analysis: tokenization, counting terms, extracting structural topic model effects, labelling topics, etc..
77+License: CC BY 4.0
98Encoding: UTF-8
99+Imports:
1010+ cli,
1111+ data.table,
1212+ forcats,
1313+ ggplot2,
1414+ here,
1515+ scales,
1616+ stats,
1717+ stm,
1818+ stringr,
1919+ tidytext,
2020+ tokenizers,
2121+ utils
1022Roxygen: list(markdown = TRUE)
1123RoxygenNote: 7.3.3
+395
LICENSE.md
···11+Attribution 4.0 International
22+33+=======================================================================
44+55+Creative Commons Corporation ("Creative Commons") is not a law firm and
66+does not provide legal services or legal advice. Distribution of
77+Creative Commons public licenses does not create a lawyer-client or
88+other relationship. Creative Commons makes its licenses and related
99+information available on an "as-is" basis. Creative Commons gives no
1010+warranties regarding its licenses, any material licensed under their
1111+terms and conditions, or any related information. Creative Commons
1212+disclaims all liability for damages resulting from their use to the
1313+fullest extent possible.
1414+1515+Using Creative Commons Public Licenses
1616+1717+Creative Commons public licenses provide a standard set of terms and
1818+conditions that creators and other rights holders may use to share
1919+original works of authorship and other material subject to copyright
2020+and certain other rights specified in the public license below. The
2121+following considerations are for informational purposes only, are not
2222+exhaustive, and do not form part of our licenses.
2323+2424+ Considerations for licensors: Our public licenses are
2525+ intended for use by those authorized to give the public
2626+ permission to use material in ways otherwise restricted by
2727+ copyright and certain other rights. Our licenses are
2828+ irrevocable. Licensors should read and understand the terms
2929+ and conditions of the license they choose before applying it.
3030+ Licensors should also secure all rights necessary before
3131+ applying our licenses so that the public can reuse the
3232+ material as expected. Licensors should clearly mark any
3333+ material not subject to the license. This includes other CC-
3434+ licensed material, or material used under an exception or
3535+ limitation to copyright. More considerations for licensors:
3636+ wiki.creativecommons.org/Considerations_for_licensors
3737+3838+ Considerations for the public: By using one of our public
3939+ licenses, a licensor grants the public permission to use the
4040+ licensed material under specified terms and conditions. If
4141+ the licensor's permission is not necessary for any reason--for
4242+ example, because of any applicable exception or limitation to
4343+ copyright--then that use is not regulated by the license. Our
4444+ licenses grant only permissions under copyright and certain
4545+ other rights that a licensor has authority to grant. Use of
4646+ the licensed material may still be restricted for other
4747+ reasons, including because others have copyright or other
4848+ rights in the material. A licensor may make special requests,
4949+ such as asking that all changes be marked or described.
5050+ Although not required by our licenses, you are encouraged to
5151+ respect those requests where reasonable. More considerations
5252+ for the public:
5353+ wiki.creativecommons.org/Considerations_for_licensees
5454+5555+=======================================================================
5656+5757+Creative Commons Attribution 4.0 International Public License
5858+5959+By exercising the Licensed Rights (defined below), You accept and agree
6060+to be bound by the terms and conditions of this Creative Commons
6161+Attribution 4.0 International Public License ("Public License"). To the
6262+extent this Public License may be interpreted as a contract, You are
6363+granted the Licensed Rights in consideration of Your acceptance of
6464+these terms and conditions, and the Licensor grants You such rights in
6565+consideration of benefits the Licensor receives from making the
6666+Licensed Material available under these terms and conditions.
6767+6868+6969+Section 1 -- Definitions.
7070+7171+ a. Adapted Material means material subject to Copyright and Similar
7272+ Rights that is derived from or based upon the Licensed Material
7373+ and in which the Licensed Material is translated, altered,
7474+ arranged, transformed, or otherwise modified in a manner requiring
7575+ permission under the Copyright and Similar Rights held by the
7676+ Licensor. For purposes of this Public License, where the Licensed
7777+ Material is a musical work, performance, or sound recording,
7878+ Adapted Material is always produced where the Licensed Material is
7979+ synched in timed relation with a moving image.
8080+8181+ b. Adapter's License means the license You apply to Your Copyright
8282+ and Similar Rights in Your contributions to Adapted Material in
8383+ accordance with the terms and conditions of this Public License.
8484+8585+ c. Copyright and Similar Rights means copyright and/or similar rights
8686+ closely related to copyright including, without limitation,
8787+ performance, broadcast, sound recording, and Sui Generis Database
8888+ Rights, without regard to how the rights are labeled or
8989+ categorized. For purposes of this Public License, the rights
9090+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
9191+ Rights.
9292+9393+ d. Effective Technological Measures means those measures that, in the
9494+ absence of proper authority, may not be circumvented under laws
9595+ fulfilling obligations under Article 11 of the WIPO Copyright
9696+ Treaty adopted on December 20, 1996, and/or similar international
9797+ agreements.
9898+9999+ e. Exceptions and Limitations means fair use, fair dealing, and/or
100100+ any other exception or limitation to Copyright and Similar Rights
101101+ that applies to Your use of the Licensed Material.
102102+103103+ f. Licensed Material means the artistic or literary work, database,
104104+ or other material to which the Licensor applied this Public
105105+ License.
106106+107107+ g. Licensed Rights means the rights granted to You subject to the
108108+ terms and conditions of this Public License, which are limited to
109109+ all Copyright and Similar Rights that apply to Your use of the
110110+ Licensed Material and that the Licensor has authority to license.
111111+112112+ h. Licensor means the individual(s) or entity(ies) granting rights
113113+ under this Public License.
114114+115115+ i. Share means to provide material to the public by any means or
116116+ process that requires permission under the Licensed Rights, such
117117+ as reproduction, public display, public performance, distribution,
118118+ dissemination, communication, or importation, and to make material
119119+ available to the public including in ways that members of the
120120+ public may access the material from a place and at a time
121121+ individually chosen by them.
122122+123123+ j. Sui Generis Database Rights means rights other than copyright
124124+ resulting from Directive 96/9/EC of the European Parliament and of
125125+ the Council of 11 March 1996 on the legal protection of databases,
126126+ as amended and/or succeeded, as well as other essentially
127127+ equivalent rights anywhere in the world.
128128+129129+ k. You means the individual or entity exercising the Licensed Rights
130130+ under this Public License. Your has a corresponding meaning.
131131+132132+133133+Section 2 -- Scope.
134134+135135+ a. License grant.
136136+137137+ 1. Subject to the terms and conditions of this Public License,
138138+ the Licensor hereby grants You a worldwide, royalty-free,
139139+ non-sublicensable, non-exclusive, irrevocable license to
140140+ exercise the Licensed Rights in the Licensed Material to:
141141+142142+ a. reproduce and Share the Licensed Material, in whole or
143143+ in part; and
144144+145145+ b. produce, reproduce, and Share Adapted Material.
146146+147147+ 2. Exceptions and Limitations. For the avoidance of doubt, where
148148+ Exceptions and Limitations apply to Your use, this Public
149149+ License does not apply, and You do not need to comply with
150150+ its terms and conditions.
151151+152152+ 3. Term. The term of this Public License is specified in Section
153153+ 6(a).
154154+155155+ 4. Media and formats; technical modifications allowed. The
156156+ Licensor authorizes You to exercise the Licensed Rights in
157157+ all media and formats whether now known or hereafter created,
158158+ and to make technical modifications necessary to do so. The
159159+ Licensor waives and/or agrees not to assert any right or
160160+ authority to forbid You from making technical modifications
161161+ necessary to exercise the Licensed Rights, including
162162+ technical modifications necessary to circumvent Effective
163163+ Technological Measures. For purposes of this Public License,
164164+ simply making modifications authorized by this Section 2(a)
165165+ (4) never produces Adapted Material.
166166+167167+ 5. Downstream recipients.
168168+169169+ a. Offer from the Licensor -- Licensed Material. Every
170170+ recipient of the Licensed Material automatically
171171+ receives an offer from the Licensor to exercise the
172172+ Licensed Rights under the terms and conditions of this
173173+ Public License.
174174+175175+ b. No downstream restrictions. You may not offer or impose
176176+ any additional or different terms or conditions on, or
177177+ apply any Effective Technological Measures to, the
178178+ Licensed Material if doing so restricts exercise of the
179179+ Licensed Rights by any recipient of the Licensed
180180+ Material.
181181+182182+ 6. No endorsement. Nothing in this Public License constitutes or
183183+ may be construed as permission to assert or imply that You
184184+ are, or that Your use of the Licensed Material is, connected
185185+ with, or sponsored, endorsed, or granted official status by,
186186+ the Licensor or others designated to receive attribution as
187187+ provided in Section 3(a)(1)(A)(i).
188188+189189+ b. Other rights.
190190+191191+ 1. Moral rights, such as the right of integrity, are not
192192+ licensed under this Public License, nor are publicity,
193193+ privacy, and/or other similar personality rights; however, to
194194+ the extent possible, the Licensor waives and/or agrees not to
195195+ assert any such rights held by the Licensor to the limited
196196+ extent necessary to allow You to exercise the Licensed
197197+ Rights, but not otherwise.
198198+199199+ 2. Patent and trademark rights are not licensed under this
200200+ Public License.
201201+202202+ 3. To the extent possible, the Licensor waives any right to
203203+ collect royalties from You for the exercise of the Licensed
204204+ Rights, whether directly or through a collecting society
205205+ under any voluntary or waivable statutory or compulsory
206206+ licensing scheme. In all other cases the Licensor expressly
207207+ reserves any right to collect such royalties.
208208+209209+210210+Section 3 -- License Conditions.
211211+212212+Your exercise of the Licensed Rights is expressly made subject to the
213213+following conditions.
214214+215215+ a. Attribution.
216216+217217+ 1. If You Share the Licensed Material (including in modified
218218+ form), You must:
219219+220220+ a. retain the following if it is supplied by the Licensor
221221+ with the Licensed Material:
222222+223223+ i. identification of the creator(s) of the Licensed
224224+ Material and any others designated to receive
225225+ attribution, in any reasonable manner requested by
226226+ the Licensor (including by pseudonym if
227227+ designated);
228228+229229+ ii. a copyright notice;
230230+231231+ iii. a notice that refers to this Public License;
232232+233233+ iv. a notice that refers to the disclaimer of
234234+ warranties;
235235+236236+ v. a URI or hyperlink to the Licensed Material to the
237237+ extent reasonably practicable;
238238+239239+ b. indicate if You modified the Licensed Material and
240240+ retain an indication of any previous modifications; and
241241+242242+ c. indicate the Licensed Material is licensed under this
243243+ Public License, and include the text of, or the URI or
244244+ hyperlink to, this Public License.
245245+246246+ 2. You may satisfy the conditions in Section 3(a)(1) in any
247247+ reasonable manner based on the medium, means, and context in
248248+ which You Share the Licensed Material. For example, it may be
249249+ reasonable to satisfy the conditions by providing a URI or
250250+ hyperlink to a resource that includes the required
251251+ information.
252252+253253+ 3. If requested by the Licensor, You must remove any of the
254254+ information required by Section 3(a)(1)(A) to the extent
255255+ reasonably practicable.
256256+257257+ 4. If You Share Adapted Material You produce, the Adapter's
258258+ License You apply must not prevent recipients of the Adapted
259259+ Material from complying with this Public License.
260260+261261+262262+Section 4 -- Sui Generis Database Rights.
263263+264264+Where the Licensed Rights include Sui Generis Database Rights that
265265+apply to Your use of the Licensed Material:
266266+267267+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
268268+ to extract, reuse, reproduce, and Share all or a substantial
269269+ portion of the contents of the database;
270270+271271+ b. if You include all or a substantial portion of the database
272272+ contents in a database in which You have Sui Generis Database
273273+ Rights, then the database in which You have Sui Generis Database
274274+ Rights (but not its individual contents) is Adapted Material; and
275275+276276+ c. You must comply with the conditions in Section 3(a) if You Share
277277+ all or a substantial portion of the contents of the database.
278278+279279+For the avoidance of doubt, this Section 4 supplements and does not
280280+replace Your obligations under this Public License where the Licensed
281281+Rights include other Copyright and Similar Rights.
282282+283283+284284+Section 5 -- Disclaimer of Warranties and Limitation of Liability.
285285+286286+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
287287+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
288288+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
289289+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
290290+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
291291+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
292292+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
293293+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
294294+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
295295+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
296296+297297+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
298298+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
299299+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
300300+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
301301+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
302302+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
303303+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
304304+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
305305+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
306306+307307+ c. The disclaimer of warranties and limitation of liability provided
308308+ above shall be interpreted in a manner that, to the extent
309309+ possible, most closely approximates an absolute disclaimer and
310310+ waiver of all liability.
311311+312312+313313+Section 6 -- Term and Termination.
314314+315315+ a. This Public License applies for the term of the Copyright and
316316+ Similar Rights licensed here. However, if You fail to comply with
317317+ this Public License, then Your rights under this Public License
318318+ terminate automatically.
319319+320320+ b. Where Your right to use the Licensed Material has terminated under
321321+ Section 6(a), it reinstates:
322322+323323+ 1. automatically as of the date the violation is cured, provided
324324+ it is cured within 30 days of Your discovery of the
325325+ violation; or
326326+327327+ 2. upon express reinstatement by the Licensor.
328328+329329+ For the avoidance of doubt, this Section 6(b) does not affect any
330330+ right the Licensor may have to seek remedies for Your violations
331331+ of this Public License.
332332+333333+ c. For the avoidance of doubt, the Licensor may also offer the
334334+ Licensed Material under separate terms or conditions or stop
335335+ distributing the Licensed Material at any time; however, doing so
336336+ will not terminate this Public License.
337337+338338+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
339339+ License.
340340+341341+342342+Section 7 -- Other Terms and Conditions.
343343+344344+ a. The Licensor shall not be bound by any additional or different
345345+ terms or conditions communicated by You unless expressly agreed.
346346+347347+ b. Any arrangements, understandings, or agreements regarding the
348348+ Licensed Material not stated herein are separate from and
349349+ independent of the terms and conditions of this Public License.
350350+351351+352352+Section 8 -- Interpretation.
353353+354354+ a. For the avoidance of doubt, this Public License does not, and
355355+ shall not be interpreted to, reduce, limit, restrict, or impose
356356+ conditions on any use of the Licensed Material that could lawfully
357357+ be made without permission under this Public License.
358358+359359+ b. To the extent possible, if any provision of this Public License is
360360+ deemed unenforceable, it shall be automatically reformed to the
361361+ minimum extent necessary to make it enforceable. If the provision
362362+ cannot be reformed, it shall be severed from this Public License
363363+ without affecting the enforceability of the remaining terms and
364364+ conditions.
365365+366366+ c. No term or condition of this Public License will be waived and no
367367+ failure to comply consented to unless expressly agreed to by the
368368+ Licensor.
369369+370370+ d. Nothing in this Public License constitutes or may be interpreted
371371+ as a limitation upon, or waiver of, any privileges and immunities
372372+ that apply to the Licensor or You, including from the legal
373373+ processes of any jurisdiction or authority.
374374+375375+376376+=======================================================================
377377+378378+Creative Commons is not a party to its public
379379+licenses. Notwithstanding, Creative Commons may elect to apply one of
380380+its public licenses to material it publishes and in those instances
381381+will be considered the “Licensor.” The text of the Creative Commons
382382+public licenses is dedicated to the public domain under the CC0 Public
383383+Domain Dedication. Except for the limited purpose of indicating that
384384+material is shared under a Creative Commons public license or as
385385+otherwise permitted by the Creative Commons policies published at
386386+creativecommons.org/policies, Creative Commons does not authorize the
387387+use of the trademark "Creative Commons" or any other trademark or logo
388388+of Creative Commons without its prior written consent including,
389389+without limitation, in connection with any unauthorized modifications
390390+to any of its public licenses or any other arrangements,
391391+understandings, or agreements concerning use of licensed material. For
392392+the avoidance of doubt, this paragraph does not form part of the
393393+public licenses.
394394+395395+Creative Commons may be contacted at creativecommons.org.
+17
NAMESPACE
···11# Generated by roxygen2: do not edit by hand
2233+export(average_frex)
44+export(calculate_beta)
55+export(calculate_frex)
66+export(calculate_lift)
77+export(calculate_score)
88+export(compute_mean_frex)
99+export(compute_stm_term_counts)
1010+export(compute_tf_idf)
1111+export(extract_ngrams)
1212+export(extract_top_terms)
1313+export(extract_topic_effects)
1414+export(filter_group_quantile)
1515+export(make_topic_labels)
1616+export(plot_topic_prevalence)
1717+export(plot_topic_prevalence_over_time)
1818+export(stm_term_counts)
1919+import(data.table)
+147
R/compute_tf_idf.R
···11+#' Compute TF–IDF (optionally weighted) for tokens per document
22+#'
33+#' @title Compute TF–IDF for tokenized data
44+#' @description Compute term frequency (TF), inverse document frequency (IDF),
55+#' and TF–IDF from tokenized data.
66+#'
77+#' @param df A `data.frame` or `data.table` with token data.
88+#' @param token_col Character scalar. Column name for token values.
99+#' @param document_col Character scalar or vector. Column(s) defining document ID.
1010+#' @param weight_col Optional character scalar with numeric weights (for example
1111+#' aggregated token counts). If `NULL`, each row contributes weight `1`.
1212+#'
1313+#' @return A `data.table` with one row per token-document pair and computed
1414+#' columns including `corpus_tf`, `nb_doc_word`, `df`, `idf`, and `tf_idf`.
1515+#' If `weight_col` is provided, `weighted_tf` is returned instead of `tf`.
1616+#'
1717+#' @examples
1818+#' dt <- data.table::data.table(
1919+#' document = c("d1", "d1", "d2"),
2020+#' token = c("labor", "market", "labor")
2121+#' )
2222+#' compute_tf_idf(dt)
2323+#'
2424+#' @export
2525+compute_tf_idf <- function(
2626+ df,
2727+ token_col = "token",
2828+ document_col = "document",
2929+ weight_col = NULL
3030+) {
3131+ df <- data.table::as.data.table(df)
3232+ df <- data.table::copy(df)
3333+3434+ document_col <- as.character(document_col)
3535+ token_col <- as.character(token_col)
3636+3737+ if (!all(document_col %in% colnames(df))) {
3838+ stop(
3939+ "Input must contain document column(s): ",
4040+ paste(document_col, collapse = ", "),
4141+ call. = FALSE
4242+ )
4343+ }
4444+ if (!token_col %in% colnames(df)) {
4545+ stop("Input must contain token column: ", token_col, call. = FALSE)
4646+ }
4747+4848+ if (!is.null(weight_col)) {
4949+ weight_col <- as.character(weight_col)
5050+ if (!weight_col %in% colnames(df)) {
5151+ stop("weight_col '", weight_col, "' not found in input", call. = FALSE)
5252+ }
5353+ if (!is.numeric(df[[weight_col]])) {
5454+ df[, (weight_col) := as.numeric(.SD[[1]]), .SDcols = weight_col]
5555+ }
5656+ if (anyNA(df[[weight_col]])) {
5757+ warning("NA values found in weight_col; treating as 0 for weighting")
5858+ df[is.na(get(weight_col)), (weight_col) := 0]
5959+ }
6060+ df[, .weight := abs(df[[weight_col]])]
6161+ } else {
6262+ df[, .weight := 1.0]
6363+ }
6464+6565+ if (token_col %in% document_col) {
6666+ stop("`token_col` must be distinct from `document_col`", call. = FALSE)
6767+ }
6868+6969+ doc_key <- ".document_tmp_key"
7070+ i <- 1L
7171+ while (doc_key %in% colnames(df)) {
7272+ doc_key <- paste0(".document_tmp_key", i)
7373+ i <- i + 1L
7474+ }
7575+7676+ if (length(document_col) == 1L) {
7777+ df[, (doc_key) := as.character(.SD[[1]]), .SDcols = document_col]
7878+ } else {
7979+ df[,
8080+ (doc_key) := do.call(paste, c(.SD, sep = "\r")),
8181+ .SDcols = document_col
8282+ ]
8383+ }
8484+8585+ if (token_col != "token") {
8686+ data.table::setnames(df, token_col, "token")
8787+ token_was_renamed <- TRUE
8888+ } else {
8989+ token_was_renamed <- FALSE
9090+ }
9191+9292+ corpus_tf_dt <- df[, .(corpus_tf = sum(.weight)), by = "token"]
9393+9494+ token_doc <- df[,
9595+ .(token_doc_weight = sum(.weight)),
9696+ by = c("token", doc_key)
9797+ ]
9898+9999+ token_doc <- merge(
100100+ token_doc,
101101+ corpus_tf_dt,
102102+ by = "token",
103103+ all.x = TRUE,
104104+ sort = FALSE
105105+ )
106106+107107+ doc_totals <- df[, .(nb_doc_word = sum(.weight)), by = doc_key]
108108+109109+ token_doc <- merge(
110110+ token_doc,
111111+ doc_totals,
112112+ by = doc_key,
113113+ all.x = TRUE,
114114+ sort = FALSE
115115+ )
116116+117117+ token_doc[, tf := token_doc_weight / nb_doc_word]
118118+119119+ df_dt <- token_doc[, .(df = .N), by = token]
120120+ total_docs <- data.table::uniqueN(token_doc[[doc_key]])
121121+122122+ token_doc <- merge(token_doc, df_dt, by = "token", all.x = TRUE, sort = FALSE)
123123+124124+ token_doc[, idf := log(total_docs / df)]
125125+ token_doc[, tf_idf := tf * idf]
126126+127127+ if (token_was_renamed) {
128128+ data.table::setnames(token_doc, "token", token_col)
129129+ }
130130+131131+ if (length(document_col) == 1L) {
132132+ data.table::setnames(token_doc, doc_key, document_col)
133133+ } else {
134134+ token_doc[,
135135+ (document_col) := data.table::tstrsplit(get(doc_key), "\r", fixed = TRUE)
136136+ ]
137137+ token_doc[, (doc_key) := NULL]
138138+ }
139139+140140+ token_doc[, c("token_doc_weight", ".weight") := NULL]
141141+142142+ if (!is.null(weight_col)) {
143143+ data.table::setnames(token_doc, "tf", "weighted_tf")
144144+ }
145145+146146+ token_doc[]
147147+}
+268
R/extract_ngrams.R
···11+#' Extract n-grams from text with basic filtering and grouped counts
22+#'
33+#' @title Extract n-grams from text (document-level only)
44+#' @description Tokenise text into n-grams (unigrams, bigrams, etc.) per group,
55+#' remove tokens containing digits or stop words, and return a compact
66+#' data.table with one row per grouping + token + ngram and a frequency column.
77+#'
88+#' @param df A `data.frame` or `data.table` containing the input text.
99+#' This function converts `df` to `data.table` in-place.
1010+#' @param ngrams Integer scalar or vector. If scalar (for example `2L`),
1111+#' values from `1:ngrams` are computed. If vector (for example `c(1L, 2L)`),
1212+#' only those sizes are computed.
1313+#' @param grouping_cols Character vector of grouping columns to keep in output.
1414+#' Use `NULL` for no grouping.
1515+#' @param text_col Character scalar. Name of the column that contains text.
1616+#' @param min_nchar Integer scalar. Minimum token length to keep.
1717+#' @param stop_words Optional character vector of stop words.
1818+#' If `NULL`, `tidytext::stop_words$word` is used.
1919+#' @param chunk_size Optional integer number of rows per processing chunk.
2020+#' Use `NULL` or `<= 0` to process in a single pass.
2121+#'
2222+#' @return A `data.table` with columns `grouping_cols`, `token`, `ngram`,
2323+#' and `doc_freq`.
2424+#'
2525+#' @details
2626+#' Tokens are lower-cased by the tokenizer. Tokens containing punctuation,
2727+#' digits, symbols, or stop words are removed. Multi-word tokens in output
2828+#' are normalized with underscores (`_`).
2929+#'
3030+#' @examples
3131+#' df <- data.frame(id = 1:2, text = c("Economic growth matters", "Growth and trade"))
3232+#' extract_ngrams(df, ngrams = 2L, grouping_cols = "id")
3333+#'
3434+#' @export
3535+extract_ngrams <- function(
3636+ df,
3737+ ngrams = 2L,
3838+ grouping_cols = NULL,
3939+ text_col = "text",
4040+ min_nchar = 2L,
4141+ stop_words = NULL,
4242+ chunk_size = NULL
4343+) {
4444+ if (!requireNamespace("tokenizers", quietly = TRUE)) {
4545+ stop("Please install the 'tokenizers' package.")
4646+ }
4747+ if (!requireNamespace("tidytext", quietly = TRUE)) {
4848+ stop("Please install the 'tidytext' package.")
4949+ }
5050+ if (!requireNamespace("stringr", quietly = TRUE)) {
5151+ stop("Please install the 'stringr' package.")
5252+ }
5353+5454+ has_cli <- requireNamespace("cli", quietly = TRUE)
5555+ cli_info <- function(...) if (has_cli) cli::cli_alert_info(...) else NULL
5656+ cli_success <- function(...) {
5757+ if (has_cli) cli::cli_alert_success(...) else NULL
5858+ }
5959+6060+ cli_info("Starting n-gram extraction (document-level)")
6161+6262+ data.table::setDT(df)
6363+6464+ if (is.null(stop_words)) {
6565+ stop_words <- unique(tidytext::stop_words$word)
6666+ }
6767+ stop_words <- tolower(stop_words)
6868+6969+ required_cols <- unique(c(grouping_cols, text_col))
7070+ missing_cols <- setdiff(required_cols, names(df))
7171+ if (length(missing_cols) > 0L) {
7272+ stop(
7373+ "Missing required columns in df: ",
7474+ paste(missing_cols, collapse = ", ")
7575+ )
7676+ }
7777+7878+ if (length(ngrams) == 1L) {
7979+ ng_range <- seq_len(as.integer(ngrams))
8080+ } else {
8181+ ng_range <- as.integer(ngrams)
8282+ }
8383+8484+ if (is.null(grouping_cols)) {
8585+ grouping_cols <- character(0)
8686+ }
8787+8888+ make_empty_result <- function() {
8989+ cols <- c(grouping_cols, "token", "ngram", "doc_freq")
9090+ dt <- data.table::data.table()
9191+ for (nm in cols) {
9292+ dt[, (nm) := list(NULL)]
9393+ }
9494+ dt
9595+ }
9696+9797+ nrows <- nrow(df)
9898+ if (nrows == 0L) {
9999+ return(make_empty_result())
100100+ }
101101+102102+ process_chunked <- function(
103103+ df_subset,
104104+ ng_range,
105105+ grouping_cols,
106106+ text_col,
107107+ chunk_size = NULL
108108+ ) {
109109+ nrows_local <- nrow(df_subset)
110110+ tokens_acc_list <- list()
111111+ list_idx <- 0L
112112+113113+ if (!is.null(chunk_size) && chunk_size > 0L) {
114114+ total_chunks <- ceiling(nrows_local / chunk_size)
115115+ if (has_cli) {
116116+ pb <- cli::cli_progress_bar("Processing chunks", total = total_chunks)
117117+ on.exit(cli::cli_progress_done(), add = TRUE)
118118+ }
119119+ chunk_i <- 0L
120120+ for (start in seq(1L, nrows_local, by = chunk_size)) {
121121+ chunk_i <- chunk_i + 1L
122122+ end <- min(start + chunk_size - 1L, nrows_local)
123123+ if (has_cli) {
124124+ cli::cli_progress_update()
125125+ cli::cli_alert_info(
126126+ "Processing chunk {chunk_i} of {total_chunks} (rows {start}-{end})"
127127+ )
128128+ }
129129+ chunk <- df_subset[start:end, c(grouping_cols, text_col), with = FALSE]
130130+ for (n in ng_range) {
131131+ tmp <- data.table::copy(chunk)
132132+ tmp[,
133133+ token := tokenizers::tokenize_ngrams(
134134+ get(text_col),
135135+ n = as.integer(n),
136136+ lowercase = TRUE
137137+ )
138138+ ]
139139+ tmp[, (text_col) := NULL]
140140+ tmp[, ngram := as.integer(n)]
141141+ tmp_unnest <- tmp[,
142142+ .(token = unlist(token)),
143143+ by = c(grouping_cols, "ngram")
144144+ ]
145145+ if (nrow(tmp_unnest) == 0L) {
146146+ next
147147+ }
148148+ chunk_counts <- tmp_unnest[,
149149+ .(doc_freq = .N),
150150+ by = c(grouping_cols, "token", "ngram")
151151+ ]
152152+ list_idx <- list_idx + 1L
153153+ tokens_acc_list[[list_idx]] <- chunk_counts
154154+ rm(tmp, tmp_unnest, chunk_counts)
155155+ gc()
156156+ }
157157+ rm(chunk)
158158+ gc()
159159+ }
160160+ if (has_cli) {
161161+ cli_success(sprintf(
162162+ "Completed chunked processing (%d chunks)",
163163+ total_chunks
164164+ ))
165165+ }
166166+ } else {
167167+ for (n in ng_range) {
168168+ tmp <- df_subset[, c(grouping_cols, text_col), with = FALSE]
169169+ tmp[,
170170+ token := tokenizers::tokenize_ngrams(
171171+ get(text_col),
172172+ n = as.integer(n),
173173+ lowercase = TRUE
174174+ )
175175+ ]
176176+ tmp[, (text_col) := NULL]
177177+ tmp[, ngram := as.integer(n)]
178178+ tmp_unnest <- tmp[,
179179+ .(token = unlist(token)),
180180+ by = c(grouping_cols, "ngram")
181181+ ]
182182+ if (nrow(tmp_unnest) == 0L) {
183183+ next
184184+ }
185185+ chunk_counts <- tmp_unnest[,
186186+ .(doc_freq = .N),
187187+ by = c(grouping_cols, "token", "ngram")
188188+ ]
189189+ list_idx <- list_idx + 1L
190190+ tokens_acc_list[[list_idx]] <- chunk_counts
191191+ rm(tmp, tmp_unnest, chunk_counts, tmp)
192192+ gc()
193193+ }
194194+ if (has_cli) cli_success("Completed processing all n-gram sizes")
195195+ }
196196+197197+ if (length(tokens_acc_list) == 0L) {
198198+ return(NULL)
199199+ }
200200+201201+ combined <- data.table::rbindlist(
202202+ tokens_acc_list,
203203+ use.names = TRUE,
204204+ fill = TRUE
205205+ )
206206+ combined[,
207207+ .(doc_freq = sum(doc_freq)),
208208+ by = c(grouping_cols, "token", "ngram")
209209+ ]
210210+ }
211211+212212+ tokens_acc <- process_chunked(
213213+ df,
214214+ ng_range,
215215+ grouping_cols,
216216+ text_col,
217217+ chunk_size = chunk_size
218218+ )
219219+220220+ if (is.null(tokens_acc) || nrow(tokens_acc) == 0L) {
221221+ return(make_empty_result())
222222+ }
223223+224224+ unique_tokens <- unique(tokens_acc$token)
225225+ unique_tokens <- as.character(unique_tokens)
226226+227227+ keep_idx <- nchar(unique_tokens) >= as.integer(min_nchar)
228228+229229+ if (any(keep_idx)) {
230230+ toks_to_check <- unique_tokens[keep_idx]
231231+ } else {
232232+ return(make_empty_result())
233233+ }
234234+235235+ toks_to_check <- stringr::str_squish(toks_to_check)
236236+237237+ pattern_special <- stringr::regex("[^\\p{L} ]+", dotall = FALSE)
238238+ has_special <- stringr::str_detect(toks_to_check, pattern_special)
239239+ has_special <- has_special | stringr::str_detect(toks_to_check, "_")
240240+241241+ if (length(stop_words) == 0L) {
242242+ has_stop <- rep(FALSE, length(toks_to_check))
243243+ } else {
244244+ sw_esc <- stringr::str_replace_all(
245245+ stop_words,
246246+ "([\\^$.|?*+()\\[\\]{}\\\\])",
247247+ "\\\\\\1"
248248+ )
249249+ pattern <- paste0("\\b(", paste(sw_esc, collapse = "|"), ")\\b")
250250+ has_stop <- stringr::str_detect(toks_to_check, pattern)
251251+ }
252252+253253+ keep_mask <- !has_stop & !has_special
254254+ keep_tokens <- toks_to_check[keep_mask]
255255+256256+ if (length(keep_tokens) == 0L) {
257257+ return(make_empty_result())
258258+ }
259259+260260+ tokens_acc <- tokens_acc[token %in% keep_tokens]
261261+ tokens_acc[, token := stringr::str_replace_all(token, " ", "_")]
262262+263263+ final_cols <- c(grouping_cols, "token", "ngram", "doc_freq")
264264+ final_cols <- final_cols[final_cols %in% names(tokens_acc)]
265265+ data.table::setcolorder(tokens_acc, final_cols)
266266+267267+ tokens_acc[]
268268+}
+55
R/filter_group_quantile.R
···11+#' Compute and apply a per-group quantile threshold to filter rows
22+#'
33+#' @title Filter rows by per-group quantile threshold
44+#' @description Computes a group-specific quantile threshold from a numeric
55+#' metric and returns rows above that threshold.
66+#'
77+#' @param dt A `data.frame` or `data.table`.
88+#' @param group_cols Character vector of grouping columns.
99+#' @param metric Character scalar. Numeric metric column to threshold.
1010+#' @param probs Numeric scalar in `(0, 1)` passed to `stats::quantile()`.
1111+#' @param inclusive Logical scalar. If `TRUE`, uses `>=`; else uses `>`.
1212+#' @param keep_threshold Logical scalar. If `TRUE`, includes the `threshold`
1313+#' column in the result.
1414+#'
1515+#' @return A filtered `data.table`.
1616+#'
1717+#' @examples
1818+#' dt <- data.table::data.table(group = c("a", "a", "b"), tf = c(1, 3, 2))
1919+#' filter_group_quantile(dt, group_cols = "group", metric = "tf", probs = 0.5)
2020+#'
2121+#' @export
2222+filter_group_quantile <- function(
2323+ dt,
2424+ group_cols,
2525+ metric = "tf",
2626+ probs = 0.75,
2727+ inclusive = FALSE,
2828+ keep_threshold = FALSE
2929+) {
3030+ if (!data.table::is.data.table(dt)) {
3131+ dt <- data.table::as.data.table(dt)
3232+ }
3333+3434+ dt[,
3535+ threshold := as.numeric(stats::quantile(
3636+ .SD[[1]],
3737+ probs = probs,
3838+ na.rm = TRUE
3939+ )),
4040+ by = group_cols,
4141+ .SDcols = metric
4242+ ]
4343+4444+ if (inclusive) {
4545+ res <- dt[get(metric) >= threshold]
4646+ } else {
4747+ res <- dt[get(metric) > threshold]
4848+ }
4949+5050+ if (!keep_threshold) {
5151+ res[, threshold := NULL]
5252+ }
5353+5454+ res[]
5555+}
···11+#' Create short human-readable topic labels from an stm model
22+#'
33+#' @title Create short topic labels from an `stm` model
44+#' @description Extract top words per topic with `stm::labelTopics()` and
55+#' build compact labels such as `"3: labor, wages, unions"`.
66+#'
77+#' @param stm_model Fitted `stm` model object.
88+#' @param type Label type from `c("prob", "frex", "lift", "score")`.
99+#' @param n_words Integer scalar. Number of words per label.
1010+#' @param sep Character scalar used between words.
1111+#'
1212+#' @return A `data.table` with columns `topic` and `label`.
1313+#'
1414+#' @examples
1515+#' \dontrun{
1616+#' # labels <- make_topic_labels(fit, type = "frex", n_words = 4)
1717+#' }
1818+#'
1919+#' @export
2020+make_topic_labels <- function(
2121+ stm_model,
2222+ type = c("prob", "frex", "lift", "score"),
2323+ n_words = 3,
2424+ sep = ", "
2525+) {
2626+ type <- match.arg(type)
2727+ if (n_words < 1) {
2828+ stop("n_words must be >= 1")
2929+ }
3030+3131+ lt <- stm::labelTopics(stm_model, n = n_words)
3232+ mat <- lt[[type]]
3333+3434+ dt <- data.table::as.data.table(mat)
3535+ data.table::setnames(dt, paste0("w", seq_len(ncol(dt))))
3636+ dt[, topic := .I]
3737+3838+ sdcols <- paste0("w", seq_len(n_words))
3939+ dt[,
4040+ label_short := apply(.SD, 1, function(x) {
4141+ paste0(stats::na.omit(x), collapse = sep)
4242+ }),
4343+ .SDcols = sdcols
4444+ ]
4545+4646+ dt[, label := paste0(topic, ": ", label_short)]
4747+ topic_labels <- dt[, .(topic, label)]
4848+ topic_labels
4949+}
+7
R/mintR-package.R
···11+#' mintR: Text mining helpers for STM workflows
22+#'
33+#' Package-level documentation for `mintR`.
44+#'
55+#' @keywords internal
66+#' @import data.table
77+"_PACKAGE"
+124
R/topic_effects.R
···11+#' Extract simulated topic effects from an STM estimate object
22+#'
33+#' @title Extract topic effects from STM simulations
44+#' @description Simulate topic effects for a covariate from an STM
55+#' `estimateEffect()`-like object, and return tidy estimates with uncertainty.
66+#'
77+#' @param x STM estimate-like object (e.g. from `estimateEffect()`).
88+#' @param covariate Character scalar, covariate name.
99+#' @param model Optional STM model object used for labels.
1010+#' @param topics Integer vector of topics to extract.
1111+#' @param method Either `"pointestimate"` or `"difference"`.
1212+#' @param cov.value1,cov.value2 Values used when `method = "difference"`.
1313+#' @param moderator Optional moderator variable name.
1414+#' @param moderator.value Optional moderator value.
1515+#' @param npoints Integer scalar, number of grid points for point estimates.
1616+#' @param nsims Integer scalar, number of simulation draws.
1717+#' @param ci.level Numeric scalar in (0, 1).
1818+#' @param custom.labels Optional topic labels.
1919+#' @param labeltype Label type passed to STM internals.
2020+#' @param n,frexw Additional label parameters passed to STM internals.
2121+#' @return A tidy data.table of simulated topic effects.
2222+#'
2323+#' @details
2424+#' This function relies on non-exported STM internals (`stm:::`) to generate
2525+#' the contrast matrix, simulate coefficients, and produce labels.
2626+#'
2727+#' @examples
2828+#' \dontrun{
2929+#' # effects <- extract_topic_effects(est, covariate = "year", nsims = 200)
3030+#' }
3131+#'
3232+#' @export
3333+extract_topic_effects <- function(
3434+ x,
3535+ covariate,
3636+ model = NULL,
3737+ topics = x$topics,
3838+ method = "pointestimate",
3939+ cov.value1 = NULL,
4040+ cov.value2 = NULL,
4141+ moderator = NULL,
4242+ moderator.value = NULL,
4343+ npoints = 500,
4444+ nsims = 500,
4545+ ci.level = 0.95,
4646+ custom.labels = NULL,
4747+ labeltype = "numbers",
4848+ n = 7,
4949+ frexw = 0.5
5050+) {
5151+ cthis <- stm:::produce_cmatrix(
5252+ prep = x,
5353+ covariate = covariate,
5454+ method = method,
5555+ cov.value1 = cov.value1,
5656+ cov.value2 = cov.value2,
5757+ moderator = moderator,
5858+ npoints = npoints,
5959+ moderator.value = moderator.value
6060+ )
6161+6262+ simbetas <- stm:::simBetas(parameters = x$parameters, nsims = nsims)
6363+ uvals <- cthis$cdata[[covariate]]
6464+ offset <- (1 - ci.level) / 2
6565+6666+ labels <- stm:::createLabels(
6767+ labeltype = labeltype,
6868+ covariate = covariate,
6969+ method = method,
7070+ cdata = cthis$cdata,
7171+ cov.value1 = cov.value1,
7272+ cov.value2 = cov.value2,
7373+ model = model,
7474+ n = n,
7575+ topics = x$topics,
7676+ custom.labels = custom.labels,
7777+ frexw = frexw
7878+ )
7979+8080+ out <- lapply(topics, function(i) {
8181+ sims <- cthis$cmatrix %*% t(simbetas[[which(x$topics == i)]])
8282+8383+ if (method == "difference") {
8484+ diff <- sims[1, ] - sims[2, ]
8585+ out_inner <- data.frame(
8686+ method = method,
8787+ topic = i,
8888+ covariate = covariate,
8989+ covariate.value = paste0(cov.value1, "-", cov.value2),
9090+ estimate = mean(diff),
9191+ std.error = stats::sd(diff),
9292+ ci.level = ci.level,
9393+ ci.lower = stats::quantile(diff, offset),
9494+ ci.upper = stats::quantile(diff, 1 - offset),
9595+ label = labels[which(x$topics == i)]
9696+ )
9797+ } else {
9898+ out_inner <- data.frame(
9999+ method = method,
100100+ topic = i,
101101+ covariate = covariate,
102102+ covariate.value = uvals,
103103+ estimate = rowMeans(sims),
104104+ std.error = apply(sims, 1, stats::sd),
105105+ ci.level = ci.level,
106106+ ci.lower = apply(sims, 1, stats::quantile, probs = offset),
107107+ ci.upper = apply(sims, 1, stats::quantile, probs = (1 - offset)),
108108+ label = labels[which(x$topics == i)]
109109+ )
110110+ }
111111+112112+ if (!is.null(moderator)) {
113113+ out_inner$moderator <- moderator
114114+ out_inner$moderator.value <- moderator.value
115115+ }
116116+117117+ rownames(out_inner) <- NULL
118118+ out_inner
119119+ })
120120+121121+ out <- do.call("rbind", out)
122122+ data.table::setDT(out)
123123+ out
124124+}
+187
R/topic_prevalence_plots.R
···11+#' Plot average topic prevalence from an STM model
22+#'
33+#' @title Plot average topic prevalence
44+#' @description Compute average prevalence from `theta` and draw a horizontal
55+#' bar chart. Optionally save the figure.
66+#'
77+#' @param stm_model STM model object with `theta` matrix.
88+#' @param topic_labels Table with columns `topic` and `label`.
99+#' @param out_file Optional output path used when `save = TRUE`.
1010+#' @param save Logical scalar, whether to save the figure.
1111+#' @param width,height Numeric scalar dimensions in inches for saved file.
1212+#' @return A list with `topic_prev` (data.table) and `plot` (ggplot object).
1313+#'
1414+#' @examples
1515+#' stm_model <- list(theta = matrix(c(0.2, 0.8, 0.4, 0.6), nrow = 2, byrow = TRUE))
1616+#' topic_labels <- data.frame(topic = 1:2, label = c("Labor", "Trade"))
1717+#' plot_topic_prevalence(stm_model, topic_labels, save = FALSE)
1818+#'
1919+#' @export
2020+plot_topic_prevalence <- function(
2121+ stm_model,
2222+ topic_labels,
2323+ out_file = NULL,
2424+ save = TRUE,
2525+ width = 25,
2626+ height = 20
2727+) {
2828+ if (is.null(stm_model$theta)) {
2929+ stop(
3030+ "`stm_model$theta` not found. Fit STM with theta available before plotting.",
3131+ call. = FALSE
3232+ )
3333+ }
3434+3535+ theta <- stm_model$theta
3636+3737+ topic_prev <- data.table::data.table(
3838+ topic = seq_len(ncol(theta)),
3939+ prevalence = colMeans(theta)
4040+ )
4141+4242+ topic_labels_dt <- data.table::as.data.table(topic_labels)
4343+ topic_prev <- merge(topic_prev, topic_labels_dt, by = "topic", all.x = TRUE)
4444+ data.table::setorder(topic_prev, -prevalence)
4545+4646+ p <- ggplot2::ggplot(
4747+ topic_prev,
4848+ ggplot2::aes(x = forcats::fct_reorder(label, prevalence), y = prevalence)
4949+ ) +
5050+ ggplot2::geom_col() +
5151+ ggplot2::coord_flip() +
5252+ ggplot2::scale_y_continuous(
5353+ labels = scales::percent_format(),
5454+ expand = c(0, 0)
5555+ ) +
5656+ ggplot2::labs(
5757+ x = "Topic (top words)",
5858+ y = "Average topic prevalence",
5959+ title = "Topic prevalence (average across docs)"
6060+ ) +
6161+ ggplot2::theme_minimal(base_size = 20)
6262+6363+ if (isTRUE(save)) {
6464+ if (is.null(out_file)) {
6565+ out_file <- here::here(
6666+ "figures",
6767+ "topic_model",
6868+ "stm_target_journals_topic_prevalence.png"
6969+ )
7070+ }
7171+7272+ ggplot2::ggsave(
7373+ plot = p,
7474+ filename = out_file,
7575+ width = width,
7676+ height = height,
7777+ units = "in",
7878+ dpi = 300
7979+ )
8080+ }
8181+8282+ list(topic_prev = topic_prev, plot = p)
8383+}
8484+8585+#' Plot topic prevalence over a covariate grid
8686+#'
8787+#' @title Plot topic prevalence over time
8888+#' @description Compute simulated topic effects over a covariate grid and
8989+#' draw faceted trajectories with confidence ribbons.
9090+#'
9191+#' @param estimations STM estimate object accepted by `extract_topic_effects()`.
9292+#' @param topic_labels Table with columns `topic` and `label`.
9393+#' @param covariate Character scalar, covariate used on x-axis.
9494+#' @param nsims Integer scalar, number of simulation draws.
9595+#' @param out_file Output path for `ggsave()`.
9696+#' @param width,height Numeric scalar dimensions in inches for saved file.
9797+#' @param wrap_width Integer scalar for wrapped facet labels.
9898+#' @param base_size Numeric scalar for ggplot theme size.
9999+#' @return An invisible list with `years_estimate`, `plot`, and `out_file`.
100100+#'
101101+#' @examples
102102+#' \dontrun{
103103+#' # plot_topic_prevalence_over_time(est, topic_labels, covariate = "year", out_file = "topic_year.png")
104104+#' }
105105+#'
106106+#' @export
107107+plot_topic_prevalence_over_time <- function(
108108+ estimations,
109109+ topic_labels,
110110+ covariate = "year",
111111+ nsims = 500,
112112+ out_file = NULL,
113113+ width = 30,
114114+ height = 25,
115115+ wrap_width = 30,
116116+ base_size = 20
117117+) {
118118+ years_estimate <- extract_topic_effects(
119119+ x = estimations,
120120+ covariate = covariate,
121121+ method = "pointestimate",
122122+ npoints = 500,
123123+ nsims = nsims
124124+ )
125125+126126+ years_estimate <- merge(
127127+ years_estimate,
128128+ topic_labels,
129129+ by = "topic",
130130+ all.x = TRUE
131131+ )
132132+133133+ if ("label.y" %in% names(years_estimate)) {
134134+ data.table::setnames(years_estimate, "label.y", "topic_label")
135135+ } else if ("label" %in% names(years_estimate)) {
136136+ data.table::setnames(years_estimate, "label", "topic_label")
137137+ }
138138+139139+ if ("label.x" %in% names(years_estimate)) {
140140+ years_estimate[, label.x := NULL]
141141+ }
142142+143143+ years_estimate[, topic := as.integer(topic)]
144144+ years_estimate[,
145145+ topic_label := stringr::str_wrap(
146146+ as.character(topic_label),
147147+ width = wrap_width
148148+ )
149149+ ]
150150+ years_estimate <- years_estimate[order(topic)]
151151+ years_estimate[,
152152+ topic_label := factor(topic_label, levels = unique(topic_label))
153153+ ]
154154+155155+ p <- ggplot2::ggplot(
156156+ years_estimate,
157157+ ggplot2::aes(x = as.numeric(covariate.value), y = estimate)
158158+ ) +
159159+ ggplot2::geom_line() +
160160+ ggplot2::geom_ribbon(
161161+ ggplot2::aes(ymin = ci.lower, ymax = ci.upper),
162162+ alpha = 0.4
163163+ ) +
164164+ ggplot2::labs(x = NULL, y = "Estimated topic prevalence over time") +
165165+ ggplot2::facet_wrap(~topic_label) +
166166+ ggplot2::scale_x_continuous(expand = c(0, 0)) +
167167+ ggplot2::scale_y_continuous(
168168+ expand = c(0, 0),
169169+ labels = scales::percent_format()
170170+ ) +
171171+ ggplot2::theme_minimal(base_size = base_size)
172172+173173+ ggplot2::ggsave(
174174+ plot = p,
175175+ filename = out_file,
176176+ width = width,
177177+ height = height,
178178+ units = "in",
179179+ dpi = 300
180180+ )
181181+182182+ invisible(list(
183183+ years_estimate = years_estimate,
184184+ plot = p,
185185+ out_file = out_file
186186+ ))
187187+}
+542
R/topic_term_measures.R
···11+#' Calculate FREX scores and return top terms per topic
22+#'
33+#' @title Calculate FREX top terms
44+#' @description Compute FREX scores from `model$beta$logbeta[[1]]` and return
55+#' top terms per topic.
66+#'
77+#' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`.
88+#' @param nb_terms Integer scalar, number of terms to return per topic.
99+#' @param w Numeric scalar in \[0, 1\], FREX weighting parameter.
1010+#' @return A data.table with top FREX terms per topic.
1111+#'
1212+#' @examples
1313+# ' logbeta <- matrix(rnorm(10), nrow = 5)
1414+# ' mdl <- list(beta = list(logbeta = list(logbeta)), vocab = paste0("w", 1:5))
1515+# ' calculate_frex(mdl, nb_terms = 3)
1616+#'
1717+#' @export
1818+calculate_frex <- function(model, nb_terms = 10L, w = 0.5) {
1919+ if (
2020+ !is.list(model) ||
2121+ is.null(model$beta) ||
2222+ is.null(model$beta$logbeta) ||
2323+ length(model$beta$logbeta) < 1 ||
2424+ is.null(model$vocab)
2525+ ) {
2626+ stop("`model` must contain `beta$logbeta[[1]]` and `vocab`.", call. = FALSE)
2727+ }
2828+ if (!is.numeric(w) || length(w) != 1L || w < 0 || w > 1) {
2929+ stop("`w` must be a numeric scalar in [0, 1].", call. = FALSE)
3030+ }
3131+3232+ nb_terms <- as.integer(nb_terms)
3333+ if (is.na(nb_terms) || nb_terms < 1L) {
3434+ stop("`nb_terms` must be an integer >= 1.", call. = FALSE)
3535+ }
3636+3737+ logbeta <- model$beta$logbeta[[1]]
3838+ if (!is.matrix(logbeta) || nrow(logbeta) == 0L || ncol(logbeta) == 0L) {
3939+ return(data.table::data.table(
4040+ topic = integer(0),
4141+ term = character(0),
4242+ frex = numeric(0),
4343+ rank = integer(0),
4444+ mean = numeric(0)
4545+ ))
4646+ }
4747+4848+ col_lse <- function(mat) {
4949+ apply(mat, 2, function(x) {
5050+ x_max <- max(x)
5151+ x_max + log(sum(exp(x - x_max)))
5252+ })
5353+ }
5454+ excl <- t(t(logbeta) - col_lse(logbeta))
5555+5656+ freqscore <- apply(logbeta, 1, data.table::frank) / ncol(logbeta)
5757+ exclscore <- apply(excl, 1, data.table::frank) / ncol(logbeta)
5858+5959+ if (!identical(dim(freqscore), dim(logbeta))) {
6060+ freqscore <- t(freqscore)
6161+ }
6262+ if (!identical(dim(exclscore), dim(logbeta))) {
6363+ exclscore <- t(exclscore)
6464+ }
6565+6666+ frex_mat <- 1 / (w / freqscore + (1 - w) / exclscore)
6767+6868+ frex_dt <- data.table::as.data.table(t(frex_mat))
6969+ data.table::setnames(frex_dt, paste0("V", seq_len(ncol(frex_dt))))
7070+ frex_dt[, term := as.character(model$vocab)]
7171+7272+ frex_long <- data.table::melt(
7373+ frex_dt,
7474+ id.vars = "term",
7575+ variable.name = "topic",
7676+ value.name = "frex",
7777+ variable.factor = FALSE
7878+ )
7979+ frex_long[, topic := as.integer(sub("^V", "", topic))]
8080+8181+ data.table::setorder(frex_long, topic, -frex)
8282+ top_dt <- frex_long[, head(.SD, nb_terms), by = topic]
8383+8484+ top_dt[, rank := seq_len(.N), by = topic]
8585+ top_dt[, mean := mean(frex), by = topic]
8686+8787+ top_dt[, .(
8888+ topic = as.integer(topic),
8989+ term = as.character(term),
9090+ frex = as.numeric(frex),
9191+ rank = as.integer(rank),
9292+ mean = as.numeric(mean)
9393+ )]
9494+}
9595+9696+#' Compute mean FREX across topics
9797+#'
9898+#' @title Compute mean FREX
9999+#' @description Compute the average of per-topic mean FREX values.
100100+#'
101101+#' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`.
102102+#' @param nb_terms Integer scalar, number of terms per topic used for averaging.
103103+#' @param w Numeric scalar in \[0, 1], FREX weighting parameter.
104104+#' @return Numeric scalar mean FREX across topics.
105105+#' @export
106106+compute_mean_frex <- function(model, nb_terms = 10L, w = 0.5) {
107107+ frex_dt <- calculate_frex(model, nb_terms = nb_terms, w = w)
108108+ mean(unique(frex_dt[, .(topic, mean)])$mean)
109109+}
110110+111111+#' Backward-compatible alias for mean FREX
112112+#'
113113+#' @title Average FREX (alias)
114114+#' @description Alias to `compute_mean_frex()` for backward compatibility.
115115+#'
116116+#' @inheritParams compute_mean_frex
117117+#' @return Numeric scalar mean FREX across topics.
118118+#' @export
119119+average_frex <- function(model, nb_terms = 10L, w = 0.5) {
120120+ compute_mean_frex(model = model, nb_terms = nb_terms, w = w)
121121+}
122122+123123+#' Extract top terms per topic by beta
124124+#'
125125+#' @title Calculate beta top terms
126126+#' @description Convert log-probabilities to probabilities and return top terms
127127+#' per topic.
128128+#'
129129+#' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`.
130130+#' @param nb_terms Integer scalar, number of terms to return per topic.
131131+#' @return A data.table with top beta terms per topic.
132132+#' @export
133133+calculate_beta <- function(model, nb_terms = 10L) {
134134+ if (
135135+ !is.list(model) ||
136136+ is.null(model$beta) ||
137137+ is.null(model$beta$logbeta) ||
138138+ length(model$beta$logbeta) < 1 ||
139139+ is.null(model$vocab)
140140+ ) {
141141+ stop("`model` must contain `beta$logbeta[[1]]` and `vocab`.", call. = FALSE)
142142+ }
143143+144144+ nb_terms <- as.integer(nb_terms)
145145+ if (is.na(nb_terms) || nb_terms < 1L) {
146146+ stop("`nb_terms` must be an integer >= 1.", call. = FALSE)
147147+ }
148148+149149+ logbeta <- model$beta$logbeta[[1]]
150150+ if (!is.matrix(logbeta) || nrow(logbeta) == 0L || ncol(logbeta) == 0L) {
151151+ return(data.table::data.table(
152152+ topic = integer(0),
153153+ term = character(0),
154154+ beta = numeric(0),
155155+ rank = integer(0)
156156+ ))
157157+ }
158158+159159+ beta_mat <- exp(logbeta)
160160+161161+ beta_dt <- data.table::as.data.table(t(beta_mat))
162162+ data.table::setnames(beta_dt, paste0("V", seq_len(ncol(beta_dt))))
163163+ beta_dt[, term := as.character(model$vocab)]
164164+165165+ beta_long <- data.table::melt(
166166+ beta_dt,
167167+ id.vars = "term",
168168+ variable.name = "topic",
169169+ value.name = "beta",
170170+ variable.factor = FALSE
171171+ )
172172+ beta_long[, topic := as.integer(sub("^V", "", topic))]
173173+174174+ data.table::setorder(beta_long, topic, -beta)
175175+ top_dt <- beta_long[, head(.SD, nb_terms), by = topic]
176176+ top_dt[, rank := seq_len(.N), by = topic]
177177+178178+ top_dt[, .(
179179+ topic = as.integer(topic),
180180+ term = as.character(term),
181181+ beta = as.numeric(beta),
182182+ rank = as.integer(rank)
183183+ )]
184184+}
185185+186186+#' Calculate LDA-style score for terms per topic
187187+#'
188188+#' @title Calculate score top terms
189189+#' @description Compute an LDA-style score and return top terms by topic.
190190+#'
191191+#' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`.
192192+#' @param nb_terms Integer scalar, number of terms to return per topic.
193193+#' @return A data.table with top score terms per topic.
194194+#' @export
195195+calculate_score <- function(model, nb_terms = 10L) {
196196+ if (
197197+ !is.list(model) ||
198198+ is.null(model$beta) ||
199199+ is.null(model$beta$logbeta) ||
200200+ length(model$beta$logbeta) < 1 ||
201201+ is.null(model$vocab)
202202+ ) {
203203+ stop("`model` must contain `beta$logbeta[[1]]` and `vocab`.", call. = FALSE)
204204+ }
205205+206206+ nb_terms <- as.integer(nb_terms)
207207+ if (is.na(nb_terms) || nb_terms < 1L) {
208208+ stop("`nb_terms` must be an integer >= 1.", call. = FALSE)
209209+ }
210210+211211+ logbeta <- model$beta$logbeta[[1]]
212212+ if (!is.matrix(logbeta) || nrow(logbeta) == 0L || ncol(logbeta) == 0L) {
213213+ return(data.table::data.table(
214214+ topic = integer(0),
215215+ term = character(0),
216216+ score = numeric(0),
217217+ rank = integer(0)
218218+ ))
219219+ }
220220+221221+ col_means <- colMeans(logbeta)
222222+ score_mat <- exp(logbeta) *
223223+ (logbeta -
224224+ matrix(
225225+ col_means,
226226+ nrow = nrow(logbeta),
227227+ ncol = ncol(logbeta),
228228+ byrow = TRUE
229229+ ))
230230+231231+ score_dt <- data.table::as.data.table(t(score_mat))
232232+ data.table::setnames(score_dt, paste0("V", seq_len(ncol(score_dt))))
233233+ score_dt[, term := as.character(model$vocab)]
234234+235235+ score_long <- data.table::melt(
236236+ score_dt,
237237+ id.vars = "term",
238238+ variable.name = "topic",
239239+ value.name = "score",
240240+ variable.factor = FALSE
241241+ )
242242+ score_long[, topic := as.integer(sub("^V", "", topic))]
243243+244244+ data.table::setorder(score_long, topic, -score)
245245+ top_dt <- score_long[, head(.SD, nb_terms), by = topic]
246246+ top_dt[, rank := seq_len(.N), by = topic]
247247+248248+ top_dt[, .(
249249+ topic = as.integer(topic),
250250+ term = as.character(term),
251251+ score = as.numeric(score),
252252+ rank = as.integer(rank)
253253+ )]
254254+}
255255+256256+#' Compute STM term counts from `documents`
257257+#'
258258+#' @title Compute STM term counts
259259+#' @description Aggregate token counts from `stm_data$documents` into a term
260260+#' count table.
261261+#'
262262+#' @param stm_data A list-like stm object with `documents` and `vocab`.
263263+#' @return A data.table with columns `term` and `count`.
264264+#' @export
265265+compute_stm_term_counts <- function(stm_data) {
266266+ if (is.null(stm_data$documents) || length(stm_data$documents) == 0) {
267267+ return(data.table::data.table(term = character(), count = integer()))
268268+ }
269269+270270+ doc_term_list <- lapply(stm_data$documents, function(doc) {
271271+ if (is.null(doc) || length(doc) == 0) {
272272+ return(NULL)
273273+ }
274274+ m <- as.matrix(doc)
275275+ data.table::data.table(
276276+ term_index = as.integer(m[1, ]),
277277+ count = as.integer(m[2, ])
278278+ )
279279+ })
280280+281281+ doc_term_list <- Filter(Negate(is.null), doc_term_list)
282282+ if (length(doc_term_list) == 0L) {
283283+ return(data.table::data.table(term = character(), count = integer()))
284284+ }
285285+286286+ term_counts <- data.table::rbindlist(doc_term_list)
287287+ term_counts <- term_counts[, .(count = sum(count)), by = term_index]
288288+ term_counts[, term := stm_data$vocab[term_index]]
289289+ term_counts[, .(term, count)][order(-count)]
290290+}
291291+292292+#' Backward-compatible alias for STM term counts
293293+#'
294294+#' @title STM term counts (alias)
295295+#' @description Alias to `compute_stm_term_counts()` for backward compatibility.
296296+#'
297297+#' @inheritParams compute_stm_term_counts
298298+#' @return A data.table with columns `term` and `count`.
299299+#' @export
300300+stm_term_counts <- function(stm_data) {
301301+ compute_stm_term_counts(stm_data)
302302+}
303303+304304+#' Calculate Lift measure from STM model and empirical term counts
305305+#'
306306+#' @title Calculate lift top terms
307307+#' @description Compute lift as log topic probability relative to empirical term
308308+#' frequency and return top terms by topic.
309309+#'
310310+#' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`.
311311+#' @param list_terms data.frame/data.table with columns `term` and `count`.
312312+#' @param model_data Optional STM data; used if `list_terms` is NULL.
313313+#' @param nb_terms Integer scalar, number of terms to return per topic.
314314+#' @return A data.table with top lift terms per topic.
315315+#' @export
316316+calculate_lift <- function(
317317+ model,
318318+ list_terms = NULL,
319319+ model_data = NULL,
320320+ nb_terms = 10L
321321+) {
322322+ if (
323323+ !is.list(model) ||
324324+ is.null(model$beta) ||
325325+ is.null(model$beta$logbeta) ||
326326+ length(model$beta$logbeta) < 1 ||
327327+ is.null(model$vocab)
328328+ ) {
329329+ stop("`model` must contain `beta$logbeta[[1]]` and `vocab`.", call. = FALSE)
330330+ }
331331+332332+ nb_terms <- as.integer(nb_terms)
333333+ if (is.na(nb_terms) || nb_terms < 1L) {
334334+ stop("`nb_terms` must be an integer >= 1.", call. = FALSE)
335335+ }
336336+337337+ logbeta <- model$beta$logbeta[[1]]
338338+ if (!is.matrix(logbeta) || nrow(logbeta) == 0L || ncol(logbeta) == 0L) {
339339+ return(data.table::data.table(
340340+ topic = integer(0),
341341+ term = character(0),
342342+ lift = numeric(0),
343343+ rank = integer(0)
344344+ ))
345345+ }
346346+347347+ if (is.null(list_terms)) {
348348+ if (!is.null(model_data)) {
349349+ cli::cli_alert_info(
350350+ "`list_terms` is NULL. Calculating counts with `compute_stm_term_counts()`."
351351+ )
352352+ lt_dt <- compute_stm_term_counts(model_data)
353353+ } else {
354354+ cli::cli_abort(
355355+ "`list_terms` is NULL and `model_data` is also NULL. Cannot compute empirical term counts."
356356+ )
357357+ }
358358+ } else {
359359+ lt_dt <- data.table::as.data.table(list_terms)
360360+ if (!all(c("term", "count") %in% colnames(lt_dt))) {
361361+ stop(
362362+ "`list_terms` must contain columns 'term' and 'count'.",
363363+ call. = FALSE
364364+ )
365365+ }
366366+ lt_dt <- lt_dt[, .(count = sum(as.numeric(count))), by = term]
367367+ }
368368+369369+ vocab <- as.character(model$vocab)
370370+ counts_aligned <- lt_dt[match(vocab, lt_dt$term), count]
371371+ counts_aligned[is.na(counts_aligned)] <- 0
372372+ total_counts <- sum(counts_aligned, na.rm = TRUE)
373373+374374+ if (total_counts == 0) {
375375+ return(data.table::data.table(
376376+ topic = integer(0),
377377+ term = character(0),
378378+ lift = numeric(0),
379379+ rank = integer(0)
380380+ ))
381381+ }
382382+383383+ if (nrow(logbeta) == length(vocab)) {
384384+ logbeta_terms <- logbeta
385385+ } else if (ncol(logbeta) == length(vocab)) {
386386+ logbeta_terms <- t(logbeta)
387387+ } else {
388388+ stop(
389389+ "Cannot align logbeta and vocab: neither rows nor columns match vocab length.",
390390+ call. = FALSE
391391+ )
392392+ }
393393+394394+ emp_log <- log(counts_aligned) - log(total_counts)
395395+ emp_log_mat <- matrix(
396396+ emp_log,
397397+ nrow = nrow(logbeta_terms),
398398+ ncol = ncol(logbeta_terms),
399399+ byrow = FALSE
400400+ )
401401+402402+ lift_mat <- logbeta_terms - emp_log_mat
403403+404404+ lift_dt_wide <- data.table::as.data.table(lift_mat)
405405+ data.table::setnames(lift_dt_wide, paste0("V", seq_len(ncol(lift_dt_wide))))
406406+ lift_dt_wide[, term := vocab]
407407+408408+ lift_long <- data.table::melt(
409409+ lift_dt_wide,
410410+ id.vars = "term",
411411+ variable.name = "topic",
412412+ value.name = "lift",
413413+ variable.factor = FALSE
414414+ )
415415+ lift_long[, topic := as.integer(sub("^V", "", topic))]
416416+417417+ data.table::setorder(lift_long, topic, -lift)
418418+ top_dt <- lift_long[, head(.SD, nb_terms), by = topic]
419419+ top_dt[, rank := seq_len(.N), by = topic]
420420+421421+ top_dt[, .(
422422+ topic = as.integer(topic),
423423+ term = as.character(term),
424424+ lift = as.numeric(lift),
425425+ rank = as.integer(rank)
426426+ )]
427427+}
428428+429429+#' Extract top terms across selected measures
430430+#'
431431+#' @title Extract top terms across measures
432432+#' @description Combine top terms from one or several measures (`frex`, `beta`,
433433+#' `score`, `lift`) into one tidy table.
434434+#'
435435+#' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`.
436436+#' @param list_terms Optional empirical term counts (`term`, `count`).
437437+#' @param model_data Optional STM data used when `list_terms` is NULL for lift.
438438+#' @param nb_terms Integer scalar, number of terms per topic and measure.
439439+#' @param frex_weight Numeric scalar in \[0, 1\] for FREX weighting.
440440+#' @param measures Character vector of measures among `frex`, `beta`, `score`, `lift`.
441441+#' @return A data.table with columns `topic`, `term`, `value`, `rank`, `measure`.
442442+#'
443443+#' @examples
444444+# ' logbeta <- matrix(rnorm(10), nrow = 5)
445445+# ' mdl <- list(beta = list(logbeta = list(logbeta)), vocab = paste0("w", 1:5))
446446+# ' extract_top_terms(mdl, nb_terms = 2, measures = c("beta", "score"))
447447+#'
448448+#' @export
449449+extract_top_terms <- function(
450450+ model,
451451+ list_terms = NULL,
452452+ model_data = NULL,
453453+ nb_terms = 10L,
454454+ frex_weight = 0.5,
455455+ measures = c("frex", "beta", "score", "lift")
456456+) {
457457+ allowed_measures <- c("frex", "beta", "score", "lift")
458458+ measures <- tolower(as.character(measures))
459459+460460+ if (length(measures) == 0L) {
461461+ stop(
462462+ "`measures` must contain at least one measure from: frex, beta, score, lift.",
463463+ call. = FALSE
464464+ )
465465+ }
466466+ if (!all(measures %in% allowed_measures)) {
467467+ stop(
468468+ "Invalid measure(s): ",
469469+ paste(setdiff(measures, allowed_measures), collapse = ", "),
470470+ ". Allowed: frex, beta, score, lift.",
471471+ call. = FALSE
472472+ )
473473+ }
474474+475475+ nb_terms <- as.integer(nb_terms)
476476+ if (is.na(nb_terms) || nb_terms < 1L) {
477477+ stop("`nb_terms` must be an integer >= 1.", call. = FALSE)
478478+ }
479479+480480+ results_list <- list()
481481+482482+ if ("frex" %in% measures) {
483483+ frex_dt <- calculate_frex(model, nb_terms = nb_terms, w = frex_weight)
484484+ if (nrow(frex_dt) > 0L) {
485485+ frex_dt[, measure := "frex"]
486486+ data.table::setnames(frex_dt, "frex", "value")
487487+ if ("mean" %in% names(frex_dt)) {
488488+ frex_dt[, mean := NULL]
489489+ }
490490+ frex_dt <- frex_dt[, .(topic, term, value, rank, measure)]
491491+ }
492492+ results_list[["frex"]] <- frex_dt
493493+ }
494494+495495+ if ("beta" %in% measures) {
496496+ beta_dt <- calculate_beta(model, nb_terms = nb_terms)
497497+ if (nrow(beta_dt) > 0L) {
498498+ beta_dt[, measure := "beta"]
499499+ data.table::setnames(beta_dt, "beta", "value")
500500+ beta_dt <- beta_dt[, .(topic, term, value, rank, measure)]
501501+ }
502502+ results_list[["beta"]] <- beta_dt
503503+ }
504504+505505+ if ("score" %in% measures) {
506506+ score_dt <- calculate_score(model, nb_terms = nb_terms)
507507+ if (nrow(score_dt) > 0L) {
508508+ score_dt[, measure := "score"]
509509+ data.table::setnames(score_dt, "score", "value")
510510+ score_dt <- score_dt[, .(topic, term, value, rank, measure)]
511511+ }
512512+ results_list[["score"]] <- score_dt
513513+ }
514514+515515+ if ("lift" %in% measures) {
516516+ lift_dt <- calculate_lift(
517517+ model = model,
518518+ list_terms = list_terms,
519519+ model_data = model_data,
520520+ nb_terms = nb_terms
521521+ )
522522+ if (nrow(lift_dt) > 0L) {
523523+ lift_dt[, measure := "lift"]
524524+ data.table::setnames(lift_dt, "lift", "value")
525525+ lift_dt <- lift_dt[, .(topic, term, value, rank, measure)]
526526+ }
527527+ results_list[["lift"]] <- lift_dt
528528+ }
529529+530530+ present <- Filter(function(x) !is.null(x), results_list)
531531+ if (length(present) == 0L) {
532532+ return(data.table::data.table(
533533+ topic = integer(0),
534534+ term = character(0),
535535+ value = numeric(0),
536536+ rank = integer(0),
537537+ measure = character(0)
538538+ ))
539539+ }
540540+541541+ data.table::rbindlist(present, use.names = TRUE, fill = TRUE)
542542+}
+24
man/average_frex.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{average_frex}
44+\alias{average_frex}
55+\title{Average FREX (alias)}
66+\usage{
77+average_frex(model, nb_terms = 10L, w = 0.5)
88+}
99+\arguments{
1010+\item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.}
1111+1212+\item{nb_terms}{Integer scalar, number of terms per topic used for averaging.}
1313+1414+\item{w}{Numeric scalar in [0, 1], FREX weighting parameter.}
1515+}
1616+\value{
1717+Numeric scalar mean FREX across topics.
1818+}
1919+\description{
2020+Alias to \code{compute_mean_frex()} for backward compatibility.
2121+}
2222+\details{
2323+Backward-compatible alias for mean FREX
2424+}
+23
man/calculate_beta.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{calculate_beta}
44+\alias{calculate_beta}
55+\title{Calculate beta top terms}
66+\usage{
77+calculate_beta(model, nb_terms = 10L)
88+}
99+\arguments{
1010+\item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.}
1111+1212+\item{nb_terms}{Integer scalar, number of terms to return per topic.}
1313+}
1414+\value{
1515+A data.table with top beta terms per topic.
1616+}
1717+\description{
1818+Convert log-probabilities to probabilities and return top terms
1919+per topic.
2020+}
2121+\details{
2222+Extract top terms per topic by beta
2323+}
+25
man/calculate_frex.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{calculate_frex}
44+\alias{calculate_frex}
55+\title{Calculate FREX top terms}
66+\usage{
77+calculate_frex(model, nb_terms = 10L, w = 0.5)
88+}
99+\arguments{
1010+\item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.}
1111+1212+\item{nb_terms}{Integer scalar, number of terms to return per topic.}
1313+1414+\item{w}{Numeric scalar in [0, 1], FREX weighting parameter.}
1515+}
1616+\value{
1717+A data.table with top FREX terms per topic.
1818+}
1919+\description{
2020+Compute FREX scores from \code{model$beta$logbeta[[1]]} and return
2121+top terms per topic.
2222+}
2323+\details{
2424+Calculate FREX scores and return top terms per topic
2525+}
+27
man/calculate_lift.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{calculate_lift}
44+\alias{calculate_lift}
55+\title{Calculate lift top terms}
66+\usage{
77+calculate_lift(model, list_terms = NULL, model_data = NULL, nb_terms = 10L)
88+}
99+\arguments{
1010+\item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.}
1111+1212+\item{list_terms}{data.frame/data.table with columns \code{term} and \code{count}.}
1313+1414+\item{model_data}{Optional STM data; used if \code{list_terms} is NULL.}
1515+1616+\item{nb_terms}{Integer scalar, number of terms to return per topic.}
1717+}
1818+\value{
1919+A data.table with top lift terms per topic.
2020+}
2121+\description{
2222+Compute lift as log topic probability relative to empirical term
2323+frequency and return top terms by topic.
2424+}
2525+\details{
2626+Calculate Lift measure from STM model and empirical term counts
2727+}
+22
man/calculate_score.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{calculate_score}
44+\alias{calculate_score}
55+\title{Calculate score top terms}
66+\usage{
77+calculate_score(model, nb_terms = 10L)
88+}
99+\arguments{
1010+\item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.}
1111+1212+\item{nb_terms}{Integer scalar, number of terms to return per topic.}
1313+}
1414+\value{
1515+A data.table with top score terms per topic.
1616+}
1717+\description{
1818+Compute an LDA-style score and return top terms by topic.
1919+}
2020+\details{
2121+Calculate LDA-style score for terms per topic
2222+}
+24
man/compute_mean_frex.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{compute_mean_frex}
44+\alias{compute_mean_frex}
55+\title{Compute mean FREX}
66+\usage{
77+compute_mean_frex(model, nb_terms = 10L, w = 0.5)
88+}
99+\arguments{
1010+\item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.}
1111+1212+\item{nb_terms}{Integer scalar, number of terms per topic used for averaging.}
1313+1414+\item{w}{Numeric scalar in [0, 1], FREX weighting parameter.}
1515+}
1616+\value{
1717+Numeric scalar mean FREX across topics.
1818+}
1919+\description{
2020+Compute the average of per-topic mean FREX values.
2121+}
2222+\details{
2323+Compute mean FREX across topics
2424+}
+21
man/compute_stm_term_counts.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{compute_stm_term_counts}
44+\alias{compute_stm_term_counts}
55+\title{Compute STM term counts}
66+\usage{
77+compute_stm_term_counts(stm_data)
88+}
99+\arguments{
1010+\item{stm_data}{A list-like stm object with \code{documents} and \code{vocab}.}
1111+}
1212+\value{
1313+A data.table with columns \code{term} and \code{count}.
1414+}
1515+\description{
1616+Aggregate token counts from \code{stm_data$documents} into a term
1717+count table.
1818+}
1919+\details{
2020+Compute STM term counts from \code{documents}
2121+}
+43
man/compute_tf_idf.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/compute_tf_idf.R
33+\name{compute_tf_idf}
44+\alias{compute_tf_idf}
55+\title{Compute TF–IDF for tokenized data}
66+\usage{
77+compute_tf_idf(
88+ df,
99+ token_col = "token",
1010+ document_col = "document",
1111+ weight_col = NULL
1212+)
1313+}
1414+\arguments{
1515+\item{df}{A \code{data.frame} or \code{data.table} with token data.}
1616+1717+\item{token_col}{Character scalar. Column name for token values.}
1818+1919+\item{document_col}{Character scalar or vector. Column(s) defining document ID.}
2020+2121+\item{weight_col}{Optional character scalar with numeric weights (for example
2222+aggregated token counts). If \code{NULL}, each row contributes weight \code{1}.}
2323+}
2424+\value{
2525+A \code{data.table} with one row per token-document pair and computed
2626+columns including \code{corpus_tf}, \code{nb_doc_word}, \code{df}, \code{idf}, and \code{tf_idf}.
2727+If \code{weight_col} is provided, \code{weighted_tf} is returned instead of \code{tf}.
2828+}
2929+\description{
3030+Compute term frequency (TF), inverse document frequency (IDF),
3131+and TF–IDF from tokenized data.
3232+}
3333+\details{
3434+Compute TF–IDF (optionally weighted) for tokens per document
3535+}
3636+\examples{
3737+dt <- data.table::data.table(
3838+ document = c("d1", "d1", "d2"),
3939+ token = c("labor", "market", "labor")
4040+)
4141+compute_tf_idf(dt)
4242+4343+}
+58
man/extract_ngrams.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/extract_ngrams.R
33+\name{extract_ngrams}
44+\alias{extract_ngrams}
55+\title{Extract n-grams from text (document-level only)}
66+\usage{
77+extract_ngrams(
88+ df,
99+ ngrams = 2L,
1010+ grouping_cols = NULL,
1111+ text_col = "text",
1212+ min_nchar = 2L,
1313+ stop_words = NULL,
1414+ chunk_size = NULL
1515+)
1616+}
1717+\arguments{
1818+\item{df}{A \code{data.frame} or \code{data.table} containing the input text.
1919+This function converts \code{df} to \code{data.table} in-place.}
2020+2121+\item{ngrams}{Integer scalar or vector. If scalar (for example \code{2L}),
2222+values from \code{1:ngrams} are computed. If vector (for example \code{c(1L, 2L)}),
2323+only those sizes are computed.}
2424+2525+\item{grouping_cols}{Character vector of grouping columns to keep in output.
2626+Use \code{NULL} for no grouping.}
2727+2828+\item{text_col}{Character scalar. Name of the column that contains text.}
2929+3030+\item{min_nchar}{Integer scalar. Minimum token length to keep.}
3131+3232+\item{stop_words}{Optional character vector of stop words.
3333+If \code{NULL}, \code{tidytext::stop_words$word} is used.}
3434+3535+\item{chunk_size}{Optional integer number of rows per processing chunk.
3636+Use \code{NULL} or \verb{<= 0} to process in a single pass.}
3737+}
3838+\value{
3939+A \code{data.table} with columns \code{grouping_cols}, \code{token}, \code{ngram},
4040+and \code{doc_freq}.
4141+}
4242+\description{
4343+Tokenise text into n-grams (unigrams, bigrams, etc.) per group,
4444+remove tokens containing digits or stop words, and return a compact
4545+data.table with one row per grouping + token + ngram and a frequency column.
4646+}
4747+\details{
4848+Extract n-grams from text with basic filtering and grouped counts
4949+5050+Tokens are lower-cased by the tokenizer. Tokens containing punctuation,
5151+digits, symbols, or stop words are removed. Multi-word tokens in output
5252+are normalized with underscores (\verb{_}).
5353+}
5454+\examples{
5555+df <- data.frame(id = 1:2, text = c("Economic growth matters", "Growth and trade"))
5656+extract_ngrams(df, ngrams = 2L, grouping_cols = "id")
5757+5858+}
+38
man/extract_top_terms.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{extract_top_terms}
44+\alias{extract_top_terms}
55+\title{Extract top terms across measures}
66+\usage{
77+extract_top_terms(
88+ model,
99+ list_terms = NULL,
1010+ model_data = NULL,
1111+ nb_terms = 10L,
1212+ frex_weight = 0.5,
1313+ measures = c("frex", "beta", "score", "lift")
1414+)
1515+}
1616+\arguments{
1717+\item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.}
1818+1919+\item{list_terms}{Optional empirical term counts (\code{term}, \code{count}).}
2020+2121+\item{model_data}{Optional STM data used when \code{list_terms} is NULL for lift.}
2222+2323+\item{nb_terms}{Integer scalar, number of terms per topic and measure.}
2424+2525+\item{frex_weight}{Numeric scalar in [0, 1] for FREX weighting.}
2626+2727+\item{measures}{Character vector of measures among \code{frex}, \code{beta}, \code{score}, \code{lift}.}
2828+}
2929+\value{
3030+A data.table with columns \code{topic}, \code{term}, \code{value}, \code{rank}, \code{measure}.
3131+}
3232+\description{
3333+Combine top terms from one or several measures (\code{frex}, \code{beta},
3434+\code{score}, \code{lift}) into one tidy table.
3535+}
3636+\details{
3737+Extract top terms across selected measures
3838+}
+73
man/extract_topic_effects.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_effects.R
33+\name{extract_topic_effects}
44+\alias{extract_topic_effects}
55+\title{Extract topic effects from STM simulations}
66+\usage{
77+extract_topic_effects(
88+ x,
99+ covariate,
1010+ model = NULL,
1111+ topics = x$topics,
1212+ method = "pointestimate",
1313+ cov.value1 = NULL,
1414+ cov.value2 = NULL,
1515+ moderator = NULL,
1616+ moderator.value = NULL,
1717+ npoints = 500,
1818+ nsims = 500,
1919+ ci.level = 0.95,
2020+ custom.labels = NULL,
2121+ labeltype = "numbers",
2222+ n = 7,
2323+ frexw = 0.5
2424+)
2525+}
2626+\arguments{
2727+\item{x}{STM estimate-like object (e.g. from \code{estimateEffect()}).}
2828+2929+\item{covariate}{Character scalar, covariate name.}
3030+3131+\item{model}{Optional STM model object used for labels.}
3232+3333+\item{topics}{Integer vector of topics to extract.}
3434+3535+\item{method}{Either \code{"pointestimate"} or \code{"difference"}.}
3636+3737+\item{cov.value1, cov.value2}{Values used when \code{method = "difference"}.}
3838+3939+\item{moderator}{Optional moderator variable name.}
4040+4141+\item{moderator.value}{Optional moderator value.}
4242+4343+\item{npoints}{Integer scalar, number of grid points for point estimates.}
4444+4545+\item{nsims}{Integer scalar, number of simulation draws.}
4646+4747+\item{ci.level}{Numeric scalar in (0, 1).}
4848+4949+\item{custom.labels}{Optional topic labels.}
5050+5151+\item{labeltype}{Label type passed to STM internals.}
5252+5353+\item{n, frexw}{Additional label parameters passed to STM internals.}
5454+}
5555+\value{
5656+A tidy data.table of simulated topic effects.
5757+}
5858+\description{
5959+Simulate topic effects for a covariate from an STM
6060+\code{estimateEffect()}-like object, and return tidy estimates with uncertainty.
6161+}
6262+\details{
6363+Extract simulated topic effects from an STM estimate object
6464+6565+This function relies on non-exported STM internals (\verb{stm:::}) to generate
6666+the contrast matrix, simulate coefficients, and produce labels.
6767+}
6868+\examples{
6969+\dontrun{
7070+# effects <- extract_topic_effects(est, covariate = "year", nsims = 200)
7171+}
7272+7373+}
+44
man/filter_group_quantile.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/filter_group_quantile.R
33+\name{filter_group_quantile}
44+\alias{filter_group_quantile}
55+\title{Filter rows by per-group quantile threshold}
66+\usage{
77+filter_group_quantile(
88+ dt,
99+ group_cols,
1010+ metric = "tf",
1111+ probs = 0.75,
1212+ inclusive = FALSE,
1313+ keep_threshold = FALSE
1414+)
1515+}
1616+\arguments{
1717+\item{dt}{A \code{data.frame} or \code{data.table}.}
1818+1919+\item{group_cols}{Character vector of grouping columns.}
2020+2121+\item{metric}{Character scalar. Numeric metric column to threshold.}
2222+2323+\item{probs}{Numeric scalar in \verb{(0, 1)} passed to \code{stats::quantile()}.}
2424+2525+\item{inclusive}{Logical scalar. If \code{TRUE}, uses \code{>=}; else uses \code{>}.}
2626+2727+\item{keep_threshold}{Logical scalar. If \code{TRUE}, includes the \code{threshold}
2828+column in the result.}
2929+}
3030+\value{
3131+A filtered \code{data.table}.
3232+}
3333+\description{
3434+Computes a group-specific quantile threshold from a numeric
3535+metric and returns rows above that threshold.
3636+}
3737+\details{
3838+Compute and apply a per-group quantile threshold to filter rows
3939+}
4040+\examples{
4141+dt <- data.table::data.table(group = c("a", "a", "b"), tf = c(1, 3, 2))
4242+filter_group_quantile(dt, group_cols = "group", metric = "tf", probs = 0.5)
4343+4444+}
+38
man/make_topic_labels.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/make_topic_labels.R
33+\name{make_topic_labels}
44+\alias{make_topic_labels}
55+\title{Create short topic labels from an \code{stm} model}
66+\usage{
77+make_topic_labels(
88+ stm_model,
99+ type = c("prob", "frex", "lift", "score"),
1010+ n_words = 3,
1111+ sep = ", "
1212+)
1313+}
1414+\arguments{
1515+\item{stm_model}{Fitted \code{stm} model object.}
1616+1717+\item{type}{Label type from \code{c("prob", "frex", "lift", "score")}.}
1818+1919+\item{n_words}{Integer scalar. Number of words per label.}
2020+2121+\item{sep}{Character scalar used between words.}
2222+}
2323+\value{
2424+A \code{data.table} with columns \code{topic} and \code{label}.
2525+}
2626+\description{
2727+Extract top words per topic with \code{stm::labelTopics()} and
2828+build compact labels such as \code{"3: labor, wages, unions"}.
2929+}
3030+\details{
3131+Create short human-readable topic labels from an stm model
3232+}
3333+\examples{
3434+\dontrun{
3535+# labels <- make_topic_labels(fit, type = "frex", n_words = 4)
3636+}
3737+3838+}
+15
man/mintR-package.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/mintR-package.R
33+\docType{package}
44+\name{mintR-package}
55+\alias{mintR}
66+\alias{mintR-package}
77+\title{mintR: Text mining helpers for STM workflows}
88+\description{
99+Package-level documentation for \code{mintR}.
1010+}
1111+\author{
1212+\strong{Maintainer}: Aurélien Goutsmedt \email{agoutsmedt@hotmail.fr}
1313+1414+}
1515+\keyword{internal}
+42
man/plot_topic_prevalence.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_prevalence_plots.R
33+\name{plot_topic_prevalence}
44+\alias{plot_topic_prevalence}
55+\title{Plot average topic prevalence}
66+\usage{
77+plot_topic_prevalence(
88+ stm_model,
99+ topic_labels,
1010+ out_file = NULL,
1111+ save = TRUE,
1212+ width = 25,
1313+ height = 20
1414+)
1515+}
1616+\arguments{
1717+\item{stm_model}{STM model object with \code{theta} matrix.}
1818+1919+\item{topic_labels}{Table with columns \code{topic} and \code{label}.}
2020+2121+\item{out_file}{Optional output path used when \code{save = TRUE}.}
2222+2323+\item{save}{Logical scalar, whether to save the figure.}
2424+2525+\item{width, height}{Numeric scalar dimensions in inches for saved file.}
2626+}
2727+\value{
2828+A list with \code{topic_prev} (data.table) and \code{plot} (ggplot object).
2929+}
3030+\description{
3131+Compute average prevalence from \code{theta} and draw a horizontal
3232+bar chart. Optionally save the figure.
3333+}
3434+\details{
3535+Plot average topic prevalence from an STM model
3636+}
3737+\examples{
3838+stm_model <- list(theta = matrix(c(0.2, 0.8, 0.4, 0.6), nrow = 2, byrow = TRUE))
3939+topic_labels <- data.frame(topic = 1:2, label = c("Labor", "Trade"))
4040+plot_topic_prevalence(stm_model, topic_labels, save = FALSE)
4141+4242+}
+51
man/plot_topic_prevalence_over_time.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_prevalence_plots.R
33+\name{plot_topic_prevalence_over_time}
44+\alias{plot_topic_prevalence_over_time}
55+\title{Plot topic prevalence over time}
66+\usage{
77+plot_topic_prevalence_over_time(
88+ estimations,
99+ topic_labels,
1010+ covariate = "year",
1111+ nsims = 500,
1212+ out_file = NULL,
1313+ width = 30,
1414+ height = 25,
1515+ wrap_width = 30,
1616+ base_size = 20
1717+)
1818+}
1919+\arguments{
2020+\item{estimations}{STM estimate object accepted by \code{extract_topic_effects()}.}
2121+2222+\item{topic_labels}{Table with columns \code{topic} and \code{label}.}
2323+2424+\item{covariate}{Character scalar, covariate used on x-axis.}
2525+2626+\item{nsims}{Integer scalar, number of simulation draws.}
2727+2828+\item{out_file}{Output path for \code{ggsave()}.}
2929+3030+\item{width, height}{Numeric scalar dimensions in inches for saved file.}
3131+3232+\item{wrap_width}{Integer scalar for wrapped facet labels.}
3333+3434+\item{base_size}{Numeric scalar for ggplot theme size.}
3535+}
3636+\value{
3737+An invisible list with \code{years_estimate}, \code{plot}, and \code{out_file}.
3838+}
3939+\description{
4040+Compute simulated topic effects over a covariate grid and
4141+draw faceted trajectories with confidence ribbons.
4242+}
4343+\details{
4444+Plot topic prevalence over a covariate grid
4545+}
4646+\examples{
4747+\dontrun{
4848+# plot_topic_prevalence_over_time(est, topic_labels, covariate = "year", out_file = "topic_year.png")
4949+}
5050+5151+}
+20
man/stm_term_counts.Rd
···11+% Generated by roxygen2: do not edit by hand
22+% Please edit documentation in R/topic_term_measures.R
33+\name{stm_term_counts}
44+\alias{stm_term_counts}
55+\title{STM term counts (alias)}
66+\usage{
77+stm_term_counts(stm_data)
88+}
99+\arguments{
1010+\item{stm_data}{A list-like stm object with \code{documents} and \code{vocab}.}
1111+}
1212+\value{
1313+A data.table with columns \code{term} and \code{count}.
1414+}
1515+\description{
1616+Alias to \code{compute_stm_term_counts()} for backward compatibility.
1717+}
1818+\details{
1919+Backward-compatible alias for STM term counts
2020+}