Initialize mintR package with text mining and topic modeling functions

+1

.Rbuildignore

··· 1 1 ^mintR\.Rproj$ 2 2 ^\.Rproj\.user$ 3 + ^LICENSE\.md$

+17 -5

DESCRIPTION

··· 1 1 Package: mintR 2 - Title: What the Package Does (One Line, Title Case) 2 + Title: mintR: function to mine text 3 3 Version: 0.0.0.9000 4 4 Authors@R: 5 - person("First", "Last", , "first.last@example.com", role = c("aut", "cre")) 6 - Description: What the package does (one paragraph). 7 - License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a 8 - license 5 + person("Aurélien", "Goutsmedt", , "agoutsmedt@hotmail.fr", role = c("aut", "cre")) 6 + Description: Various functions for text mining and analysis: tokenization, counting terms, extracting structural topic model effects, labelling topics, etc.. 7 + License: CC BY 4.0 9 8 Encoding: UTF-8 9 + Imports: 10 + cli, 11 + data.table, 12 + forcats, 13 + ggplot2, 14 + here, 15 + scales, 16 + stats, 17 + stm, 18 + stringr, 19 + tidytext, 20 + tokenizers, 21 + utils 10 22 Roxygen: list(markdown = TRUE) 11 23 RoxygenNote: 7.3.3

+395

LICENSE.md

··· 1 + Attribution 4.0 International 2 + 3 + ======================================================================= 4 + 5 + Creative Commons Corporation ("Creative Commons") is not a law firm and 6 + does not provide legal services or legal advice. Distribution of 7 + Creative Commons public licenses does not create a lawyer-client or 8 + other relationship. Creative Commons makes its licenses and related 9 + information available on an "as-is" basis. Creative Commons gives no 10 + warranties regarding its licenses, any material licensed under their 11 + terms and conditions, or any related information. Creative Commons 12 + disclaims all liability for damages resulting from their use to the 13 + fullest extent possible. 14 + 15 + Using Creative Commons Public Licenses 16 + 17 + Creative Commons public licenses provide a standard set of terms and 18 + conditions that creators and other rights holders may use to share 19 + original works of authorship and other material subject to copyright 20 + and certain other rights specified in the public license below. The 21 + following considerations are for informational purposes only, are not 22 + exhaustive, and do not form part of our licenses. 23 + 24 + Considerations for licensors: Our public licenses are 25 + intended for use by those authorized to give the public 26 + permission to use material in ways otherwise restricted by 27 + copyright and certain other rights. Our licenses are 28 + irrevocable. Licensors should read and understand the terms 29 + and conditions of the license they choose before applying it. 30 + Licensors should also secure all rights necessary before 31 + applying our licenses so that the public can reuse the 32 + material as expected. Licensors should clearly mark any 33 + material not subject to the license. This includes other CC- 34 + licensed material, or material used under an exception or 35 + limitation to copyright. More considerations for licensors: 36 + wiki.creativecommons.org/Considerations_for_licensors 37 + 38 + Considerations for the public: By using one of our public 39 + licenses, a licensor grants the public permission to use the 40 + licensed material under specified terms and conditions. If 41 + the licensor's permission is not necessary for any reason--for 42 + example, because of any applicable exception or limitation to 43 + copyright--then that use is not regulated by the license. Our 44 + licenses grant only permissions under copyright and certain 45 + other rights that a licensor has authority to grant. Use of 46 + the licensed material may still be restricted for other 47 + reasons, including because others have copyright or other 48 + rights in the material. A licensor may make special requests, 49 + such as asking that all changes be marked or described. 50 + Although not required by our licenses, you are encouraged to 51 + respect those requests where reasonable. More considerations 52 + for the public: 53 + wiki.creativecommons.org/Considerations_for_licensees 54 + 55 + ======================================================================= 56 + 57 + Creative Commons Attribution 4.0 International Public License 58 + 59 + By exercising the Licensed Rights (defined below), You accept and agree 60 + to be bound by the terms and conditions of this Creative Commons 61 + Attribution 4.0 International Public License ("Public License"). To the 62 + extent this Public License may be interpreted as a contract, You are 63 + granted the Licensed Rights in consideration of Your acceptance of 64 + these terms and conditions, and the Licensor grants You such rights in 65 + consideration of benefits the Licensor receives from making the 66 + Licensed Material available under these terms and conditions. 67 + 68 + 69 + Section 1 -- Definitions. 70 + 71 + a. Adapted Material means material subject to Copyright and Similar 72 + Rights that is derived from or based upon the Licensed Material 73 + and in which the Licensed Material is translated, altered, 74 + arranged, transformed, or otherwise modified in a manner requiring 75 + permission under the Copyright and Similar Rights held by the 76 + Licensor. For purposes of this Public License, where the Licensed 77 + Material is a musical work, performance, or sound recording, 78 + Adapted Material is always produced where the Licensed Material is 79 + synched in timed relation with a moving image. 80 + 81 + b. Adapter's License means the license You apply to Your Copyright 82 + and Similar Rights in Your contributions to Adapted Material in 83 + accordance with the terms and conditions of this Public License. 84 + 85 + c. Copyright and Similar Rights means copyright and/or similar rights 86 + closely related to copyright including, without limitation, 87 + performance, broadcast, sound recording, and Sui Generis Database 88 + Rights, without regard to how the rights are labeled or 89 + categorized. For purposes of this Public License, the rights 90 + specified in Section 2(b)(1)-(2) are not Copyright and Similar 91 + Rights. 92 + 93 + d. Effective Technological Measures means those measures that, in the 94 + absence of proper authority, may not be circumvented under laws 95 + fulfilling obligations under Article 11 of the WIPO Copyright 96 + Treaty adopted on December 20, 1996, and/or similar international 97 + agreements. 98 + 99 + e. Exceptions and Limitations means fair use, fair dealing, and/or 100 + any other exception or limitation to Copyright and Similar Rights 101 + that applies to Your use of the Licensed Material. 102 + 103 + f. Licensed Material means the artistic or literary work, database, 104 + or other material to which the Licensor applied this Public 105 + License. 106 + 107 + g. Licensed Rights means the rights granted to You subject to the 108 + terms and conditions of this Public License, which are limited to 109 + all Copyright and Similar Rights that apply to Your use of the 110 + Licensed Material and that the Licensor has authority to license. 111 + 112 + h. Licensor means the individual(s) or entity(ies) granting rights 113 + under this Public License. 114 + 115 + i. Share means to provide material to the public by any means or 116 + process that requires permission under the Licensed Rights, such 117 + as reproduction, public display, public performance, distribution, 118 + dissemination, communication, or importation, and to make material 119 + available to the public including in ways that members of the 120 + public may access the material from a place and at a time 121 + individually chosen by them. 122 + 123 + j. Sui Generis Database Rights means rights other than copyright 124 + resulting from Directive 96/9/EC of the European Parliament and of 125 + the Council of 11 March 1996 on the legal protection of databases, 126 + as amended and/or succeeded, as well as other essentially 127 + equivalent rights anywhere in the world. 128 + 129 + k. You means the individual or entity exercising the Licensed Rights 130 + under this Public License. Your has a corresponding meaning. 131 + 132 + 133 + Section 2 -- Scope. 134 + 135 + a. License grant. 136 + 137 + 1. Subject to the terms and conditions of this Public License, 138 + the Licensor hereby grants You a worldwide, royalty-free, 139 + non-sublicensable, non-exclusive, irrevocable license to 140 + exercise the Licensed Rights in the Licensed Material to: 141 + 142 + a. reproduce and Share the Licensed Material, in whole or 143 + in part; and 144 + 145 + b. produce, reproduce, and Share Adapted Material. 146 + 147 + 2. Exceptions and Limitations. For the avoidance of doubt, where 148 + Exceptions and Limitations apply to Your use, this Public 149 + License does not apply, and You do not need to comply with 150 + its terms and conditions. 151 + 152 + 3. Term. The term of this Public License is specified in Section 153 + 6(a). 154 + 155 + 4. Media and formats; technical modifications allowed. The 156 + Licensor authorizes You to exercise the Licensed Rights in 157 + all media and formats whether now known or hereafter created, 158 + and to make technical modifications necessary to do so. The 159 + Licensor waives and/or agrees not to assert any right or 160 + authority to forbid You from making technical modifications 161 + necessary to exercise the Licensed Rights, including 162 + technical modifications necessary to circumvent Effective 163 + Technological Measures. For purposes of this Public License, 164 + simply making modifications authorized by this Section 2(a) 165 + (4) never produces Adapted Material. 166 + 167 + 5. Downstream recipients. 168 + 169 + a. Offer from the Licensor -- Licensed Material. Every 170 + recipient of the Licensed Material automatically 171 + receives an offer from the Licensor to exercise the 172 + Licensed Rights under the terms and conditions of this 173 + Public License. 174 + 175 + b. No downstream restrictions. You may not offer or impose 176 + any additional or different terms or conditions on, or 177 + apply any Effective Technological Measures to, the 178 + Licensed Material if doing so restricts exercise of the 179 + Licensed Rights by any recipient of the Licensed 180 + Material. 181 + 182 + 6. No endorsement. Nothing in this Public License constitutes or 183 + may be construed as permission to assert or imply that You 184 + are, or that Your use of the Licensed Material is, connected 185 + with, or sponsored, endorsed, or granted official status by, 186 + the Licensor or others designated to receive attribution as 187 + provided in Section 3(a)(1)(A)(i). 188 + 189 + b. Other rights. 190 + 191 + 1. Moral rights, such as the right of integrity, are not 192 + licensed under this Public License, nor are publicity, 193 + privacy, and/or other similar personality rights; however, to 194 + the extent possible, the Licensor waives and/or agrees not to 195 + assert any such rights held by the Licensor to the limited 196 + extent necessary to allow You to exercise the Licensed 197 + Rights, but not otherwise. 198 + 199 + 2. Patent and trademark rights are not licensed under this 200 + Public License. 201 + 202 + 3. To the extent possible, the Licensor waives any right to 203 + collect royalties from You for the exercise of the Licensed 204 + Rights, whether directly or through a collecting society 205 + under any voluntary or waivable statutory or compulsory 206 + licensing scheme. In all other cases the Licensor expressly 207 + reserves any right to collect such royalties. 208 + 209 + 210 + Section 3 -- License Conditions. 211 + 212 + Your exercise of the Licensed Rights is expressly made subject to the 213 + following conditions. 214 + 215 + a. Attribution. 216 + 217 + 1. If You Share the Licensed Material (including in modified 218 + form), You must: 219 + 220 + a. retain the following if it is supplied by the Licensor 221 + with the Licensed Material: 222 + 223 + i. identification of the creator(s) of the Licensed 224 + Material and any others designated to receive 225 + attribution, in any reasonable manner requested by 226 + the Licensor (including by pseudonym if 227 + designated); 228 + 229 + ii. a copyright notice; 230 + 231 + iii. a notice that refers to this Public License; 232 + 233 + iv. a notice that refers to the disclaimer of 234 + warranties; 235 + 236 + v. a URI or hyperlink to the Licensed Material to the 237 + extent reasonably practicable; 238 + 239 + b. indicate if You modified the Licensed Material and 240 + retain an indication of any previous modifications; and 241 + 242 + c. indicate the Licensed Material is licensed under this 243 + Public License, and include the text of, or the URI or 244 + hyperlink to, this Public License. 245 + 246 + 2. You may satisfy the conditions in Section 3(a)(1) in any 247 + reasonable manner based on the medium, means, and context in 248 + which You Share the Licensed Material. For example, it may be 249 + reasonable to satisfy the conditions by providing a URI or 250 + hyperlink to a resource that includes the required 251 + information. 252 + 253 + 3. If requested by the Licensor, You must remove any of the 254 + information required by Section 3(a)(1)(A) to the extent 255 + reasonably practicable. 256 + 257 + 4. If You Share Adapted Material You produce, the Adapter's 258 + License You apply must not prevent recipients of the Adapted 259 + Material from complying with this Public License. 260 + 261 + 262 + Section 4 -- Sui Generis Database Rights. 263 + 264 + Where the Licensed Rights include Sui Generis Database Rights that 265 + apply to Your use of the Licensed Material: 266 + 267 + a. for the avoidance of doubt, Section 2(a)(1) grants You the right 268 + to extract, reuse, reproduce, and Share all or a substantial 269 + portion of the contents of the database; 270 + 271 + b. if You include all or a substantial portion of the database 272 + contents in a database in which You have Sui Generis Database 273 + Rights, then the database in which You have Sui Generis Database 274 + Rights (but not its individual contents) is Adapted Material; and 275 + 276 + c. You must comply with the conditions in Section 3(a) if You Share 277 + all or a substantial portion of the contents of the database. 278 + 279 + For the avoidance of doubt, this Section 4 supplements and does not 280 + replace Your obligations under this Public License where the Licensed 281 + Rights include other Copyright and Similar Rights. 282 + 283 + 284 + Section 5 -- Disclaimer of Warranties and Limitation of Liability. 285 + 286 + a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 287 + EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 288 + AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 289 + ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 290 + IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 291 + WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 292 + PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 293 + ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 294 + KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 295 + ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 296 + 297 + b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 298 + TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 299 + NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 300 + INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 301 + COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 302 + USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 303 + ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 304 + DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 305 + IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 306 + 307 + c. The disclaimer of warranties and limitation of liability provided 308 + above shall be interpreted in a manner that, to the extent 309 + possible, most closely approximates an absolute disclaimer and 310 + waiver of all liability. 311 + 312 + 313 + Section 6 -- Term and Termination. 314 + 315 + a. This Public License applies for the term of the Copyright and 316 + Similar Rights licensed here. However, if You fail to comply with 317 + this Public License, then Your rights under this Public License 318 + terminate automatically. 319 + 320 + b. Where Your right to use the Licensed Material has terminated under 321 + Section 6(a), it reinstates: 322 + 323 + 1. automatically as of the date the violation is cured, provided 324 + it is cured within 30 days of Your discovery of the 325 + violation; or 326 + 327 + 2. upon express reinstatement by the Licensor. 328 + 329 + For the avoidance of doubt, this Section 6(b) does not affect any 330 + right the Licensor may have to seek remedies for Your violations 331 + of this Public License. 332 + 333 + c. For the avoidance of doubt, the Licensor may also offer the 334 + Licensed Material under separate terms or conditions or stop 335 + distributing the Licensed Material at any time; however, doing so 336 + will not terminate this Public License. 337 + 338 + d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 339 + License. 340 + 341 + 342 + Section 7 -- Other Terms and Conditions. 343 + 344 + a. The Licensor shall not be bound by any additional or different 345 + terms or conditions communicated by You unless expressly agreed. 346 + 347 + b. Any arrangements, understandings, or agreements regarding the 348 + Licensed Material not stated herein are separate from and 349 + independent of the terms and conditions of this Public License. 350 + 351 + 352 + Section 8 -- Interpretation. 353 + 354 + a. For the avoidance of doubt, this Public License does not, and 355 + shall not be interpreted to, reduce, limit, restrict, or impose 356 + conditions on any use of the Licensed Material that could lawfully 357 + be made without permission under this Public License. 358 + 359 + b. To the extent possible, if any provision of this Public License is 360 + deemed unenforceable, it shall be automatically reformed to the 361 + minimum extent necessary to make it enforceable. If the provision 362 + cannot be reformed, it shall be severed from this Public License 363 + without affecting the enforceability of the remaining terms and 364 + conditions. 365 + 366 + c. No term or condition of this Public License will be waived and no 367 + failure to comply consented to unless expressly agreed to by the 368 + Licensor. 369 + 370 + d. Nothing in this Public License constitutes or may be interpreted 371 + as a limitation upon, or waiver of, any privileges and immunities 372 + that apply to the Licensor or You, including from the legal 373 + processes of any jurisdiction or authority. 374 + 375 + 376 + ======================================================================= 377 + 378 + Creative Commons is not a party to its public 379 + licenses. Notwithstanding, Creative Commons may elect to apply one of 380 + its public licenses to material it publishes and in those instances 381 + will be considered the “Licensor.” The text of the Creative Commons 382 + public licenses is dedicated to the public domain under the CC0 Public 383 + Domain Dedication. Except for the limited purpose of indicating that 384 + material is shared under a Creative Commons public license or as 385 + otherwise permitted by the Creative Commons policies published at 386 + creativecommons.org/policies, Creative Commons does not authorize the 387 + use of the trademark "Creative Commons" or any other trademark or logo 388 + of Creative Commons without its prior written consent including, 389 + without limitation, in connection with any unauthorized modifications 390 + to any of its public licenses or any other arrangements, 391 + understandings, or agreements concerning use of licensed material. For 392 + the avoidance of doubt, this paragraph does not form part of the 393 + public licenses. 394 + 395 + Creative Commons may be contacted at creativecommons.org.

+17

NAMESPACE

··· 1 1 # Generated by roxygen2: do not edit by hand 2 2 3 + export(average_frex) 4 + export(calculate_beta) 5 + export(calculate_frex) 6 + export(calculate_lift) 7 + export(calculate_score) 8 + export(compute_mean_frex) 9 + export(compute_stm_term_counts) 10 + export(compute_tf_idf) 11 + export(extract_ngrams) 12 + export(extract_top_terms) 13 + export(extract_topic_effects) 14 + export(filter_group_quantile) 15 + export(make_topic_labels) 16 + export(plot_topic_prevalence) 17 + export(plot_topic_prevalence_over_time) 18 + export(stm_term_counts) 19 + import(data.table)

+147

R/compute_tf_idf.R

··· 1 + #' Compute TF–IDF (optionally weighted) for tokens per document 2 + #' 3 + #' @title Compute TF–IDF for tokenized data 4 + #' @description Compute term frequency (TF), inverse document frequency (IDF), 5 + #' and TF–IDF from tokenized data. 6 + #' 7 + #' @param df A `data.frame` or `data.table` with token data. 8 + #' @param token_col Character scalar. Column name for token values. 9 + #' @param document_col Character scalar or vector. Column(s) defining document ID. 10 + #' @param weight_col Optional character scalar with numeric weights (for example 11 + #' aggregated token counts). If `NULL`, each row contributes weight `1`. 12 + #' 13 + #' @return A `data.table` with one row per token-document pair and computed 14 + #' columns including `corpus_tf`, `nb_doc_word`, `df`, `idf`, and `tf_idf`. 15 + #' If `weight_col` is provided, `weighted_tf` is returned instead of `tf`. 16 + #' 17 + #' @examples 18 + #' dt <- data.table::data.table( 19 + #' document = c("d1", "d1", "d2"), 20 + #' token = c("labor", "market", "labor") 21 + #' ) 22 + #' compute_tf_idf(dt) 23 + #' 24 + #' @export 25 + compute_tf_idf <- function( 26 + df, 27 + token_col = "token", 28 + document_col = "document", 29 + weight_col = NULL 30 + ) { 31 + df <- data.table::as.data.table(df) 32 + df <- data.table::copy(df) 33 + 34 + document_col <- as.character(document_col) 35 + token_col <- as.character(token_col) 36 + 37 + if (!all(document_col %in% colnames(df))) { 38 + stop( 39 + "Input must contain document column(s): ", 40 + paste(document_col, collapse = ", "), 41 + call. = FALSE 42 + ) 43 + } 44 + if (!token_col %in% colnames(df)) { 45 + stop("Input must contain token column: ", token_col, call. = FALSE) 46 + } 47 + 48 + if (!is.null(weight_col)) { 49 + weight_col <- as.character(weight_col) 50 + if (!weight_col %in% colnames(df)) { 51 + stop("weight_col '", weight_col, "' not found in input", call. = FALSE) 52 + } 53 + if (!is.numeric(df[[weight_col]])) { 54 + df[, (weight_col) := as.numeric(.SD[[1]]), .SDcols = weight_col] 55 + } 56 + if (anyNA(df[[weight_col]])) { 57 + warning("NA values found in weight_col; treating as 0 for weighting") 58 + df[is.na(get(weight_col)), (weight_col) := 0] 59 + } 60 + df[, .weight := abs(df[[weight_col]])] 61 + } else { 62 + df[, .weight := 1.0] 63 + } 64 + 65 + if (token_col %in% document_col) { 66 + stop("`token_col` must be distinct from `document_col`", call. = FALSE) 67 + } 68 + 69 + doc_key <- ".document_tmp_key" 70 + i <- 1L 71 + while (doc_key %in% colnames(df)) { 72 + doc_key <- paste0(".document_tmp_key", i) 73 + i <- i + 1L 74 + } 75 + 76 + if (length(document_col) == 1L) { 77 + df[, (doc_key) := as.character(.SD[[1]]), .SDcols = document_col] 78 + } else { 79 + df[, 80 + (doc_key) := do.call(paste, c(.SD, sep = "\r")), 81 + .SDcols = document_col 82 + ] 83 + } 84 + 85 + if (token_col != "token") { 86 + data.table::setnames(df, token_col, "token") 87 + token_was_renamed <- TRUE 88 + } else { 89 + token_was_renamed <- FALSE 90 + } 91 + 92 + corpus_tf_dt <- df[, .(corpus_tf = sum(.weight)), by = "token"] 93 + 94 + token_doc <- df[, 95 + .(token_doc_weight = sum(.weight)), 96 + by = c("token", doc_key) 97 + ] 98 + 99 + token_doc <- merge( 100 + token_doc, 101 + corpus_tf_dt, 102 + by = "token", 103 + all.x = TRUE, 104 + sort = FALSE 105 + ) 106 + 107 + doc_totals <- df[, .(nb_doc_word = sum(.weight)), by = doc_key] 108 + 109 + token_doc <- merge( 110 + token_doc, 111 + doc_totals, 112 + by = doc_key, 113 + all.x = TRUE, 114 + sort = FALSE 115 + ) 116 + 117 + token_doc[, tf := token_doc_weight / nb_doc_word] 118 + 119 + df_dt <- token_doc[, .(df = .N), by = token] 120 + total_docs <- data.table::uniqueN(token_doc[[doc_key]]) 121 + 122 + token_doc <- merge(token_doc, df_dt, by = "token", all.x = TRUE, sort = FALSE) 123 + 124 + token_doc[, idf := log(total_docs / df)] 125 + token_doc[, tf_idf := tf * idf] 126 + 127 + if (token_was_renamed) { 128 + data.table::setnames(token_doc, "token", token_col) 129 + } 130 + 131 + if (length(document_col) == 1L) { 132 + data.table::setnames(token_doc, doc_key, document_col) 133 + } else { 134 + token_doc[, 135 + (document_col) := data.table::tstrsplit(get(doc_key), "\r", fixed = TRUE) 136 + ] 137 + token_doc[, (doc_key) := NULL] 138 + } 139 + 140 + token_doc[, c("token_doc_weight", ".weight") := NULL] 141 + 142 + if (!is.null(weight_col)) { 143 + data.table::setnames(token_doc, "tf", "weighted_tf") 144 + } 145 + 146 + token_doc[] 147 + }

+268

R/extract_ngrams.R

··· 1 + #' Extract n-grams from text with basic filtering and grouped counts 2 + #' 3 + #' @title Extract n-grams from text (document-level only) 4 + #' @description Tokenise text into n-grams (unigrams, bigrams, etc.) per group, 5 + #' remove tokens containing digits or stop words, and return a compact 6 + #' data.table with one row per grouping + token + ngram and a frequency column. 7 + #' 8 + #' @param df A `data.frame` or `data.table` containing the input text. 9 + #' This function converts `df` to `data.table` in-place. 10 + #' @param ngrams Integer scalar or vector. If scalar (for example `2L`), 11 + #' values from `1:ngrams` are computed. If vector (for example `c(1L, 2L)`), 12 + #' only those sizes are computed. 13 + #' @param grouping_cols Character vector of grouping columns to keep in output. 14 + #' Use `NULL` for no grouping. 15 + #' @param text_col Character scalar. Name of the column that contains text. 16 + #' @param min_nchar Integer scalar. Minimum token length to keep. 17 + #' @param stop_words Optional character vector of stop words. 18 + #' If `NULL`, `tidytext::stop_words$word` is used. 19 + #' @param chunk_size Optional integer number of rows per processing chunk. 20 + #' Use `NULL` or `<= 0` to process in a single pass. 21 + #' 22 + #' @return A `data.table` with columns `grouping_cols`, `token`, `ngram`, 23 + #' and `doc_freq`. 24 + #' 25 + #' @details 26 + #' Tokens are lower-cased by the tokenizer. Tokens containing punctuation, 27 + #' digits, symbols, or stop words are removed. Multi-word tokens in output 28 + #' are normalized with underscores (`_`). 29 + #' 30 + #' @examples 31 + #' df <- data.frame(id = 1:2, text = c("Economic growth matters", "Growth and trade")) 32 + #' extract_ngrams(df, ngrams = 2L, grouping_cols = "id") 33 + #' 34 + #' @export 35 + extract_ngrams <- function( 36 + df, 37 + ngrams = 2L, 38 + grouping_cols = NULL, 39 + text_col = "text", 40 + min_nchar = 2L, 41 + stop_words = NULL, 42 + chunk_size = NULL 43 + ) { 44 + if (!requireNamespace("tokenizers", quietly = TRUE)) { 45 + stop("Please install the 'tokenizers' package.") 46 + } 47 + if (!requireNamespace("tidytext", quietly = TRUE)) { 48 + stop("Please install the 'tidytext' package.") 49 + } 50 + if (!requireNamespace("stringr", quietly = TRUE)) { 51 + stop("Please install the 'stringr' package.") 52 + } 53 + 54 + has_cli <- requireNamespace("cli", quietly = TRUE) 55 + cli_info <- function(...) if (has_cli) cli::cli_alert_info(...) else NULL 56 + cli_success <- function(...) { 57 + if (has_cli) cli::cli_alert_success(...) else NULL 58 + } 59 + 60 + cli_info("Starting n-gram extraction (document-level)") 61 + 62 + data.table::setDT(df) 63 + 64 + if (is.null(stop_words)) { 65 + stop_words <- unique(tidytext::stop_words$word) 66 + } 67 + stop_words <- tolower(stop_words) 68 + 69 + required_cols <- unique(c(grouping_cols, text_col)) 70 + missing_cols <- setdiff(required_cols, names(df)) 71 + if (length(missing_cols) > 0L) { 72 + stop( 73 + "Missing required columns in df: ", 74 + paste(missing_cols, collapse = ", ") 75 + ) 76 + } 77 + 78 + if (length(ngrams) == 1L) { 79 + ng_range <- seq_len(as.integer(ngrams)) 80 + } else { 81 + ng_range <- as.integer(ngrams) 82 + } 83 + 84 + if (is.null(grouping_cols)) { 85 + grouping_cols <- character(0) 86 + } 87 + 88 + make_empty_result <- function() { 89 + cols <- c(grouping_cols, "token", "ngram", "doc_freq") 90 + dt <- data.table::data.table() 91 + for (nm in cols) { 92 + dt[, (nm) := list(NULL)] 93 + } 94 + dt 95 + } 96 + 97 + nrows <- nrow(df) 98 + if (nrows == 0L) { 99 + return(make_empty_result()) 100 + } 101 + 102 + process_chunked <- function( 103 + df_subset, 104 + ng_range, 105 + grouping_cols, 106 + text_col, 107 + chunk_size = NULL 108 + ) { 109 + nrows_local <- nrow(df_subset) 110 + tokens_acc_list <- list() 111 + list_idx <- 0L 112 + 113 + if (!is.null(chunk_size) && chunk_size > 0L) { 114 + total_chunks <- ceiling(nrows_local / chunk_size) 115 + if (has_cli) { 116 + pb <- cli::cli_progress_bar("Processing chunks", total = total_chunks) 117 + on.exit(cli::cli_progress_done(), add = TRUE) 118 + } 119 + chunk_i <- 0L 120 + for (start in seq(1L, nrows_local, by = chunk_size)) { 121 + chunk_i <- chunk_i + 1L 122 + end <- min(start + chunk_size - 1L, nrows_local) 123 + if (has_cli) { 124 + cli::cli_progress_update() 125 + cli::cli_alert_info( 126 + "Processing chunk {chunk_i} of {total_chunks} (rows {start}-{end})" 127 + ) 128 + } 129 + chunk <- df_subset[start:end, c(grouping_cols, text_col), with = FALSE] 130 + for (n in ng_range) { 131 + tmp <- data.table::copy(chunk) 132 + tmp[, 133 + token := tokenizers::tokenize_ngrams( 134 + get(text_col), 135 + n = as.integer(n), 136 + lowercase = TRUE 137 + ) 138 + ] 139 + tmp[, (text_col) := NULL] 140 + tmp[, ngram := as.integer(n)] 141 + tmp_unnest <- tmp[, 142 + .(token = unlist(token)), 143 + by = c(grouping_cols, "ngram") 144 + ] 145 + if (nrow(tmp_unnest) == 0L) { 146 + next 147 + } 148 + chunk_counts <- tmp_unnest[, 149 + .(doc_freq = .N), 150 + by = c(grouping_cols, "token", "ngram") 151 + ] 152 + list_idx <- list_idx + 1L 153 + tokens_acc_list[[list_idx]] <- chunk_counts 154 + rm(tmp, tmp_unnest, chunk_counts) 155 + gc() 156 + } 157 + rm(chunk) 158 + gc() 159 + } 160 + if (has_cli) { 161 + cli_success(sprintf( 162 + "Completed chunked processing (%d chunks)", 163 + total_chunks 164 + )) 165 + } 166 + } else { 167 + for (n in ng_range) { 168 + tmp <- df_subset[, c(grouping_cols, text_col), with = FALSE] 169 + tmp[, 170 + token := tokenizers::tokenize_ngrams( 171 + get(text_col), 172 + n = as.integer(n), 173 + lowercase = TRUE 174 + ) 175 + ] 176 + tmp[, (text_col) := NULL] 177 + tmp[, ngram := as.integer(n)] 178 + tmp_unnest <- tmp[, 179 + .(token = unlist(token)), 180 + by = c(grouping_cols, "ngram") 181 + ] 182 + if (nrow(tmp_unnest) == 0L) { 183 + next 184 + } 185 + chunk_counts <- tmp_unnest[, 186 + .(doc_freq = .N), 187 + by = c(grouping_cols, "token", "ngram") 188 + ] 189 + list_idx <- list_idx + 1L 190 + tokens_acc_list[[list_idx]] <- chunk_counts 191 + rm(tmp, tmp_unnest, chunk_counts, tmp) 192 + gc() 193 + } 194 + if (has_cli) cli_success("Completed processing all n-gram sizes") 195 + } 196 + 197 + if (length(tokens_acc_list) == 0L) { 198 + return(NULL) 199 + } 200 + 201 + combined <- data.table::rbindlist( 202 + tokens_acc_list, 203 + use.names = TRUE, 204 + fill = TRUE 205 + ) 206 + combined[, 207 + .(doc_freq = sum(doc_freq)), 208 + by = c(grouping_cols, "token", "ngram") 209 + ] 210 + } 211 + 212 + tokens_acc <- process_chunked( 213 + df, 214 + ng_range, 215 + grouping_cols, 216 + text_col, 217 + chunk_size = chunk_size 218 + ) 219 + 220 + if (is.null(tokens_acc) || nrow(tokens_acc) == 0L) { 221 + return(make_empty_result()) 222 + } 223 + 224 + unique_tokens <- unique(tokens_acc$token) 225 + unique_tokens <- as.character(unique_tokens) 226 + 227 + keep_idx <- nchar(unique_tokens) >= as.integer(min_nchar) 228 + 229 + if (any(keep_idx)) { 230 + toks_to_check <- unique_tokens[keep_idx] 231 + } else { 232 + return(make_empty_result()) 233 + } 234 + 235 + toks_to_check <- stringr::str_squish(toks_to_check) 236 + 237 + pattern_special <- stringr::regex("[^\\p{L} ]+", dotall = FALSE) 238 + has_special <- stringr::str_detect(toks_to_check, pattern_special) 239 + has_special <- has_special | stringr::str_detect(toks_to_check, "_") 240 + 241 + if (length(stop_words) == 0L) { 242 + has_stop <- rep(FALSE, length(toks_to_check)) 243 + } else { 244 + sw_esc <- stringr::str_replace_all( 245 + stop_words, 246 + "([\\^$.|?*+()\\[\\]{}\\\\])", 247 + "\\\\\\1" 248 + ) 249 + pattern <- paste0("\\b(", paste(sw_esc, collapse = "|"), ")\\b") 250 + has_stop <- stringr::str_detect(toks_to_check, pattern) 251 + } 252 + 253 + keep_mask <- !has_stop & !has_special 254 + keep_tokens <- toks_to_check[keep_mask] 255 + 256 + if (length(keep_tokens) == 0L) { 257 + return(make_empty_result()) 258 + } 259 + 260 + tokens_acc <- tokens_acc[token %in% keep_tokens] 261 + tokens_acc[, token := stringr::str_replace_all(token, " ", "_")] 262 + 263 + final_cols <- c(grouping_cols, "token", "ngram", "doc_freq") 264 + final_cols <- final_cols[final_cols %in% names(tokens_acc)] 265 + data.table::setcolorder(tokens_acc, final_cols) 266 + 267 + tokens_acc[] 268 + }

+55

R/filter_group_quantile.R

··· 1 + #' Compute and apply a per-group quantile threshold to filter rows 2 + #' 3 + #' @title Filter rows by per-group quantile threshold 4 + #' @description Computes a group-specific quantile threshold from a numeric 5 + #' metric and returns rows above that threshold. 6 + #' 7 + #' @param dt A `data.frame` or `data.table`. 8 + #' @param group_cols Character vector of grouping columns. 9 + #' @param metric Character scalar. Numeric metric column to threshold. 10 + #' @param probs Numeric scalar in `(0, 1)` passed to `stats::quantile()`. 11 + #' @param inclusive Logical scalar. If `TRUE`, uses `>=`; else uses `>`. 12 + #' @param keep_threshold Logical scalar. If `TRUE`, includes the `threshold` 13 + #' column in the result. 14 + #' 15 + #' @return A filtered `data.table`. 16 + #' 17 + #' @examples 18 + #' dt <- data.table::data.table(group = c("a", "a", "b"), tf = c(1, 3, 2)) 19 + #' filter_group_quantile(dt, group_cols = "group", metric = "tf", probs = 0.5) 20 + #' 21 + #' @export 22 + filter_group_quantile <- function( 23 + dt, 24 + group_cols, 25 + metric = "tf", 26 + probs = 0.75, 27 + inclusive = FALSE, 28 + keep_threshold = FALSE 29 + ) { 30 + if (!data.table::is.data.table(dt)) { 31 + dt <- data.table::as.data.table(dt) 32 + } 33 + 34 + dt[, 35 + threshold := as.numeric(stats::quantile( 36 + .SD[[1]], 37 + probs = probs, 38 + na.rm = TRUE 39 + )), 40 + by = group_cols, 41 + .SDcols = metric 42 + ] 43 + 44 + if (inclusive) { 45 + res <- dt[get(metric) >= threshold] 46 + } else { 47 + res <- dt[get(metric) > threshold] 48 + } 49 + 50 + if (!keep_threshold) { 51 + res[, threshold := NULL] 52 + } 53 + 54 + res[] 55 + }

+34

R/globals.R

··· 1 + # Global variable declarations for data.table NSE 2 + # These silence R CMD check NOTES about undefined global variables. 3 + .datatable.aware <- TRUE 4 + 5 + if (getRversion() >= "2.15.1") { 6 + utils::globalVariables(c( 7 + ".N", 8 + ".SD", 9 + "threshold", 10 + "topic", 11 + "term", 12 + "token", 13 + "ngram", 14 + "doc_freq", 15 + "count", 16 + "term_index", 17 + "label", 18 + "prevalence", 19 + "covariate.value", 20 + "estimate", 21 + "ci.lower", 22 + "ci.upper", 23 + "topic_label", 24 + "value", 25 + "rank", 26 + "measure", 27 + "beta", 28 + "frex", 29 + "score", 30 + "lift", 31 + "mean", 32 + "label_short" 33 + )) 34 + }

+49

R/make_topic_labels.R

··· 1 + #' Create short human-readable topic labels from an stm model 2 + #' 3 + #' @title Create short topic labels from an `stm` model 4 + #' @description Extract top words per topic with `stm::labelTopics()` and 5 + #' build compact labels such as `"3: labor, wages, unions"`. 6 + #' 7 + #' @param stm_model Fitted `stm` model object. 8 + #' @param type Label type from `c("prob", "frex", "lift", "score")`. 9 + #' @param n_words Integer scalar. Number of words per label. 10 + #' @param sep Character scalar used between words. 11 + #' 12 + #' @return A `data.table` with columns `topic` and `label`. 13 + #' 14 + #' @examples 15 + #' \dontrun{ 16 + #' # labels <- make_topic_labels(fit, type = "frex", n_words = 4) 17 + #' } 18 + #' 19 + #' @export 20 + make_topic_labels <- function( 21 + stm_model, 22 + type = c("prob", "frex", "lift", "score"), 23 + n_words = 3, 24 + sep = ", " 25 + ) { 26 + type <- match.arg(type) 27 + if (n_words < 1) { 28 + stop("n_words must be >= 1") 29 + } 30 + 31 + lt <- stm::labelTopics(stm_model, n = n_words) 32 + mat <- lt[[type]] 33 + 34 + dt <- data.table::as.data.table(mat) 35 + data.table::setnames(dt, paste0("w", seq_len(ncol(dt)))) 36 + dt[, topic := .I] 37 + 38 + sdcols <- paste0("w", seq_len(n_words)) 39 + dt[, 40 + label_short := apply(.SD, 1, function(x) { 41 + paste0(stats::na.omit(x), collapse = sep) 42 + }), 43 + .SDcols = sdcols 44 + ] 45 + 46 + dt[, label := paste0(topic, ": ", label_short)] 47 + topic_labels <- dt[, .(topic, label)] 48 + topic_labels 49 + }

+7

R/mintR-package.R

··· 1 + #' mintR: Text mining helpers for STM workflows 2 + #' 3 + #' Package-level documentation for `mintR`. 4 + #' 5 + #' @keywords internal 6 + #' @import data.table 7 + "_PACKAGE"

+124

R/topic_effects.R

··· 1 + #' Extract simulated topic effects from an STM estimate object 2 + #' 3 + #' @title Extract topic effects from STM simulations 4 + #' @description Simulate topic effects for a covariate from an STM 5 + #' `estimateEffect()`-like object, and return tidy estimates with uncertainty. 6 + #' 7 + #' @param x STM estimate-like object (e.g. from `estimateEffect()`). 8 + #' @param covariate Character scalar, covariate name. 9 + #' @param model Optional STM model object used for labels. 10 + #' @param topics Integer vector of topics to extract. 11 + #' @param method Either `"pointestimate"` or `"difference"`. 12 + #' @param cov.value1,cov.value2 Values used when `method = "difference"`. 13 + #' @param moderator Optional moderator variable name. 14 + #' @param moderator.value Optional moderator value. 15 + #' @param npoints Integer scalar, number of grid points for point estimates. 16 + #' @param nsims Integer scalar, number of simulation draws. 17 + #' @param ci.level Numeric scalar in (0, 1). 18 + #' @param custom.labels Optional topic labels. 19 + #' @param labeltype Label type passed to STM internals. 20 + #' @param n,frexw Additional label parameters passed to STM internals. 21 + #' @return A tidy data.table of simulated topic effects. 22 + #' 23 + #' @details 24 + #' This function relies on non-exported STM internals (`stm:::`) to generate 25 + #' the contrast matrix, simulate coefficients, and produce labels. 26 + #' 27 + #' @examples 28 + #' \dontrun{ 29 + #' # effects <- extract_topic_effects(est, covariate = "year", nsims = 200) 30 + #' } 31 + #' 32 + #' @export 33 + extract_topic_effects <- function( 34 + x, 35 + covariate, 36 + model = NULL, 37 + topics = x$topics, 38 + method = "pointestimate", 39 + cov.value1 = NULL, 40 + cov.value2 = NULL, 41 + moderator = NULL, 42 + moderator.value = NULL, 43 + npoints = 500, 44 + nsims = 500, 45 + ci.level = 0.95, 46 + custom.labels = NULL, 47 + labeltype = "numbers", 48 + n = 7, 49 + frexw = 0.5 50 + ) { 51 + cthis <- stm:::produce_cmatrix( 52 + prep = x, 53 + covariate = covariate, 54 + method = method, 55 + cov.value1 = cov.value1, 56 + cov.value2 = cov.value2, 57 + moderator = moderator, 58 + npoints = npoints, 59 + moderator.value = moderator.value 60 + ) 61 + 62 + simbetas <- stm:::simBetas(parameters = x$parameters, nsims = nsims) 63 + uvals <- cthis$cdata[[covariate]] 64 + offset <- (1 - ci.level) / 2 65 + 66 + labels <- stm:::createLabels( 67 + labeltype = labeltype, 68 + covariate = covariate, 69 + method = method, 70 + cdata = cthis$cdata, 71 + cov.value1 = cov.value1, 72 + cov.value2 = cov.value2, 73 + model = model, 74 + n = n, 75 + topics = x$topics, 76 + custom.labels = custom.labels, 77 + frexw = frexw 78 + ) 79 + 80 + out <- lapply(topics, function(i) { 81 + sims <- cthis$cmatrix %*% t(simbetas[[which(x$topics == i)]]) 82 + 83 + if (method == "difference") { 84 + diff <- sims[1, ] - sims[2, ] 85 + out_inner <- data.frame( 86 + method = method, 87 + topic = i, 88 + covariate = covariate, 89 + covariate.value = paste0(cov.value1, "-", cov.value2), 90 + estimate = mean(diff), 91 + std.error = stats::sd(diff), 92 + ci.level = ci.level, 93 + ci.lower = stats::quantile(diff, offset), 94 + ci.upper = stats::quantile(diff, 1 - offset), 95 + label = labels[which(x$topics == i)] 96 + ) 97 + } else { 98 + out_inner <- data.frame( 99 + method = method, 100 + topic = i, 101 + covariate = covariate, 102 + covariate.value = uvals, 103 + estimate = rowMeans(sims), 104 + std.error = apply(sims, 1, stats::sd), 105 + ci.level = ci.level, 106 + ci.lower = apply(sims, 1, stats::quantile, probs = offset), 107 + ci.upper = apply(sims, 1, stats::quantile, probs = (1 - offset)), 108 + label = labels[which(x$topics == i)] 109 + ) 110 + } 111 + 112 + if (!is.null(moderator)) { 113 + out_inner$moderator <- moderator 114 + out_inner$moderator.value <- moderator.value 115 + } 116 + 117 + rownames(out_inner) <- NULL 118 + out_inner 119 + }) 120 + 121 + out <- do.call("rbind", out) 122 + data.table::setDT(out) 123 + out 124 + }

+187

R/topic_prevalence_plots.R

··· 1 + #' Plot average topic prevalence from an STM model 2 + #' 3 + #' @title Plot average topic prevalence 4 + #' @description Compute average prevalence from `theta` and draw a horizontal 5 + #' bar chart. Optionally save the figure. 6 + #' 7 + #' @param stm_model STM model object with `theta` matrix. 8 + #' @param topic_labels Table with columns `topic` and `label`. 9 + #' @param out_file Optional output path used when `save = TRUE`. 10 + #' @param save Logical scalar, whether to save the figure. 11 + #' @param width,height Numeric scalar dimensions in inches for saved file. 12 + #' @return A list with `topic_prev` (data.table) and `plot` (ggplot object). 13 + #' 14 + #' @examples 15 + #' stm_model <- list(theta = matrix(c(0.2, 0.8, 0.4, 0.6), nrow = 2, byrow = TRUE)) 16 + #' topic_labels <- data.frame(topic = 1:2, label = c("Labor", "Trade")) 17 + #' plot_topic_prevalence(stm_model, topic_labels, save = FALSE) 18 + #' 19 + #' @export 20 + plot_topic_prevalence <- function( 21 + stm_model, 22 + topic_labels, 23 + out_file = NULL, 24 + save = TRUE, 25 + width = 25, 26 + height = 20 27 + ) { 28 + if (is.null(stm_model$theta)) { 29 + stop( 30 + "`stm_model$theta` not found. Fit STM with theta available before plotting.", 31 + call. = FALSE 32 + ) 33 + } 34 + 35 + theta <- stm_model$theta 36 + 37 + topic_prev <- data.table::data.table( 38 + topic = seq_len(ncol(theta)), 39 + prevalence = colMeans(theta) 40 + ) 41 + 42 + topic_labels_dt <- data.table::as.data.table(topic_labels) 43 + topic_prev <- merge(topic_prev, topic_labels_dt, by = "topic", all.x = TRUE) 44 + data.table::setorder(topic_prev, -prevalence) 45 + 46 + p <- ggplot2::ggplot( 47 + topic_prev, 48 + ggplot2::aes(x = forcats::fct_reorder(label, prevalence), y = prevalence) 49 + ) + 50 + ggplot2::geom_col() + 51 + ggplot2::coord_flip() + 52 + ggplot2::scale_y_continuous( 53 + labels = scales::percent_format(), 54 + expand = c(0, 0) 55 + ) + 56 + ggplot2::labs( 57 + x = "Topic (top words)", 58 + y = "Average topic prevalence", 59 + title = "Topic prevalence (average across docs)" 60 + ) + 61 + ggplot2::theme_minimal(base_size = 20) 62 + 63 + if (isTRUE(save)) { 64 + if (is.null(out_file)) { 65 + out_file <- here::here( 66 + "figures", 67 + "topic_model", 68 + "stm_target_journals_topic_prevalence.png" 69 + ) 70 + } 71 + 72 + ggplot2::ggsave( 73 + plot = p, 74 + filename = out_file, 75 + width = width, 76 + height = height, 77 + units = "in", 78 + dpi = 300 79 + ) 80 + } 81 + 82 + list(topic_prev = topic_prev, plot = p) 83 + } 84 + 85 + #' Plot topic prevalence over a covariate grid 86 + #' 87 + #' @title Plot topic prevalence over time 88 + #' @description Compute simulated topic effects over a covariate grid and 89 + #' draw faceted trajectories with confidence ribbons. 90 + #' 91 + #' @param estimations STM estimate object accepted by `extract_topic_effects()`. 92 + #' @param topic_labels Table with columns `topic` and `label`. 93 + #' @param covariate Character scalar, covariate used on x-axis. 94 + #' @param nsims Integer scalar, number of simulation draws. 95 + #' @param out_file Output path for `ggsave()`. 96 + #' @param width,height Numeric scalar dimensions in inches for saved file. 97 + #' @param wrap_width Integer scalar for wrapped facet labels. 98 + #' @param base_size Numeric scalar for ggplot theme size. 99 + #' @return An invisible list with `years_estimate`, `plot`, and `out_file`. 100 + #' 101 + #' @examples 102 + #' \dontrun{ 103 + #' # plot_topic_prevalence_over_time(est, topic_labels, covariate = "year", out_file = "topic_year.png") 104 + #' } 105 + #' 106 + #' @export 107 + plot_topic_prevalence_over_time <- function( 108 + estimations, 109 + topic_labels, 110 + covariate = "year", 111 + nsims = 500, 112 + out_file = NULL, 113 + width = 30, 114 + height = 25, 115 + wrap_width = 30, 116 + base_size = 20 117 + ) { 118 + years_estimate <- extract_topic_effects( 119 + x = estimations, 120 + covariate = covariate, 121 + method = "pointestimate", 122 + npoints = 500, 123 + nsims = nsims 124 + ) 125 + 126 + years_estimate <- merge( 127 + years_estimate, 128 + topic_labels, 129 + by = "topic", 130 + all.x = TRUE 131 + ) 132 + 133 + if ("label.y" %in% names(years_estimate)) { 134 + data.table::setnames(years_estimate, "label.y", "topic_label") 135 + } else if ("label" %in% names(years_estimate)) { 136 + data.table::setnames(years_estimate, "label", "topic_label") 137 + } 138 + 139 + if ("label.x" %in% names(years_estimate)) { 140 + years_estimate[, label.x := NULL] 141 + } 142 + 143 + years_estimate[, topic := as.integer(topic)] 144 + years_estimate[, 145 + topic_label := stringr::str_wrap( 146 + as.character(topic_label), 147 + width = wrap_width 148 + ) 149 + ] 150 + years_estimate <- years_estimate[order(topic)] 151 + years_estimate[, 152 + topic_label := factor(topic_label, levels = unique(topic_label)) 153 + ] 154 + 155 + p <- ggplot2::ggplot( 156 + years_estimate, 157 + ggplot2::aes(x = as.numeric(covariate.value), y = estimate) 158 + ) + 159 + ggplot2::geom_line() + 160 + ggplot2::geom_ribbon( 161 + ggplot2::aes(ymin = ci.lower, ymax = ci.upper), 162 + alpha = 0.4 163 + ) + 164 + ggplot2::labs(x = NULL, y = "Estimated topic prevalence over time") + 165 + ggplot2::facet_wrap(~topic_label) + 166 + ggplot2::scale_x_continuous(expand = c(0, 0)) + 167 + ggplot2::scale_y_continuous( 168 + expand = c(0, 0), 169 + labels = scales::percent_format() 170 + ) + 171 + ggplot2::theme_minimal(base_size = base_size) 172 + 173 + ggplot2::ggsave( 174 + plot = p, 175 + filename = out_file, 176 + width = width, 177 + height = height, 178 + units = "in", 179 + dpi = 300 180 + ) 181 + 182 + invisible(list( 183 + years_estimate = years_estimate, 184 + plot = p, 185 + out_file = out_file 186 + )) 187 + }

+542

R/topic_term_measures.R

··· 1 + #' Calculate FREX scores and return top terms per topic 2 + #' 3 + #' @title Calculate FREX top terms 4 + #' @description Compute FREX scores from `model$beta$logbeta[[1]]` and return 5 + #' top terms per topic. 6 + #' 7 + #' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`. 8 + #' @param nb_terms Integer scalar, number of terms to return per topic. 9 + #' @param w Numeric scalar in \[0, 1\], FREX weighting parameter. 10 + #' @return A data.table with top FREX terms per topic. 11 + #' 12 + #' @examples 13 + # ' logbeta <- matrix(rnorm(10), nrow = 5) 14 + # ' mdl <- list(beta = list(logbeta = list(logbeta)), vocab = paste0("w", 1:5)) 15 + # ' calculate_frex(mdl, nb_terms = 3) 16 + #' 17 + #' @export 18 + calculate_frex <- function(model, nb_terms = 10L, w = 0.5) { 19 + if ( 20 + !is.list(model) || 21 + is.null(model$beta) || 22 + is.null(model$beta$logbeta) || 23 + length(model$beta$logbeta) < 1 || 24 + is.null(model$vocab) 25 + ) { 26 + stop("`model` must contain `beta$logbeta[[1]]` and `vocab`.", call. = FALSE) 27 + } 28 + if (!is.numeric(w) || length(w) != 1L || w < 0 || w > 1) { 29 + stop("`w` must be a numeric scalar in [0, 1].", call. = FALSE) 30 + } 31 + 32 + nb_terms <- as.integer(nb_terms) 33 + if (is.na(nb_terms) || nb_terms < 1L) { 34 + stop("`nb_terms` must be an integer >= 1.", call. = FALSE) 35 + } 36 + 37 + logbeta <- model$beta$logbeta[[1]] 38 + if (!is.matrix(logbeta) || nrow(logbeta) == 0L || ncol(logbeta) == 0L) { 39 + return(data.table::data.table( 40 + topic = integer(0), 41 + term = character(0), 42 + frex = numeric(0), 43 + rank = integer(0), 44 + mean = numeric(0) 45 + )) 46 + } 47 + 48 + col_lse <- function(mat) { 49 + apply(mat, 2, function(x) { 50 + x_max <- max(x) 51 + x_max + log(sum(exp(x - x_max))) 52 + }) 53 + } 54 + excl <- t(t(logbeta) - col_lse(logbeta)) 55 + 56 + freqscore <- apply(logbeta, 1, data.table::frank) / ncol(logbeta) 57 + exclscore <- apply(excl, 1, data.table::frank) / ncol(logbeta) 58 + 59 + if (!identical(dim(freqscore), dim(logbeta))) { 60 + freqscore <- t(freqscore) 61 + } 62 + if (!identical(dim(exclscore), dim(logbeta))) { 63 + exclscore <- t(exclscore) 64 + } 65 + 66 + frex_mat <- 1 / (w / freqscore + (1 - w) / exclscore) 67 + 68 + frex_dt <- data.table::as.data.table(t(frex_mat)) 69 + data.table::setnames(frex_dt, paste0("V", seq_len(ncol(frex_dt)))) 70 + frex_dt[, term := as.character(model$vocab)] 71 + 72 + frex_long <- data.table::melt( 73 + frex_dt, 74 + id.vars = "term", 75 + variable.name = "topic", 76 + value.name = "frex", 77 + variable.factor = FALSE 78 + ) 79 + frex_long[, topic := as.integer(sub("^V", "", topic))] 80 + 81 + data.table::setorder(frex_long, topic, -frex) 82 + top_dt <- frex_long[, head(.SD, nb_terms), by = topic] 83 + 84 + top_dt[, rank := seq_len(.N), by = topic] 85 + top_dt[, mean := mean(frex), by = topic] 86 + 87 + top_dt[, .( 88 + topic = as.integer(topic), 89 + term = as.character(term), 90 + frex = as.numeric(frex), 91 + rank = as.integer(rank), 92 + mean = as.numeric(mean) 93 + )] 94 + } 95 + 96 + #' Compute mean FREX across topics 97 + #' 98 + #' @title Compute mean FREX 99 + #' @description Compute the average of per-topic mean FREX values. 100 + #' 101 + #' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`. 102 + #' @param nb_terms Integer scalar, number of terms per topic used for averaging. 103 + #' @param w Numeric scalar in \[0, 1], FREX weighting parameter. 104 + #' @return Numeric scalar mean FREX across topics. 105 + #' @export 106 + compute_mean_frex <- function(model, nb_terms = 10L, w = 0.5) { 107 + frex_dt <- calculate_frex(model, nb_terms = nb_terms, w = w) 108 + mean(unique(frex_dt[, .(topic, mean)])$mean) 109 + } 110 + 111 + #' Backward-compatible alias for mean FREX 112 + #' 113 + #' @title Average FREX (alias) 114 + #' @description Alias to `compute_mean_frex()` for backward compatibility. 115 + #' 116 + #' @inheritParams compute_mean_frex 117 + #' @return Numeric scalar mean FREX across topics. 118 + #' @export 119 + average_frex <- function(model, nb_terms = 10L, w = 0.5) { 120 + compute_mean_frex(model = model, nb_terms = nb_terms, w = w) 121 + } 122 + 123 + #' Extract top terms per topic by beta 124 + #' 125 + #' @title Calculate beta top terms 126 + #' @description Convert log-probabilities to probabilities and return top terms 127 + #' per topic. 128 + #' 129 + #' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`. 130 + #' @param nb_terms Integer scalar, number of terms to return per topic. 131 + #' @return A data.table with top beta terms per topic. 132 + #' @export 133 + calculate_beta <- function(model, nb_terms = 10L) { 134 + if ( 135 + !is.list(model) || 136 + is.null(model$beta) || 137 + is.null(model$beta$logbeta) || 138 + length(model$beta$logbeta) < 1 || 139 + is.null(model$vocab) 140 + ) { 141 + stop("`model` must contain `beta$logbeta[[1]]` and `vocab`.", call. = FALSE) 142 + } 143 + 144 + nb_terms <- as.integer(nb_terms) 145 + if (is.na(nb_terms) || nb_terms < 1L) { 146 + stop("`nb_terms` must be an integer >= 1.", call. = FALSE) 147 + } 148 + 149 + logbeta <- model$beta$logbeta[[1]] 150 + if (!is.matrix(logbeta) || nrow(logbeta) == 0L || ncol(logbeta) == 0L) { 151 + return(data.table::data.table( 152 + topic = integer(0), 153 + term = character(0), 154 + beta = numeric(0), 155 + rank = integer(0) 156 + )) 157 + } 158 + 159 + beta_mat <- exp(logbeta) 160 + 161 + beta_dt <- data.table::as.data.table(t(beta_mat)) 162 + data.table::setnames(beta_dt, paste0("V", seq_len(ncol(beta_dt)))) 163 + beta_dt[, term := as.character(model$vocab)] 164 + 165 + beta_long <- data.table::melt( 166 + beta_dt, 167 + id.vars = "term", 168 + variable.name = "topic", 169 + value.name = "beta", 170 + variable.factor = FALSE 171 + ) 172 + beta_long[, topic := as.integer(sub("^V", "", topic))] 173 + 174 + data.table::setorder(beta_long, topic, -beta) 175 + top_dt <- beta_long[, head(.SD, nb_terms), by = topic] 176 + top_dt[, rank := seq_len(.N), by = topic] 177 + 178 + top_dt[, .( 179 + topic = as.integer(topic), 180 + term = as.character(term), 181 + beta = as.numeric(beta), 182 + rank = as.integer(rank) 183 + )] 184 + } 185 + 186 + #' Calculate LDA-style score for terms per topic 187 + #' 188 + #' @title Calculate score top terms 189 + #' @description Compute an LDA-style score and return top terms by topic. 190 + #' 191 + #' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`. 192 + #' @param nb_terms Integer scalar, number of terms to return per topic. 193 + #' @return A data.table with top score terms per topic. 194 + #' @export 195 + calculate_score <- function(model, nb_terms = 10L) { 196 + if ( 197 + !is.list(model) || 198 + is.null(model$beta) || 199 + is.null(model$beta$logbeta) || 200 + length(model$beta$logbeta) < 1 || 201 + is.null(model$vocab) 202 + ) { 203 + stop("`model` must contain `beta$logbeta[[1]]` and `vocab`.", call. = FALSE) 204 + } 205 + 206 + nb_terms <- as.integer(nb_terms) 207 + if (is.na(nb_terms) || nb_terms < 1L) { 208 + stop("`nb_terms` must be an integer >= 1.", call. = FALSE) 209 + } 210 + 211 + logbeta <- model$beta$logbeta[[1]] 212 + if (!is.matrix(logbeta) || nrow(logbeta) == 0L || ncol(logbeta) == 0L) { 213 + return(data.table::data.table( 214 + topic = integer(0), 215 + term = character(0), 216 + score = numeric(0), 217 + rank = integer(0) 218 + )) 219 + } 220 + 221 + col_means <- colMeans(logbeta) 222 + score_mat <- exp(logbeta) * 223 + (logbeta - 224 + matrix( 225 + col_means, 226 + nrow = nrow(logbeta), 227 + ncol = ncol(logbeta), 228 + byrow = TRUE 229 + )) 230 + 231 + score_dt <- data.table::as.data.table(t(score_mat)) 232 + data.table::setnames(score_dt, paste0("V", seq_len(ncol(score_dt)))) 233 + score_dt[, term := as.character(model$vocab)] 234 + 235 + score_long <- data.table::melt( 236 + score_dt, 237 + id.vars = "term", 238 + variable.name = "topic", 239 + value.name = "score", 240 + variable.factor = FALSE 241 + ) 242 + score_long[, topic := as.integer(sub("^V", "", topic))] 243 + 244 + data.table::setorder(score_long, topic, -score) 245 + top_dt <- score_long[, head(.SD, nb_terms), by = topic] 246 + top_dt[, rank := seq_len(.N), by = topic] 247 + 248 + top_dt[, .( 249 + topic = as.integer(topic), 250 + term = as.character(term), 251 + score = as.numeric(score), 252 + rank = as.integer(rank) 253 + )] 254 + } 255 + 256 + #' Compute STM term counts from `documents` 257 + #' 258 + #' @title Compute STM term counts 259 + #' @description Aggregate token counts from `stm_data$documents` into a term 260 + #' count table. 261 + #' 262 + #' @param stm_data A list-like stm object with `documents` and `vocab`. 263 + #' @return A data.table with columns `term` and `count`. 264 + #' @export 265 + compute_stm_term_counts <- function(stm_data) { 266 + if (is.null(stm_data$documents) || length(stm_data$documents) == 0) { 267 + return(data.table::data.table(term = character(), count = integer())) 268 + } 269 + 270 + doc_term_list <- lapply(stm_data$documents, function(doc) { 271 + if (is.null(doc) || length(doc) == 0) { 272 + return(NULL) 273 + } 274 + m <- as.matrix(doc) 275 + data.table::data.table( 276 + term_index = as.integer(m[1, ]), 277 + count = as.integer(m[2, ]) 278 + ) 279 + }) 280 + 281 + doc_term_list <- Filter(Negate(is.null), doc_term_list) 282 + if (length(doc_term_list) == 0L) { 283 + return(data.table::data.table(term = character(), count = integer())) 284 + } 285 + 286 + term_counts <- data.table::rbindlist(doc_term_list) 287 + term_counts <- term_counts[, .(count = sum(count)), by = term_index] 288 + term_counts[, term := stm_data$vocab[term_index]] 289 + term_counts[, .(term, count)][order(-count)] 290 + } 291 + 292 + #' Backward-compatible alias for STM term counts 293 + #' 294 + #' @title STM term counts (alias) 295 + #' @description Alias to `compute_stm_term_counts()` for backward compatibility. 296 + #' 297 + #' @inheritParams compute_stm_term_counts 298 + #' @return A data.table with columns `term` and `count`. 299 + #' @export 300 + stm_term_counts <- function(stm_data) { 301 + compute_stm_term_counts(stm_data) 302 + } 303 + 304 + #' Calculate Lift measure from STM model and empirical term counts 305 + #' 306 + #' @title Calculate lift top terms 307 + #' @description Compute lift as log topic probability relative to empirical term 308 + #' frequency and return top terms by topic. 309 + #' 310 + #' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`. 311 + #' @param list_terms data.frame/data.table with columns `term` and `count`. 312 + #' @param model_data Optional STM data; used if `list_terms` is NULL. 313 + #' @param nb_terms Integer scalar, number of terms to return per topic. 314 + #' @return A data.table with top lift terms per topic. 315 + #' @export 316 + calculate_lift <- function( 317 + model, 318 + list_terms = NULL, 319 + model_data = NULL, 320 + nb_terms = 10L 321 + ) { 322 + if ( 323 + !is.list(model) || 324 + is.null(model$beta) || 325 + is.null(model$beta$logbeta) || 326 + length(model$beta$logbeta) < 1 || 327 + is.null(model$vocab) 328 + ) { 329 + stop("`model` must contain `beta$logbeta[[1]]` and `vocab`.", call. = FALSE) 330 + } 331 + 332 + nb_terms <- as.integer(nb_terms) 333 + if (is.na(nb_terms) || nb_terms < 1L) { 334 + stop("`nb_terms` must be an integer >= 1.", call. = FALSE) 335 + } 336 + 337 + logbeta <- model$beta$logbeta[[1]] 338 + if (!is.matrix(logbeta) || nrow(logbeta) == 0L || ncol(logbeta) == 0L) { 339 + return(data.table::data.table( 340 + topic = integer(0), 341 + term = character(0), 342 + lift = numeric(0), 343 + rank = integer(0) 344 + )) 345 + } 346 + 347 + if (is.null(list_terms)) { 348 + if (!is.null(model_data)) { 349 + cli::cli_alert_info( 350 + "`list_terms` is NULL. Calculating counts with `compute_stm_term_counts()`." 351 + ) 352 + lt_dt <- compute_stm_term_counts(model_data) 353 + } else { 354 + cli::cli_abort( 355 + "`list_terms` is NULL and `model_data` is also NULL. Cannot compute empirical term counts." 356 + ) 357 + } 358 + } else { 359 + lt_dt <- data.table::as.data.table(list_terms) 360 + if (!all(c("term", "count") %in% colnames(lt_dt))) { 361 + stop( 362 + "`list_terms` must contain columns 'term' and 'count'.", 363 + call. = FALSE 364 + ) 365 + } 366 + lt_dt <- lt_dt[, .(count = sum(as.numeric(count))), by = term] 367 + } 368 + 369 + vocab <- as.character(model$vocab) 370 + counts_aligned <- lt_dt[match(vocab, lt_dt$term), count] 371 + counts_aligned[is.na(counts_aligned)] <- 0 372 + total_counts <- sum(counts_aligned, na.rm = TRUE) 373 + 374 + if (total_counts == 0) { 375 + return(data.table::data.table( 376 + topic = integer(0), 377 + term = character(0), 378 + lift = numeric(0), 379 + rank = integer(0) 380 + )) 381 + } 382 + 383 + if (nrow(logbeta) == length(vocab)) { 384 + logbeta_terms <- logbeta 385 + } else if (ncol(logbeta) == length(vocab)) { 386 + logbeta_terms <- t(logbeta) 387 + } else { 388 + stop( 389 + "Cannot align logbeta and vocab: neither rows nor columns match vocab length.", 390 + call. = FALSE 391 + ) 392 + } 393 + 394 + emp_log <- log(counts_aligned) - log(total_counts) 395 + emp_log_mat <- matrix( 396 + emp_log, 397 + nrow = nrow(logbeta_terms), 398 + ncol = ncol(logbeta_terms), 399 + byrow = FALSE 400 + ) 401 + 402 + lift_mat <- logbeta_terms - emp_log_mat 403 + 404 + lift_dt_wide <- data.table::as.data.table(lift_mat) 405 + data.table::setnames(lift_dt_wide, paste0("V", seq_len(ncol(lift_dt_wide)))) 406 + lift_dt_wide[, term := vocab] 407 + 408 + lift_long <- data.table::melt( 409 + lift_dt_wide, 410 + id.vars = "term", 411 + variable.name = "topic", 412 + value.name = "lift", 413 + variable.factor = FALSE 414 + ) 415 + lift_long[, topic := as.integer(sub("^V", "", topic))] 416 + 417 + data.table::setorder(lift_long, topic, -lift) 418 + top_dt <- lift_long[, head(.SD, nb_terms), by = topic] 419 + top_dt[, rank := seq_len(.N), by = topic] 420 + 421 + top_dt[, .( 422 + topic = as.integer(topic), 423 + term = as.character(term), 424 + lift = as.numeric(lift), 425 + rank = as.integer(rank) 426 + )] 427 + } 428 + 429 + #' Extract top terms across selected measures 430 + #' 431 + #' @title Extract top terms across measures 432 + #' @description Combine top terms from one or several measures (`frex`, `beta`, 433 + #' `score`, `lift`) into one tidy table. 434 + #' 435 + #' @param model STM-like model with `beta$logbeta[[1]]` and `vocab`. 436 + #' @param list_terms Optional empirical term counts (`term`, `count`). 437 + #' @param model_data Optional STM data used when `list_terms` is NULL for lift. 438 + #' @param nb_terms Integer scalar, number of terms per topic and measure. 439 + #' @param frex_weight Numeric scalar in \[0, 1\] for FREX weighting. 440 + #' @param measures Character vector of measures among `frex`, `beta`, `score`, `lift`. 441 + #' @return A data.table with columns `topic`, `term`, `value`, `rank`, `measure`. 442 + #' 443 + #' @examples 444 + # ' logbeta <- matrix(rnorm(10), nrow = 5) 445 + # ' mdl <- list(beta = list(logbeta = list(logbeta)), vocab = paste0("w", 1:5)) 446 + # ' extract_top_terms(mdl, nb_terms = 2, measures = c("beta", "score")) 447 + #' 448 + #' @export 449 + extract_top_terms <- function( 450 + model, 451 + list_terms = NULL, 452 + model_data = NULL, 453 + nb_terms = 10L, 454 + frex_weight = 0.5, 455 + measures = c("frex", "beta", "score", "lift") 456 + ) { 457 + allowed_measures <- c("frex", "beta", "score", "lift") 458 + measures <- tolower(as.character(measures)) 459 + 460 + if (length(measures) == 0L) { 461 + stop( 462 + "`measures` must contain at least one measure from: frex, beta, score, lift.", 463 + call. = FALSE 464 + ) 465 + } 466 + if (!all(measures %in% allowed_measures)) { 467 + stop( 468 + "Invalid measure(s): ", 469 + paste(setdiff(measures, allowed_measures), collapse = ", "), 470 + ". Allowed: frex, beta, score, lift.", 471 + call. = FALSE 472 + ) 473 + } 474 + 475 + nb_terms <- as.integer(nb_terms) 476 + if (is.na(nb_terms) || nb_terms < 1L) { 477 + stop("`nb_terms` must be an integer >= 1.", call. = FALSE) 478 + } 479 + 480 + results_list <- list() 481 + 482 + if ("frex" %in% measures) { 483 + frex_dt <- calculate_frex(model, nb_terms = nb_terms, w = frex_weight) 484 + if (nrow(frex_dt) > 0L) { 485 + frex_dt[, measure := "frex"] 486 + data.table::setnames(frex_dt, "frex", "value") 487 + if ("mean" %in% names(frex_dt)) { 488 + frex_dt[, mean := NULL] 489 + } 490 + frex_dt <- frex_dt[, .(topic, term, value, rank, measure)] 491 + } 492 + results_list[["frex"]] <- frex_dt 493 + } 494 + 495 + if ("beta" %in% measures) { 496 + beta_dt <- calculate_beta(model, nb_terms = nb_terms) 497 + if (nrow(beta_dt) > 0L) { 498 + beta_dt[, measure := "beta"] 499 + data.table::setnames(beta_dt, "beta", "value") 500 + beta_dt <- beta_dt[, .(topic, term, value, rank, measure)] 501 + } 502 + results_list[["beta"]] <- beta_dt 503 + } 504 + 505 + if ("score" %in% measures) { 506 + score_dt <- calculate_score(model, nb_terms = nb_terms) 507 + if (nrow(score_dt) > 0L) { 508 + score_dt[, measure := "score"] 509 + data.table::setnames(score_dt, "score", "value") 510 + score_dt <- score_dt[, .(topic, term, value, rank, measure)] 511 + } 512 + results_list[["score"]] <- score_dt 513 + } 514 + 515 + if ("lift" %in% measures) { 516 + lift_dt <- calculate_lift( 517 + model = model, 518 + list_terms = list_terms, 519 + model_data = model_data, 520 + nb_terms = nb_terms 521 + ) 522 + if (nrow(lift_dt) > 0L) { 523 + lift_dt[, measure := "lift"] 524 + data.table::setnames(lift_dt, "lift", "value") 525 + lift_dt <- lift_dt[, .(topic, term, value, rank, measure)] 526 + } 527 + results_list[["lift"]] <- lift_dt 528 + } 529 + 530 + present <- Filter(function(x) !is.null(x), results_list) 531 + if (length(present) == 0L) { 532 + return(data.table::data.table( 533 + topic = integer(0), 534 + term = character(0), 535 + value = numeric(0), 536 + rank = integer(0), 537 + measure = character(0) 538 + )) 539 + } 540 + 541 + data.table::rbindlist(present, use.names = TRUE, fill = TRUE) 542 + }

+24

man/average_frex.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{average_frex} 4 + \alias{average_frex} 5 + \title{Average FREX (alias)} 6 + \usage{ 7 + average_frex(model, nb_terms = 10L, w = 0.5) 8 + } 9 + \arguments{ 10 + \item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.} 11 + 12 + \item{nb_terms}{Integer scalar, number of terms per topic used for averaging.} 13 + 14 + \item{w}{Numeric scalar in [0, 1], FREX weighting parameter.} 15 + } 16 + \value{ 17 + Numeric scalar mean FREX across topics. 18 + } 19 + \description{ 20 + Alias to \code{compute_mean_frex()} for backward compatibility. 21 + } 22 + \details{ 23 + Backward-compatible alias for mean FREX 24 + }

+23

man/calculate_beta.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{calculate_beta} 4 + \alias{calculate_beta} 5 + \title{Calculate beta top terms} 6 + \usage{ 7 + calculate_beta(model, nb_terms = 10L) 8 + } 9 + \arguments{ 10 + \item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.} 11 + 12 + \item{nb_terms}{Integer scalar, number of terms to return per topic.} 13 + } 14 + \value{ 15 + A data.table with top beta terms per topic. 16 + } 17 + \description{ 18 + Convert log-probabilities to probabilities and return top terms 19 + per topic. 20 + } 21 + \details{ 22 + Extract top terms per topic by beta 23 + }

+25

man/calculate_frex.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{calculate_frex} 4 + \alias{calculate_frex} 5 + \title{Calculate FREX top terms} 6 + \usage{ 7 + calculate_frex(model, nb_terms = 10L, w = 0.5) 8 + } 9 + \arguments{ 10 + \item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.} 11 + 12 + \item{nb_terms}{Integer scalar, number of terms to return per topic.} 13 + 14 + \item{w}{Numeric scalar in [0, 1], FREX weighting parameter.} 15 + } 16 + \value{ 17 + A data.table with top FREX terms per topic. 18 + } 19 + \description{ 20 + Compute FREX scores from \code{model$beta$logbeta[[1]]} and return 21 + top terms per topic. 22 + } 23 + \details{ 24 + Calculate FREX scores and return top terms per topic 25 + }

+27

man/calculate_lift.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{calculate_lift} 4 + \alias{calculate_lift} 5 + \title{Calculate lift top terms} 6 + \usage{ 7 + calculate_lift(model, list_terms = NULL, model_data = NULL, nb_terms = 10L) 8 + } 9 + \arguments{ 10 + \item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.} 11 + 12 + \item{list_terms}{data.frame/data.table with columns \code{term} and \code{count}.} 13 + 14 + \item{model_data}{Optional STM data; used if \code{list_terms} is NULL.} 15 + 16 + \item{nb_terms}{Integer scalar, number of terms to return per topic.} 17 + } 18 + \value{ 19 + A data.table with top lift terms per topic. 20 + } 21 + \description{ 22 + Compute lift as log topic probability relative to empirical term 23 + frequency and return top terms by topic. 24 + } 25 + \details{ 26 + Calculate Lift measure from STM model and empirical term counts 27 + }

+22

man/calculate_score.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{calculate_score} 4 + \alias{calculate_score} 5 + \title{Calculate score top terms} 6 + \usage{ 7 + calculate_score(model, nb_terms = 10L) 8 + } 9 + \arguments{ 10 + \item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.} 11 + 12 + \item{nb_terms}{Integer scalar, number of terms to return per topic.} 13 + } 14 + \value{ 15 + A data.table with top score terms per topic. 16 + } 17 + \description{ 18 + Compute an LDA-style score and return top terms by topic. 19 + } 20 + \details{ 21 + Calculate LDA-style score for terms per topic 22 + }

+24

man/compute_mean_frex.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{compute_mean_frex} 4 + \alias{compute_mean_frex} 5 + \title{Compute mean FREX} 6 + \usage{ 7 + compute_mean_frex(model, nb_terms = 10L, w = 0.5) 8 + } 9 + \arguments{ 10 + \item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.} 11 + 12 + \item{nb_terms}{Integer scalar, number of terms per topic used for averaging.} 13 + 14 + \item{w}{Numeric scalar in [0, 1], FREX weighting parameter.} 15 + } 16 + \value{ 17 + Numeric scalar mean FREX across topics. 18 + } 19 + \description{ 20 + Compute the average of per-topic mean FREX values. 21 + } 22 + \details{ 23 + Compute mean FREX across topics 24 + }

+21

man/compute_stm_term_counts.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{compute_stm_term_counts} 4 + \alias{compute_stm_term_counts} 5 + \title{Compute STM term counts} 6 + \usage{ 7 + compute_stm_term_counts(stm_data) 8 + } 9 + \arguments{ 10 + \item{stm_data}{A list-like stm object with \code{documents} and \code{vocab}.} 11 + } 12 + \value{ 13 + A data.table with columns \code{term} and \code{count}. 14 + } 15 + \description{ 16 + Aggregate token counts from \code{stm_data$documents} into a term 17 + count table. 18 + } 19 + \details{ 20 + Compute STM term counts from \code{documents} 21 + }

+43

man/compute_tf_idf.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/compute_tf_idf.R 3 + \name{compute_tf_idf} 4 + \alias{compute_tf_idf} 5 + \title{Compute TF–IDF for tokenized data} 6 + \usage{ 7 + compute_tf_idf( 8 + df, 9 + token_col = "token", 10 + document_col = "document", 11 + weight_col = NULL 12 + ) 13 + } 14 + \arguments{ 15 + \item{df}{A \code{data.frame} or \code{data.table} with token data.} 16 + 17 + \item{token_col}{Character scalar. Column name for token values.} 18 + 19 + \item{document_col}{Character scalar or vector. Column(s) defining document ID.} 20 + 21 + \item{weight_col}{Optional character scalar with numeric weights (for example 22 + aggregated token counts). If \code{NULL}, each row contributes weight \code{1}.} 23 + } 24 + \value{ 25 + A \code{data.table} with one row per token-document pair and computed 26 + columns including \code{corpus_tf}, \code{nb_doc_word}, \code{df}, \code{idf}, and \code{tf_idf}. 27 + If \code{weight_col} is provided, \code{weighted_tf} is returned instead of \code{tf}. 28 + } 29 + \description{ 30 + Compute term frequency (TF), inverse document frequency (IDF), 31 + and TF–IDF from tokenized data. 32 + } 33 + \details{ 34 + Compute TF–IDF (optionally weighted) for tokens per document 35 + } 36 + \examples{ 37 + dt <- data.table::data.table( 38 + document = c("d1", "d1", "d2"), 39 + token = c("labor", "market", "labor") 40 + ) 41 + compute_tf_idf(dt) 42 + 43 + }

+58

man/extract_ngrams.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/extract_ngrams.R 3 + \name{extract_ngrams} 4 + \alias{extract_ngrams} 5 + \title{Extract n-grams from text (document-level only)} 6 + \usage{ 7 + extract_ngrams( 8 + df, 9 + ngrams = 2L, 10 + grouping_cols = NULL, 11 + text_col = "text", 12 + min_nchar = 2L, 13 + stop_words = NULL, 14 + chunk_size = NULL 15 + ) 16 + } 17 + \arguments{ 18 + \item{df}{A \code{data.frame} or \code{data.table} containing the input text. 19 + This function converts \code{df} to \code{data.table} in-place.} 20 + 21 + \item{ngrams}{Integer scalar or vector. If scalar (for example \code{2L}), 22 + values from \code{1:ngrams} are computed. If vector (for example \code{c(1L, 2L)}), 23 + only those sizes are computed.} 24 + 25 + \item{grouping_cols}{Character vector of grouping columns to keep in output. 26 + Use \code{NULL} for no grouping.} 27 + 28 + \item{text_col}{Character scalar. Name of the column that contains text.} 29 + 30 + \item{min_nchar}{Integer scalar. Minimum token length to keep.} 31 + 32 + \item{stop_words}{Optional character vector of stop words. 33 + If \code{NULL}, \code{tidytext::stop_words$word} is used.} 34 + 35 + \item{chunk_size}{Optional integer number of rows per processing chunk. 36 + Use \code{NULL} or \verb{<= 0} to process in a single pass.} 37 + } 38 + \value{ 39 + A \code{data.table} with columns \code{grouping_cols}, \code{token}, \code{ngram}, 40 + and \code{doc_freq}. 41 + } 42 + \description{ 43 + Tokenise text into n-grams (unigrams, bigrams, etc.) per group, 44 + remove tokens containing digits or stop words, and return a compact 45 + data.table with one row per grouping + token + ngram and a frequency column. 46 + } 47 + \details{ 48 + Extract n-grams from text with basic filtering and grouped counts 49 + 50 + Tokens are lower-cased by the tokenizer. Tokens containing punctuation, 51 + digits, symbols, or stop words are removed. Multi-word tokens in output 52 + are normalized with underscores (\verb{_}). 53 + } 54 + \examples{ 55 + df <- data.frame(id = 1:2, text = c("Economic growth matters", "Growth and trade")) 56 + extract_ngrams(df, ngrams = 2L, grouping_cols = "id") 57 + 58 + }

+38

man/extract_top_terms.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{extract_top_terms} 4 + \alias{extract_top_terms} 5 + \title{Extract top terms across measures} 6 + \usage{ 7 + extract_top_terms( 8 + model, 9 + list_terms = NULL, 10 + model_data = NULL, 11 + nb_terms = 10L, 12 + frex_weight = 0.5, 13 + measures = c("frex", "beta", "score", "lift") 14 + ) 15 + } 16 + \arguments{ 17 + \item{model}{STM-like model with \code{beta$logbeta[[1]]} and \code{vocab}.} 18 + 19 + \item{list_terms}{Optional empirical term counts (\code{term}, \code{count}).} 20 + 21 + \item{model_data}{Optional STM data used when \code{list_terms} is NULL for lift.} 22 + 23 + \item{nb_terms}{Integer scalar, number of terms per topic and measure.} 24 + 25 + \item{frex_weight}{Numeric scalar in [0, 1] for FREX weighting.} 26 + 27 + \item{measures}{Character vector of measures among \code{frex}, \code{beta}, \code{score}, \code{lift}.} 28 + } 29 + \value{ 30 + A data.table with columns \code{topic}, \code{term}, \code{value}, \code{rank}, \code{measure}. 31 + } 32 + \description{ 33 + Combine top terms from one or several measures (\code{frex}, \code{beta}, 34 + \code{score}, \code{lift}) into one tidy table. 35 + } 36 + \details{ 37 + Extract top terms across selected measures 38 + }

+73

man/extract_topic_effects.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_effects.R 3 + \name{extract_topic_effects} 4 + \alias{extract_topic_effects} 5 + \title{Extract topic effects from STM simulations} 6 + \usage{ 7 + extract_topic_effects( 8 + x, 9 + covariate, 10 + model = NULL, 11 + topics = x$topics, 12 + method = "pointestimate", 13 + cov.value1 = NULL, 14 + cov.value2 = NULL, 15 + moderator = NULL, 16 + moderator.value = NULL, 17 + npoints = 500, 18 + nsims = 500, 19 + ci.level = 0.95, 20 + custom.labels = NULL, 21 + labeltype = "numbers", 22 + n = 7, 23 + frexw = 0.5 24 + ) 25 + } 26 + \arguments{ 27 + \item{x}{STM estimate-like object (e.g. from \code{estimateEffect()}).} 28 + 29 + \item{covariate}{Character scalar, covariate name.} 30 + 31 + \item{model}{Optional STM model object used for labels.} 32 + 33 + \item{topics}{Integer vector of topics to extract.} 34 + 35 + \item{method}{Either \code{"pointestimate"} or \code{"difference"}.} 36 + 37 + \item{cov.value1, cov.value2}{Values used when \code{method = "difference"}.} 38 + 39 + \item{moderator}{Optional moderator variable name.} 40 + 41 + \item{moderator.value}{Optional moderator value.} 42 + 43 + \item{npoints}{Integer scalar, number of grid points for point estimates.} 44 + 45 + \item{nsims}{Integer scalar, number of simulation draws.} 46 + 47 + \item{ci.level}{Numeric scalar in (0, 1).} 48 + 49 + \item{custom.labels}{Optional topic labels.} 50 + 51 + \item{labeltype}{Label type passed to STM internals.} 52 + 53 + \item{n, frexw}{Additional label parameters passed to STM internals.} 54 + } 55 + \value{ 56 + A tidy data.table of simulated topic effects. 57 + } 58 + \description{ 59 + Simulate topic effects for a covariate from an STM 60 + \code{estimateEffect()}-like object, and return tidy estimates with uncertainty. 61 + } 62 + \details{ 63 + Extract simulated topic effects from an STM estimate object 64 + 65 + This function relies on non-exported STM internals (\verb{stm:::}) to generate 66 + the contrast matrix, simulate coefficients, and produce labels. 67 + } 68 + \examples{ 69 + \dontrun{ 70 + # effects <- extract_topic_effects(est, covariate = "year", nsims = 200) 71 + } 72 + 73 + }

+44

man/filter_group_quantile.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/filter_group_quantile.R 3 + \name{filter_group_quantile} 4 + \alias{filter_group_quantile} 5 + \title{Filter rows by per-group quantile threshold} 6 + \usage{ 7 + filter_group_quantile( 8 + dt, 9 + group_cols, 10 + metric = "tf", 11 + probs = 0.75, 12 + inclusive = FALSE, 13 + keep_threshold = FALSE 14 + ) 15 + } 16 + \arguments{ 17 + \item{dt}{A \code{data.frame} or \code{data.table}.} 18 + 19 + \item{group_cols}{Character vector of grouping columns.} 20 + 21 + \item{metric}{Character scalar. Numeric metric column to threshold.} 22 + 23 + \item{probs}{Numeric scalar in \verb{(0, 1)} passed to \code{stats::quantile()}.} 24 + 25 + \item{inclusive}{Logical scalar. If \code{TRUE}, uses \code{>=}; else uses \code{>}.} 26 + 27 + \item{keep_threshold}{Logical scalar. If \code{TRUE}, includes the \code{threshold} 28 + column in the result.} 29 + } 30 + \value{ 31 + A filtered \code{data.table}. 32 + } 33 + \description{ 34 + Computes a group-specific quantile threshold from a numeric 35 + metric and returns rows above that threshold. 36 + } 37 + \details{ 38 + Compute and apply a per-group quantile threshold to filter rows 39 + } 40 + \examples{ 41 + dt <- data.table::data.table(group = c("a", "a", "b"), tf = c(1, 3, 2)) 42 + filter_group_quantile(dt, group_cols = "group", metric = "tf", probs = 0.5) 43 + 44 + }

+38

man/make_topic_labels.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/make_topic_labels.R 3 + \name{make_topic_labels} 4 + \alias{make_topic_labels} 5 + \title{Create short topic labels from an \code{stm} model} 6 + \usage{ 7 + make_topic_labels( 8 + stm_model, 9 + type = c("prob", "frex", "lift", "score"), 10 + n_words = 3, 11 + sep = ", " 12 + ) 13 + } 14 + \arguments{ 15 + \item{stm_model}{Fitted \code{stm} model object.} 16 + 17 + \item{type}{Label type from \code{c("prob", "frex", "lift", "score")}.} 18 + 19 + \item{n_words}{Integer scalar. Number of words per label.} 20 + 21 + \item{sep}{Character scalar used between words.} 22 + } 23 + \value{ 24 + A \code{data.table} with columns \code{topic} and \code{label}. 25 + } 26 + \description{ 27 + Extract top words per topic with \code{stm::labelTopics()} and 28 + build compact labels such as \code{"3: labor, wages, unions"}. 29 + } 30 + \details{ 31 + Create short human-readable topic labels from an stm model 32 + } 33 + \examples{ 34 + \dontrun{ 35 + # labels <- make_topic_labels(fit, type = "frex", n_words = 4) 36 + } 37 + 38 + }

+15

man/mintR-package.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/mintR-package.R 3 + \docType{package} 4 + \name{mintR-package} 5 + \alias{mintR} 6 + \alias{mintR-package} 7 + \title{mintR: Text mining helpers for STM workflows} 8 + \description{ 9 + Package-level documentation for \code{mintR}. 10 + } 11 + \author{ 12 + \strong{Maintainer}: Aurélien Goutsmedt \email{agoutsmedt@hotmail.fr} 13 + 14 + } 15 + \keyword{internal}

+42

man/plot_topic_prevalence.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_prevalence_plots.R 3 + \name{plot_topic_prevalence} 4 + \alias{plot_topic_prevalence} 5 + \title{Plot average topic prevalence} 6 + \usage{ 7 + plot_topic_prevalence( 8 + stm_model, 9 + topic_labels, 10 + out_file = NULL, 11 + save = TRUE, 12 + width = 25, 13 + height = 20 14 + ) 15 + } 16 + \arguments{ 17 + \item{stm_model}{STM model object with \code{theta} matrix.} 18 + 19 + \item{topic_labels}{Table with columns \code{topic} and \code{label}.} 20 + 21 + \item{out_file}{Optional output path used when \code{save = TRUE}.} 22 + 23 + \item{save}{Logical scalar, whether to save the figure.} 24 + 25 + \item{width, height}{Numeric scalar dimensions in inches for saved file.} 26 + } 27 + \value{ 28 + A list with \code{topic_prev} (data.table) and \code{plot} (ggplot object). 29 + } 30 + \description{ 31 + Compute average prevalence from \code{theta} and draw a horizontal 32 + bar chart. Optionally save the figure. 33 + } 34 + \details{ 35 + Plot average topic prevalence from an STM model 36 + } 37 + \examples{ 38 + stm_model <- list(theta = matrix(c(0.2, 0.8, 0.4, 0.6), nrow = 2, byrow = TRUE)) 39 + topic_labels <- data.frame(topic = 1:2, label = c("Labor", "Trade")) 40 + plot_topic_prevalence(stm_model, topic_labels, save = FALSE) 41 + 42 + }

+51

man/plot_topic_prevalence_over_time.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_prevalence_plots.R 3 + \name{plot_topic_prevalence_over_time} 4 + \alias{plot_topic_prevalence_over_time} 5 + \title{Plot topic prevalence over time} 6 + \usage{ 7 + plot_topic_prevalence_over_time( 8 + estimations, 9 + topic_labels, 10 + covariate = "year", 11 + nsims = 500, 12 + out_file = NULL, 13 + width = 30, 14 + height = 25, 15 + wrap_width = 30, 16 + base_size = 20 17 + ) 18 + } 19 + \arguments{ 20 + \item{estimations}{STM estimate object accepted by \code{extract_topic_effects()}.} 21 + 22 + \item{topic_labels}{Table with columns \code{topic} and \code{label}.} 23 + 24 + \item{covariate}{Character scalar, covariate used on x-axis.} 25 + 26 + \item{nsims}{Integer scalar, number of simulation draws.} 27 + 28 + \item{out_file}{Output path for \code{ggsave()}.} 29 + 30 + \item{width, height}{Numeric scalar dimensions in inches for saved file.} 31 + 32 + \item{wrap_width}{Integer scalar for wrapped facet labels.} 33 + 34 + \item{base_size}{Numeric scalar for ggplot theme size.} 35 + } 36 + \value{ 37 + An invisible list with \code{years_estimate}, \code{plot}, and \code{out_file}. 38 + } 39 + \description{ 40 + Compute simulated topic effects over a covariate grid and 41 + draw faceted trajectories with confidence ribbons. 42 + } 43 + \details{ 44 + Plot topic prevalence over a covariate grid 45 + } 46 + \examples{ 47 + \dontrun{ 48 + # plot_topic_prevalence_over_time(est, topic_labels, covariate = "year", out_file = "topic_year.png") 49 + } 50 + 51 + }

+20

man/stm_term_counts.Rd

··· 1 + % Generated by roxygen2: do not edit by hand 2 + % Please edit documentation in R/topic_term_measures.R 3 + \name{stm_term_counts} 4 + \alias{stm_term_counts} 5 + \title{STM term counts (alias)} 6 + \usage{ 7 + stm_term_counts(stm_data) 8 + } 9 + \arguments{ 10 + \item{stm_data}{A list-like stm object with \code{documents} and \code{vocab}.} 11 + } 12 + \value{ 13 + A data.table with columns \code{term} and \code{count}. 14 + } 15 + \description{ 16 + Alias to \code{compute_stm_term_counts()} for backward compatibility. 17 + } 18 + \details{ 19 + Backward-compatible alias for STM term counts 20 + }

Configure Feed

Configure Feed