==============================
Release Notes for LSA 0.67
August 1, 2012
==============================


# # # # # # # # # # # # # # # # # # # # # # # #
Open Issues
# # # # # #


Wishlist & Unresolved
---------------------

-- Robert Koblischke:

  * svd(...,sparse=SVDLIBCInterface("/home/fwild/svdlibc"))
    where SVDLIBCInterface is a subclass of SparseSVDInterface that
    provides methods for transforming the matrix, computing the SVD,
    and transforming the SVD into the R-LSA structure.

---

///

  * error handling for empty files (no term docs)?
  * error handling for empty textvectors?
  * bugfix necessary: textmatrix with controlled vocabulary
    fails if no term is left
  * maybe GF*IDF boundaries instead of frequency boundaries
  * generalise architecture with text-processing chain
  * input document sanitizing routines or at least a
    testing environment that can tell which files will
    produce errors.
  * corpora package: global weights should always come from
    the original textmatrix, especially when folding in
    additional texts (see essay scoring example).
  * corpora package: should decide automatically whether the
    input is a textfile, a directory, or a string.
  * normalized local weights! (Phillip Türtscher)
  * integrate with tm package
  * add phrase detection to textvector function ("my phrase"),
    should not strip any special chars and should stay case
    sensitive (Neal Snider, Stanford)
  * rewrite fold_in

      dqs = t(docvecs) %*% LSAspace$tk %*% diag(LSAspace$sk)
      dtm = LSAspace$tk %*% diag(LSAspace$sk) %*% t(dqs)

    with

      crossprod(x,y) = t(x) %*% y
      tcrossprod(x,y) = x %*% t(y)

  * replace 0 with . in the print routine
  * check: are html special entities removed in the xml removal, e.g.
’


# # # # # # # # # # # # # # # # # # # # # # # #
Changes
# # # #


Changes in 0.70 to 0.72
---------------

o Fridolin Wild (2014-03-30)
  * fixed smaller warnings, including binding problem of
    textvector variables alnumx and specialchars,
    renamed HISTORY to ChangeLog, tm dependency declared
    as Suggests

Changes in 0.69
---------------

o Fridolin Wild (2014-03-28)
  * fixed remaining SnowballC dependencies
  * fixed documentation, namespace

Changes in 0.68
---------------

o Fridolin Wild (2014-03-21)
  * fixed SnowballC dependency
  * added Arabic stopword list stopwords_ar

Changes in 0.67
---------------

o Fridolin Wild (2012-08-01)
  * fixed some warnings for the package build

Changes in 0.66
---------------

o Fridolin Wild (2012-07-23)
  * added textvector support for Vietnamese and Polish
    (thanks to Grażyna Paliwoda-Pękosz, Cracow University
    of Economics, for the Polish request, and to Hien Pham
    for the Vietnamese)

Changes in 0.65
---------------

o Fridolin Wild (2010-10-07)
  * new Dutch stopword list (thanks to Adriana Berlanga,
    Open University of the Netherlands, and Jan Hensgens,
    AURUS)

Changes in 0.64
---------------

o Fridolin Wild (2009-09-14)
  * associate.R and cosine.R updated to deal with subsets,
    thanks to Yue Shan, National Cheng Kung University,
    Taiwan

Changes in 0.63
---------------

o Fridolin Wild (2009-09-04)
  * Rstem replaced with Snowball (in textmatrix, query),
    thanks to Kurt Hornik.
  * patched test routines to work on all OS

Changes in 0.62
---------------

o Fridolin Wild (2009-05-28)
  * French stopwords added (thanks to Haykel Demnati,
    ISG Tunis)

Changes in 0.61
---------------

o Fridolin Wild (2008-11-26)
  * phrase detection added (thanks to Eileen Hlavka, PRGS,
    RAND Corporation, and Neal Snider, Stanford).
    So far, phrases can be provided as a character vector;
    the textvector routine changes them into the format
    word1_word2_word3 (and the like) to replace them in
    the texts. When phrases are used, underscores are no
    longer stripped from the texts!
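The phrase handling described for 0.61 can be illustrated roughly as
follows. This is a minimal sketch, not the package's implementation:
replace_phrases is a hypothetical helper name, and literal (fixed)
matching is an assumption made for the example.

```r
# Hypothetical helper, NOT part of the lsa package: rewrites each known
# phrase as a single underscore-joined token (word1_word2_word3).
replace_phrases <- function(text, phrases) {
  for (p in phrases) {
    # Build the joined token, e.g. "vector space" -> "vector_space".
    token <- gsub(" ", "_", p, fixed = TRUE)
    # Assumption: phrases are matched literally, not as regular expressions.
    text <- gsub(p, token, text, fixed = TRUE)
  }
  text
}

txt <- "latent semantic analysis builds a vector space model"
out <- replace_phrases(txt, c("latent semantic analysis", "vector space"))
print(out)
```

In this sketch the order of the phrases matters: once a shorter phrase
has been joined, a longer phrase containing it no longer matches, so
longer phrases would need to be listed first.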

Changes in 0.60
---------------

o Fridolin Wild (2008-09-04)
  * fixed bug in dimcalc_share when using extremely small
    dimensions (e.g. 2).

o Fridolin Wild (2008-08-31)
  * fixed encoding problems on Windows

o Kurt Hornik (2008-03-12)
  * T => TRUE, F => FALSE

Changes in 0.59
---------------

o Fridolin Wild, Kurt Hornik (2007-12-18)
  * several patches for encoding problems

o Fridolin Wild (2007-12-11)
  * bug fix (R crashed when calling lsa_corpus demo):
    essay scoring demo now calls data files to avoid
    this unicode problem (seems to be a bug in R).
  * stopword lists converted to .rda data files
  * unicode bugfix in tests
  * unicode bugfix for German umlaut conversion from
    html-entities in textvector()
  * demo index readlines bugfix (two blank lines added)
  * landauer demo: X was using dimcalc_share() instead
    of dimcalc_raw()

o Fridolin Wild (2007-11-28)
  * Dutch stopword list added (thanks to Marco Kalz,
    Open University Netherlands)
  * UTF-8 support enforced in stopword list, package
    description, textmatrix
  * stemming bug fixed (stemming was _after_ filtering
    by controlled vocabulary)
  * testing routine added for one-term matrices
  * special characters cleaned in textvector()
  * Optimised support for Arabic Buckwalter transliterations
    (referring to the earlier request of Neal Snider,
    Stanford, below).
    Included the following characters to 'be' alphanumerics:
    ' $ | _ - ~ > < & { } * `
  * utf-8 conform umlaut replacement in textmatrix()
  * added warning for 'empty' files (empty after filtering)
    to textvector()

Changes in 0.58
---------------

o Fridolin Wild (2006-08-01)
  * added simple tag handling: tags are automatically
    removed (requested by Simon Lin, Northwestern @ Feb 23, 2006)
  * added Arabic support for Buckwalter transliterations
    (requested by Neal Snider, Stanford @ Feb 21, 2006)
  * changed textmatrix() / textvector() standard language
    to English
  * textmatrix can now automatically remove terms with
    only numbers (requested by Simon Lin, Northwestern
    @ Feb 23, 2006)
  * extended special character stripping ('#', '+', ...)
  * added upper and lower boundaries for global frequencies
  * demo for essay scoring added
  * data set with essays (corpus.6) added

o Fridolin Wild (2006-07-31)
  * added random sample function for corpus selection.
    The index can be returned to allow for re-use of the
    sample.
  * added dimcalc_fraction()
  * added support to textmatrix() to run not only over
    directories, but also over a single file or a vector
    of files (or a mixed vector with files and directories)
  * added maxWordLength filtering
  * added maxDocFreq filtering

o Jeff Verhulst (2006-04-21)
  * bugfix: print.textmatrix()
    bug appeared: 2nd of Jan 2006 (Claudia Mayr)
    fix provided by: Jeff Verhulst, J&J Pharma R&D IM (2006)

Changes in 0.57 (first public release)
--------------------------------------

o 2005-11-23:
  * a lot of minor changes to make documentation better
  * smaller code changes
  * renamed core functions to lsa(), as.textmatrix(),
    fold_in()

o 2005-11-22:
  * chose NOT to integrate separator lines
    (would splash the handling!)
  * changed summary.textmatrix from matrix to vector output

o 2005-11-12:
  * documentation refactored, added documentation for
    several new methods.
  * removed meanmax.R (doesn't fit the package)
  * checked query() to ensure it's working

o 2005-11-11:
  * bugfix of textmatrix() to work properly with the
    vocabulary list
  * textmatrix(): integrated the vocabulary order/sort
    functions...

o 2005-11-08:
  * added high-level functions:
      * lsa_fold-in
      * lsa
  * refactoring:
      * eliminated pseudo_docs -> integrate into textmatrix
      * connections for textmatrix turned out to be impossible
  * summary method
  * print method
  * rewrote "pseudo_docs" to table / factor
  * added vocabulary filter to textmatrix / textvector
  * in triples.r: use of "With(environment, { bla })"
    turned out impossible
  * getTriples: use of "return list(S=S, P=P, O=O)"
    turned out impossible

o 2005-10-04:
  * added nchar(..., type="chars") to count characters,
    not bytes

Changes in 0.47
---------------

o 2005-08-26:
  * renamed dt_triples to textvector and dt_matrix to
    textmatrix

Changes in 0.46
---------------

o 2005-08-25:
  * added "\\[|\\]|\\{|\\}" to gsub in textvector

--------------------------------------------------
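
The bracket and brace alternatives added in 0.46 can be tried in
isolation. This is only a sketch: it uses just the part of the pattern
quoted in the entry above, and replacing matches with a space is an
assumption, since the entry does not say what textvector substitutes.

```r
# Only the bracket/brace alternatives quoted in the 0.46 entry; the full
# pattern used by textvector is not shown in this ChangeLog.
pattern <- "\\[|\\]|\\{|\\}"

# Assumption: every match is replaced by a single space.
cleaned <- gsub(pattern, " ", "see [1] and {note}")
print(cleaned)
```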