CRAN Package Check Results for Package nc

Last updated on 2025-12-19 19:50:56 CET.

Flavor                             Version    Tinstall  Tcheck  Ttotal  Status  Flags
r-devel-linux-x86_64-debian-clang  2025.3.24      2.76  103.52  106.28  OK
r-devel-linux-x86_64-debian-gcc    2025.3.24      1.97   70.66   72.63  ERROR
r-devel-linux-x86_64-fedora-clang  2025.3.24      5.00  168.59  173.59  OK
r-devel-linux-x86_64-fedora-gcc    2025.3.24                    171.08  ERROR
r-devel-windows-x86_64             2025.3.24      4.00  684.00  688.00  OK
r-patched-linux-x86_64             2025.3.24      2.73   98.01  100.74  OK
r-release-linux-x86_64             2025.3.24      2.58   99.66  102.24  OK
r-release-macos-arm64              2025.3.24                            OK
r-release-macos-x86_64             2025.3.24      2.00  164.00  166.00  OK
r-release-windows-x86_64           2025.3.24      5.00  603.00  608.00  OK
r-oldrel-macos-arm64               2025.3.24                            OK
r-oldrel-macos-x86_64              2025.3.24      2.00  185.00  187.00  OK
r-oldrel-windows-x86_64            2025.3.24      5.00  789.00  794.00  OK

Check Details

Version: 2025.3.24
Check: examples
Result: ERROR
Running examples in ‘nc-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: capture_all_str
> ### Title: Capture all matches in a single subject string
> ### Aliases: capture_all_str
>
> ### ** Examples
>
>
> data.table::setDTthreads(1)
>
> chr.pos.vec <- c(
+ "chr10:213,054,000-213,055,000",
+ "chrM:111,000-222,000",
+ "this will not match",
+ NA, # neither will this.
+ "chr1:110-111 chr2:220-222") # two possible matches.
> keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
> ## By default elements of subject are treated as separate lines (and
> ## NAs are removed). Named arguments are used to create capture
> ## groups, and conversion functions such as keep.digits are used to
> ## convert the previously named group.
> int.pattern <- list("[0-9,]+", keep.digits)
> (match.dt <- nc::capture_all_str(
+ chr.pos.vec,
+ chrom="chr.*?",
+ ":",
+ chromStart=int.pattern,
+ "-",
+ chromEnd=int.pattern))
    chrom chromStart  chromEnd
   <char>      <int>     <int>
1:  chr10  213054000 213055000
2:   chrM     111000    222000
3:   chr1        110       111
4:   chr2        220       222
> str(match.dt)
Classes ‘data.table’ and 'data.frame': 4 obs. of 3 variables:
 $ chrom     : chr "chr10" "chrM" "chr1" "chr2"
 $ chromStart: int 213054000 111000 110 220
 $ chromEnd  : int 213055000 222000 111 222
 - attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
>
> ## Extract all fields from each alignment block, using two regex
> ## patterns, then dcast.
> info.txt.gz <- system.file( + "extdata", "SweeD_Info.txt.gz", package="nc") > info.vec <- readLines(info.txt.gz) > info.vec[24:40] [1] " Alignment 1" "" [3] "\t\tChromosome:\t\tscaffold_0" "\t\tSequences:\t\t14" [5] "\t\tSites:\t\t\t1670366" "\t\tDiscarded sites:\t1264068" [7] "" "\t\tProcessing:\t\t155.53 seconds" [9] "" "\t\tPosition:\t\t8.936200e+07" [11] "\t\tLikelihood:\t\t4.105582e+02" "\t\tAlpha:\t\t\t6.616326e-06" [13] "" "" [15] " Alignment 2" "" [17] "\t\tChromosome:\t\tscaffold_1" > info.dt <- nc::capture_all_str( + sub("Alignment ", "//", info.vec), + "//", + alignment="[0-9]+", + fields="[^/]+") > (fields.dt <- info.dt[, nc::capture_all_str( + fields, + "\t+", + variable="[^:]+", + ":\t*", + value=".*"), + by=alignment]) alignment variable value <char> <char> <char> 1: 1 Chromosome scaffold_0 2: 1 Sequences 14 3: 1 Sites 1670366 4: 1 Discarded sites 1264068 5: 1 Processing 155.53 seconds 6: 1 Position 8.936200e+07 7: 1 Likelihood 4.105582e+02 8: 1 Alpha 6.616326e-06 9: 2 Chromosome scaffold_1 10: 2 Sequences 14 11: 2 Sites 1447008 12: 2 Discarded sites 1093595 13: 2 Processing 138.83 seconds 14: 2 Position 8.722482e+07 15: 2 Likelihood 2.531514e+02 16: 2 Alpha 1.031963e-05 17: 3 Chromosome scaffold_2 18: 3 Sequences 14 19: 3 Sites 1379975 20: 3 Discarded sites 1043204 21: 3 Processing 134.50 seconds 22: 3 Position 8.461182e+07 23: 3 Likelihood 2.945708e+02 24: 3 Alpha 8.684652e-06 25: 4 Chromosome scaffold_3 26: 4 Sequences 14 27: 4 Sites 1293978 28: 4 Discarded sites 988465 29: 4 Processing 120.76 seconds 30: 4 Position 4.182126e+07 31: 4 Likelihood 6.110444e+02 32: 4 Alpha 3.335514e-06 33: 5 Chromosome scaffold_4 34: 5 Sequences 14 35: 5 Sites 1319920 36: 5 Discarded sites 1011446 37: 5 Processing 126.99 seconds 38: 5 Position 6.978721e+07 39: 5 Likelihood 2.884914e+02 40: 5 Alpha 1.062780e-05 41: 6 Chromosome scaffold_5 42: 6 Sequences 14 43: 6 Sites 1295460 44: 6 Discarded sites 990655 45: 6 Processing 119.64 seconds 46: 6 Position 8.837822e+07 
47: 6 Likelihood 3.304343e+02 48: 6 Alpha 7.572795e-06 49: 7 Chromosome scaffold_6 50: 7 Sequences 14 51: 7 Sites 1197964 52: 7 Discarded sites 908454 53: 7 Processing 115.17 seconds 54: 7 Position 3.444713e+07 55: 7 Likelihood 3.261829e+02 56: 7 Alpha 3.427719e-06 57: 8 Chromosome scaffold_7 58: 8 Sequences 14 59: 8 Sites 1315248 60: 8 Discarded sites 998530 61: 8 Processing 125.20 seconds 62: 8 Position 2.337819e+07 63: 8 Likelihood 4.023517e+02 64: 8 Alpha 5.350802e-06 65: 9 Chromosome scaffold_8 66: 9 Sequences 14 67: 9 Sites 1110658 68: 9 Discarded sites 845039 69: 9 Processing 109.15 seconds 70: 9 Position 8.152571e+07 71: 9 Likelihood 3.114815e+02 72: 9 Alpha 3.899136e-06 73: 10 Chromosome scaffold_9 74: 10 Sequences 14 75: 10 Sites 1091036 76: 10 Discarded sites 833765 77: 10 Processing 104.91 seconds 78: 10 Position 2.669453e+07 79: 10 Likelihood 1.829336e+02 80: 10 Alpha 8.380941e-06 alignment variable value > (fields.wide <- data.table::dcast(fields.dt, alignment ~ variable)) Key: <alignment> alignment Alpha Chromosome Discarded sites Likelihood Position <char> <char> <char> <char> <char> <char> 1: 1 6.616326e-06 scaffold_0 1264068 4.105582e+02 8.936200e+07 2: 10 8.380941e-06 scaffold_9 833765 1.829336e+02 2.669453e+07 3: 2 1.031963e-05 scaffold_1 1093595 2.531514e+02 8.722482e+07 4: 3 8.684652e-06 scaffold_2 1043204 2.945708e+02 8.461182e+07 5: 4 3.335514e-06 scaffold_3 988465 6.110444e+02 4.182126e+07 6: 5 1.062780e-05 scaffold_4 1011446 2.884914e+02 6.978721e+07 7: 6 7.572795e-06 scaffold_5 990655 3.304343e+02 8.837822e+07 8: 7 3.427719e-06 scaffold_6 908454 3.261829e+02 3.444713e+07 9: 8 5.350802e-06 scaffold_7 998530 4.023517e+02 2.337819e+07 10: 9 3.899136e-06 scaffold_8 845039 3.114815e+02 8.152571e+07 Processing Sequences Sites <char> <char> <char> 1: 155.53 seconds 14 1670366 2: 104.91 seconds 14 1091036 3: 138.83 seconds 14 1447008 4: 134.50 seconds 14 1379975 5: 120.76 seconds 14 1293978 6: 126.99 seconds 14 1319920 7: 119.64 seconds 14 
1295460 8: 115.17 seconds 14 1197964 9: 125.20 seconds 14 1315248 10: 109.15 seconds 14 1110658 > > ## Capture all csv tables in report -- the file name can be given as > ## the subject to nc::capture_all_str, which calls readLines to get > ## data to parse. > (report.txt.gz <- system.file( + "extdata", "SweeD_Report.txt.gz", package="nc")) [1] "/home/hornik/tmp/R.check/r-devel-gcc/Work/build/Packages/nc/extdata/SweeD_Report.txt.gz" > (report.dt <- nc::capture_all_str( + report.txt.gz, + "//", + alignment="[0-9]+", + "\n", + csv="[^/]+" + )[, { + data.table::fread(text=csv) + }, by=alignment]) alignment Position Likelihood Alpha <char> <num> <num> <num> 1: 1 700.0 4.637328e-03 2.763840e+02 2: 1 130585.6 3.781283e-01 8.490200e-04 3: 1 260471.2 3.602315e-02 4.691340e-03 4: 1 390356.9 7.618749e-01 5.377668e-04 5: 1 520242.5 2.979971e-08 1.411765e-01 --- 9996: 10 82991564.8 8.051006e-03 1.357819e-03 9997: 10 83074967.8 7.048433e-03 1.825764e-03 9998: 10 83158370.8 1.012360e-07 7.999999e-03 9999: 10 83241773.8 3.977189e-08 9.999997e-01 10000: 10 83325174.0 3.980538e-08 1.200000e+03 > > ## Join report with info fields. 
> report.dt[fields.wide, on=.(alignment)] alignment Position Likelihood Alpha i.Alpha Chromosome <char> <num> <num> <num> <char> <char> 1: 1 700.0 4.637328e-03 2.763840e+02 6.616326e-06 scaffold_0 2: 1 130585.6 3.781283e-01 8.490200e-04 6.616326e-06 scaffold_0 3: 1 260471.2 3.602315e-02 4.691340e-03 6.616326e-06 scaffold_0 4: 1 390356.9 7.618749e-01 5.377668e-04 6.616326e-06 scaffold_0 5: 1 520242.5 2.979971e-08 1.411765e-01 6.616326e-06 scaffold_0 --- 9996: 9 85297670.3 1.078915e-01 1.730811e-02 3.899136e-06 scaffold_8 9997: 9 85383396.6 2.282976e-02 2.002634e-02 3.899136e-06 scaffold_8 9998: 9 85469122.8 1.573487e+00 1.169200e-03 3.899136e-06 scaffold_8 9999: 9 85554849.1 6.892966e-02 5.344763e-03 3.899136e-06 scaffold_8 10000: 9 85640578.0 0.000000e+00 1.200000e+03 3.899136e-06 scaffold_8 Discarded sites i.Likelihood i.Position Processing Sequences <char> <char> <char> <char> <char> 1: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 2: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 3: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 4: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 5: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 --- 9996: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 9997: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 9998: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 9999: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 10000: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 Sites <char> 1: 1670366 2: 1670366 3: 1670366 4: 1670366 5: 1670366 --- 9996: 1110658 9997: 1110658 9998: 1110658 9999: 1110658 10000: 1110658 > > ## parsing nbib citation file. 
> (pmc.nbib <- system.file( + "extdata", "PMC3045577.nbib", package="nc")) [1] "/home/hornik/tmp/R.check/r-devel-gcc/Work/build/Packages/nc/extdata/PMC3045577.nbib" > blank <- "\n " > pmc.dt <- nc::capture_all_str( + pmc.nbib, + Abbreviation="[A-Z]+", + " *- ", + value=list( + ".*", + list(blank, ".*"), "*"), + function(x)sub(blank, "", x)) > str(pmc.dt) Classes ‘data.table’ and 'data.frame': 50 obs. of 2 variables: $ Abbreviation: chr "PMID" "OWN" "STAT" "DCOM" ... $ value : chr "21113027" "NLM" "MEDLINE" "20110512" ... - attr(*, ".internal.selfref")=<pointer: 0x56302a20f070> > > ## What do the variable fields mean? It is explained on > ## https://www.nlm.nih.gov/bsd/mms/medlineelements.html which has a > ## local copy in this package (downloaded 18 Sep 2019). > fields.html <- system.file( + "extdata", "MEDLINE_Fields.html", package="nc") > if(interactive())browseURL(fields.html) > fields.vec <- readLines(fields.html) > > ## It is pretty easy to capture fields and abbreviations if gsub > ## used to remove some tags first. > no.strong <- gsub("</?strong>", "", fields.vec) > no.comments <- gsub("<!--.*?-->", "", no.strong) > ## grep then capture_first_vec can be used if each desired row in > ## the output comes from a single line of the input file. 
> (h3.vec <- grep("<h3", no.comments, value=TRUE)) [1] "<h3><a id=\"ab\" name=\"ab\"></a>Abstract (AB)</h3>" [2] "<h3><a id=\"ci\" name=\"ci\"></a>Copyright Information (CI)</h3>" [3] "<h3><a id=\"ad\" name=\"ad\"></a>Affiliation (AD)</h3>" [4] "<h3><a id=\"irad\" name=\"irad\"></a>Investigator Affiliation (IRAD)</h3>" [5] "<h3><a id=\"aid\" name=\"aid\"></a>Article Identifier (AID)</h3>" [6] "<h3><a id=\"au\" name=\"au\"></a>Author (AU)</h3>" [7] "<h3><a id=\"auid\" name=\"auid\"></a>Author Identifier (AUID)</h3>" [8] "<h3><a id=\"fau\" name=\"fau\"></a>Full Author (FAU)</h3>" [9] "<h3><a id=\"cc2\" name=\"bti\"></a>Book Title (BTI)</h3>" [10] "<h3><a id=\"cc4\" name=\"cti\"></a>Collection Title (CTI)</h3>" [11] "<h3><a id=\"cc\" name=\"cc\"></a>Comments/Corrections (See fields and field tags listed below.)</h3>" [12] "<h3><a id=\"coi\" name=\"coi\"></a>Conflict of Interest Statement (COIS)</h3>" [13] "<h3><a id=\"cn\" name=\"cn\"></a>Corporate Author (CN)</h3>" [14] "<h3><a id=\"dcom2\" name=\"crdt\"></a>Create Date (CRDT)</h3>" [15] "<h3><a id=\"dcom\" name=\"dcom\"></a>Date Completed (DCOM)</h3>" [16] "<h3><a id=\"da\" name=\"da\"></a>Date Created (DA)</h3>" [17] "<h3><a id=\"lr\" name=\"lr\"></a>Date Last Revised (LR)</h3>" [18] "<h3><a id=\"dep\" name=\"dep\"></a>Date of Electronic Publication (DEP)</h3>" [19] "<h3><a id=\"dp\" name=\"dp\"></a>Date of Publication (DP)</h3>" [20] "<h3><a id=\"edat2\" name=\"ed\"></a>Editor (ED) and Full Editor Name (FED)</h3>" [21] "<h3><a id=\"edat3\" name=\"en\"></a>Edition (EN)</h3>" [22] "<h3><a id=\"edat\" name=\"edat\"></a>Entrez Date (EDAT)</h3>" [23] "<h3><a id=\"gs\" name=\"gs\"></a>Gene Symbol (GS): not currently input</h3>" [24] "<h3><a id=\"gn\" name=\"gn\"></a>General Note (GN)</h3>" [25] "<h3><a id=\"gr\" name=\"gr\"></a>Grant Number (GR)</h3>" [26] "<h3><a id=\"ir\" name=\"ir\"></a>Investigator Name (IR) and Full Investigator Name (FIR)</h3>" [27] "<h3><a id=\"is2\" name=\"isbn\"></a>ISBN (ISBN)</h3>" [28] 
"<h3><a id=\"is\" name=\"is\"></a>ISSN (IS)</h3>" [29] "<h3><a id=\"ip\" name=\"ip\"></a>Issue (IP)</h3>" [30] "<h3><a id=\"ta\" name=\"ta\"></a>Journal Title Abbreviation (TA)</h3>" [31] "<h3><a id=\"jt\" name=\"jt\"></a>Journal Title (JT)</h3>" [32] "<h3><a id=\"la\" name=\"la\"></a>Language (LA)</h3>" [33] "<h3><a id=\"la3\" name=\"lid\"></a>Location Identifier (LID)</h3>" [34] "<h3><a id=\"la2\" name=\"mid\"></a>Manuscript Identifier (MID)</h3>" [35] "<h3><a id=\"mhda\" name=\"mhda\"></a>MeSH Date (MHDA)</h3>" [36] "<h3><a id=\"mh\" name=\"mh\"></a>MeSH Terms (MH)</h3>" [37] "<h3><a id=\"jid\" name=\"jid\"></a>NLM Unique ID (JID)</h3>" [38] "<h3><a id=\"rf\" name=\"rf\"></a>Number of References (RF)</h3>" [39] "<h3><a id=\"oab\" name=\"oab\"></a>Other Abstract (OAB)</h3>" [40] "<h3><a id=\"oci\" name=\"oci\"></a>Other Copyright Information (OCI)</h3>" [41] "<h3><a id=\"oid\" name=\"oid\"></a>Other ID (OID)</h3>" [42] "<h3><a id=\"ot\" name=\"ot\"></a>Other Term (OT)</h3>" [43] "<h3><a id=\"oto\" name=\"oto\"></a>Other Term Owner (OTO)</h3>" [44] "<h3><a id=\"own\" name=\"own\"></a>Owner (OWN)</h3>" [45] "<h3><a id=\"pg\" name=\"pg\"></a>Pagination (PG)</h3>" [46] "<h3><a id=\"ps\" name=\"ps\"></a>Personal Name as Subject (PS)</h3>" [47] "<h3><a id=\"fps\" name=\"fps\"></a>Full Personal Name as Subject (FPS)</h3>" [48] "<h3><a id=\"pl\" name=\"pl\"></a>Place of Publication (PL)</h3>" [49] "<h3><a id=\"phst\" name=\"phst\"></a>Publication History Status (PHST)</h3>" [50] "<h3><a id=\"pst\" name=\"pst\"></a>Publication Status (PST)</h3>" [51] "<h3><a id=\"pt\" name=\"pt\"></a>Publication Type (PT)</h3>" [52] "<h3><a id=\"pubm\" name=\"pubm\"></a>Publishing Model (PUBM)</h3>" [53] "<h3><a id=\"pmid2\" name=\"pmc\"></a>PubMed Central Identifer (PMC)</h3>" [54] "<h3><a id=\"pmid3\" name=\"pmcr\"></a>PubMed Central Release (PMCR)</h3>" [55] "<h3><a id=\"pmid\" name=\"pmid\"></a>PubMed Unique Identifier (PMID)</h3>" [56] "<h3><a id=\"rn\" name=\"rn\"></a>Registry 
Number/EC Number (RN)</h3>" [57] "<h3><a id=\"nm\" name=\"nm\"></a>Substance Name (NM)</h3>" [58] "<h3><a id=\"si\" name=\"si\"></a>Secondary Source ID (SI)</h3>" [59] "<h3><a id=\"so\" name=\"so\"></a>Source (SO)</h3>" [60] "<h3><a id=\"sfm\" name=\"sfm\"></a>Space Flight Mission (SFM)</h3>" [61] "<h3><a id=\"stat\" name=\"stat\"></a>Status (STAT)</h3>" [62] "<h3><a id=\"sb\" name=\"sb\"></a>Subset (SB)</h3>" [63] "<h3><a id=\"ti\" name=\"ti\"></a>Title (TI)</h3>" [64] "<h3><a id=\"tt\" name=\"tt\"></a>Transliterated Title (TT)</h3>" [65] "<h3><a id=\"vi\" name=\"vi\"></a>Volume (VI)</h3>" [66] "<h3><a id=\"cc3\" name=\"vti\"></a>Volume Title (VTI)</h3>" > h3.pattern <- list( + nc::field("name", '="', '[^"]+'), + '"></a>', + fields.abbrevs="[^<]+") > first.fields.dt <- nc::capture_first_vec( + h3.vec, h3.pattern) > field.abbrev.pattern <- list( + Field=".*?", + " \\(", + Abbreviation="[^)]+", + "\\)", + "(?: and |$)?") > (first.each.field <- first.fields.dt[, nc::capture_all_str( + fields.abbrevs, field.abbrev.pattern), + by=fields.abbrevs]) fields.abbrevs <char> 1: Abstract (AB) 2: Copyright Information (CI) 3: Affiliation (AD) 4: Investigator Affiliation (IRAD) 5: Article Identifier (AID) 6: Author (AU) 7: Author Identifier (AUID) 8: Full Author (FAU) 9: Book Title (BTI) 10: Collection Title (CTI) 11: Comments/Corrections (See fields and field tags listed below.) 
12: Conflict of Interest Statement (COIS) 13: Corporate Author (CN) 14: Create Date (CRDT) 15: Date Completed (DCOM) 16: Date Created (DA) 17: Date Last Revised (LR) 18: Date of Electronic Publication (DEP) 19: Date of Publication (DP) 20: Editor (ED) and Full Editor Name (FED) 21: Editor (ED) and Full Editor Name (FED) 22: Edition (EN) 23: Entrez Date (EDAT) 24: Gene Symbol (GS): not currently input 25: General Note (GN) 26: Grant Number (GR) 27: Investigator Name (IR) and Full Investigator Name (FIR) 28: Investigator Name (IR) and Full Investigator Name (FIR) 29: ISBN (ISBN) 30: ISSN (IS) 31: Issue (IP) 32: Journal Title Abbreviation (TA) 33: Journal Title (JT) 34: Language (LA) 35: Location Identifier (LID) 36: Manuscript Identifier (MID) 37: MeSH Date (MHDA) 38: MeSH Terms (MH) 39: NLM Unique ID (JID) 40: Number of References (RF) 41: Other Abstract (OAB) 42: Other Copyright Information (OCI) 43: Other ID (OID) 44: Other Term (OT) 45: Other Term Owner (OTO) 46: Owner (OWN) 47: Pagination (PG) 48: Personal Name as Subject (PS) 49: Full Personal Name as Subject (FPS) 50: Place of Publication (PL) 51: Publication History Status (PHST) 52: Publication Status (PST) 53: Publication Type (PT) 54: Publishing Model (PUBM) 55: PubMed Central Identifer (PMC) 56: PubMed Central Release (PMCR) 57: PubMed Unique Identifier (PMID) 58: Registry Number/EC Number (RN) 59: Substance Name (NM) 60: Secondary Source ID (SI) 61: Source (SO) 62: Space Flight Mission (SFM) 63: Status (STAT) 64: Subset (SB) 65: Title (TI) 66: Transliterated Title (TT) 67: Volume (VI) 68: Volume Title (VTI) fields.abbrevs Field Abbreviation <char> <char> 1: Abstract AB 2: Copyright Information CI 3: Affiliation AD 4: Investigator Affiliation IRAD 5: Article Identifier AID 6: Author AU 7: Author Identifier AUID 8: Full Author FAU 9: Book Title BTI 10: Collection Title CTI 11: Comments/Corrections See fields and field tags listed below. 
12: Conflict of Interest Statement COIS 13: Corporate Author CN 14: Create Date CRDT 15: Date Completed DCOM 16: Date Created DA 17: Date Last Revised LR 18: Date of Electronic Publication DEP 19: Date of Publication DP 20: Editor ED 21: Full Editor Name FED 22: Edition EN 23: Entrez Date EDAT 24: Gene Symbol GS 25: General Note GN 26: Grant Number GR 27: Investigator Name IR 28: Full Investigator Name FIR 29: ISBN ISBN 30: ISSN IS 31: Issue IP 32: Journal Title Abbreviation TA 33: Journal Title JT 34: Language LA 35: Location Identifier LID 36: Manuscript Identifier MID 37: MeSH Date MHDA 38: MeSH Terms MH 39: NLM Unique ID JID 40: Number of References RF 41: Other Abstract OAB 42: Other Copyright Information OCI 43: Other ID OID 44: Other Term OT 45: Other Term Owner OTO 46: Owner OWN 47: Pagination PG 48: Personal Name as Subject PS 49: Full Personal Name as Subject FPS 50: Place of Publication PL 51: Publication History Status PHST 52: Publication Status PST 53: Publication Type PT 54: Publishing Model PUBM 55: PubMed Central Identifer PMC 56: PubMed Central Release PMCR 57: PubMed Unique Identifier PMID 58: Registry Number/EC Number RN 59: Substance Name NM 60: Secondary Source ID SI 61: Source SO 62: Space Flight Mission SFM 63: Status STAT 64: Subset SB 65: Title TI 66: Transliterated Title TT 67: Volume VI 68: Volume Title VTI Field Abbreviation > > ## If we want to capture the information after the initial h3 line > ## of the input, e.g. the rest column below which contains a > ## description/example for each field, then capture_all_str can be > ## used on the full input file. > h3.fields.dt <- nc::capture_all_str( + no.comments, + h3.pattern, + '</h3>\n', + rest="(?:.*\n)+?", #exercise: get the examples. 
+ "<hr />\n") > (h3.each.field <- h3.fields.dt[, nc::capture_all_str( + fields.abbrevs, field.abbrev.pattern), + by=fields.abbrevs]) fields.abbrevs <char> 1: Abstract (AB) 2: Copyright Information (CI) 3: Affiliation (AD) 4: Investigator Affiliation (IRAD) 5: Article Identifier (AID) 6: Author (AU) 7: Author Identifier (AUID) 8: Full Author (FAU) 9: Book Title (BTI) 10: Collection Title (CTI) 11: Comments/Corrections (See fields and field tags listed below.) 12: Conflict of Interest Statement (COIS) 13: Corporate Author (CN) 14: Create Date (CRDT) 15: Date Completed (DCOM) 16: Date Created (DA) 17: Date Last Revised (LR) 18: Date of Electronic Publication (DEP) 19: Date of Publication (DP) 20: Editor (ED) and Full Editor Name (FED) 21: Editor (ED) and Full Editor Name (FED) 22: Edition (EN) 23: Entrez Date (EDAT) 24: Gene Symbol (GS): not currently input 25: General Note (GN) 26: Grant Number (GR) 27: Investigator Name (IR) and Full Investigator Name (FIR) 28: Investigator Name (IR) and Full Investigator Name (FIR) 29: ISBN (ISBN) 30: ISSN (IS) 31: Issue (IP) 32: Journal Title Abbreviation (TA) 33: Journal Title (JT) 34: Language (LA) 35: Location Identifier (LID) 36: Manuscript Identifier (MID) 37: MeSH Date (MHDA) 38: MeSH Terms (MH) 39: NLM Unique ID (JID) 40: Number of References (RF) 41: Other Abstract (OAB) 42: Other Copyright Information (OCI) 43: Other ID (OID) 44: Other Term (OT) 45: Other Term Owner (OTO) 46: Owner (OWN) 47: Pagination (PG) 48: Personal Name as Subject (PS) 49: Full Personal Name as Subject (FPS) 50: Place of Publication (PL) 51: Publication History Status (PHST) 52: Publication Status (PST) 53: Publication Type (PT) 54: Publishing Model (PUBM) 55: PubMed Central Identifer (PMC) 56: PubMed Central Release (PMCR) 57: PubMed Unique Identifier (PMID) 58: Registry Number/EC Number (RN) 59: Substance Name (NM) 60: Secondary Source ID (SI) 61: Source (SO) 62: Space Flight Mission (SFM) 63: Status (STAT) 64: Subset (SB) 65: Title (TI) 66: 
Transliterated Title (TT) 67: Volume (VI) 68: Volume Title (VTI) fields.abbrevs Field Abbreviation <char> <char> 1: Abstract AB 2: Copyright Information CI 3: Affiliation AD 4: Investigator Affiliation IRAD 5: Article Identifier AID 6: Author AU 7: Author Identifier AUID 8: Full Author FAU 9: Book Title BTI 10: Collection Title CTI 11: Comments/Corrections See fields and field tags listed below. 12: Conflict of Interest Statement COIS 13: Corporate Author CN 14: Create Date CRDT 15: Date Completed DCOM 16: Date Created DA 17: Date Last Revised LR 18: Date of Electronic Publication DEP 19: Date of Publication DP 20: Editor ED 21: Full Editor Name FED 22: Edition EN 23: Entrez Date EDAT 24: Gene Symbol GS 25: General Note GN 26: Grant Number GR 27: Investigator Name IR 28: Full Investigator Name FIR 29: ISBN ISBN 30: ISSN IS 31: Issue IP 32: Journal Title Abbreviation TA 33: Journal Title JT 34: Language LA 35: Location Identifier LID 36: Manuscript Identifier MID 37: MeSH Date MHDA 38: MeSH Terms MH 39: NLM Unique ID JID 40: Number of References RF 41: Other Abstract OAB 42: Other Copyright Information OCI 43: Other ID OID 44: Other Term OT 45: Other Term Owner OTO 46: Owner OWN 47: Pagination PG 48: Personal Name as Subject PS 49: Full Personal Name as Subject FPS 50: Place of Publication PL 51: Publication History Status PHST 52: Publication Status PST 53: Publication Type PT 54: Publishing Model PUBM 55: PubMed Central Identifer PMC 56: PubMed Central Release PMCR 57: PubMed Unique Identifier PMID 58: Registry Number/EC Number RN 59: Substance Name NM 60: Secondary Source ID SI 61: Source SO 62: Space Flight Mission SFM 63: Status STAT 64: Subset SB 65: Title TI 66: Transliterated Title TT 67: Volume VI 68: Volume Title VTI Field Abbreviation > > ## Either method of capturing abbreviations gives the same result. 
> identical(first.each.field, h3.each.field) [1] TRUE > > ## but the capture_all_str method returns the additional rest column > ## which contains data after the initial h3 line. > names(first.fields.dt) [1] "name" "fields.abbrevs" > names(h3.fields.dt) [1] "name" "fields.abbrevs" "rest" > cat(h3.fields.dt[fields.abbrevs=="Volume (VI)", rest]) <p>The volume number of the journal in which the article was published is recorded here.</p> <p class="examplekm">Examples:<br />VI - 7<br />VI - 5 Spec No<br />VI - 49 Suppl 20</p> <p>Some records (especially records from <a href="/databases/databases_oldmedline.html">OLDMEDLINE</a>) contain the Issue field but lack the Volume field; some contain the Volume field but lack the Issue field; and some records contain Volume and Issue data in the Volume element.</p> > > ## There are 66 Field rows across three tables. > a.href <- list('<a href=[^>]+>') > (td.vec <- fields.vec[240:280]) [1] "<td><a href=\"#ab\">Abstract</a></td>" [2] "<td><a href=\"#ab\">(AB)</a></td>" [3] "</tr>" [4] "<tr style=\"background-color: #cccccc;\">" [5] "<td><a href=\"#ci\">Copyright Information</a></td>" [6] "<td>" [7] "<div><a href=\"#ci\">(CI)</a></div>" [8] "</td>" [9] "</tr>" [10] "<tr>" [11] "<td><a href=\"#ad\">Affiliation</a></td>" [12] "<td>" [13] "<div><a href=\"#ad\">(AD)</a></div>" [14] "</td>" [15] "</tr>" [16] "<tr style=\"background-color: #cccccc;\">" [17] "<td><a href=\"#irad\">Investigator Affiliation</a></td>" [18] "<td>" [19] "<div><a href=\"#irad\">(IRAD)</a></div>" [20] "</td>" [21] "</tr>" [22] "<tr>" [23] "<td><a href=\"#aid\">Article Identifier</a></td>" [24] "<td>" [25] "<div><a href=\"#aid\">(AID)</a></div>" [26] "</td>" [27] "</tr>" [28] "<tr style=\"background-color: #cccccc;\">" [29] "<td><a href=\"#au\">Author</a></td>" [30] "<td>" [31] "<div><a href=\"#au\">(AU)</a></div>" [32] "</td>" [33] "</tr>" [34] "<tr>" [35] "<td><a href=\"#auid\">Author Identifier</a></td>" [36] "<td><a href=\"#auid\">(AUID)</a></td>" [37] "</tr>" 
[38] "<tr>" [39] "<td style=\"background-color: #cccccc;\"><a href=\"#fau\">Full Author</a></td>" [40] "<td style=\"background-color: #cccccc;\">" [41] "<div><a href=\"#fau\">(FAU)</a></div>" > fields.pattern <- list( + "<td.*?>", + a.href, + Fields="[^()<]+", + "</a></td>\n") > (td.only.Fields <- nc::capture_all_str(fields.vec, fields.pattern)) Fields <char> 1: Abstract 2: Copyright Information 3: Affiliation 4: Investigator Affiliation 5: Article Identifier 6: Author 7: Author Identifier 8: Full Author 9: Book Title 10: Collection Title 11: Comments/Corrections 12: Conflict of Interest Statement 13: Corporate Author 14: Create Date 15: Date Completed 16: Date Created 17: Date Last Revised 18: Date of Electronic Publication 19: Date of Publication 20: Edition 21: Editor and Full Editor Name 22: Entrez Date 23: Gene Symbol 24: General Note 25: Grant Number 26: Investigator Name and Full Investigator Name 27: ISBN 28: ISSN 29: Issue 30: Journal Title Abbreviation 31: Journal Title 32: Language 33: Location Identifier 34: Manuscript Identifier 35: MeSH Date 36: MeSH Terms 37: NLM Unique ID 38: Number of References 39: Other Abstract 40: Other Copyright Information 41: Other ID 42: Other Term 43: Other Term Owner 44: Owner 45: Pagination 46: Personal Name as Subject 47: Full Personal Name as Subject 48: Place of Publication 49: Publication History Status 50: Publication Status 51: Publication Type 52: Publishing Model 53: PubMed Central Identifier 54: PubMed Central Release 55: PubMed Unique Identifier 56: Registry Number/EC Number 57: Substance Name 58: Secondary Source ID 59: Source 60: Space Flight Mission 61: Status 62: Subset 63: Title 64: Transliterated Title 65: Volume 66: Volume Title Fields > > ## Extract Fields and Abbreviations. Careful: most fields have one > ## abbreviation, but one field has none, and two fields have two. 
> (td.fields.dt <- nc::capture_all_str( + fields.vec, + fields.pattern, + "<td[^>]*>", + "(?:\n<div>)?", + a.href, "?", + abbrevs=".*?", + "</")) Fields abbrevs <char> <char> 1: Abstract (AB) 2: Copyright Information (CI) 3: Affiliation (AD) 4: Investigator Affiliation (IRAD) 5: Article Identifier (AID) 6: Author (AU) 7: Author Identifier (AUID) 8: Full Author (FAU) 9: Book Title (BTI) 10: Collection Title (CTI) 11: Comments/Corrections &nbsp; 12: Conflict of Interest Statement (COIS) 13: Corporate Author (CN) 14: Create Date (CRDT) 15: Date Completed (DCOM) 16: Date Created (DA) 17: Date Last Revised (LR) 18: Date of Electronic Publication (DEP) 19: Date of Publication (DP) 20: Edition (EN) 21: Editor and Full Editor Name (ED)<br />(FED) 22: Entrez Date (EDAT) 23: Gene Symbol (GS) 24: General Note (GN) 25: Grant Number (GR) 26: Investigator Name and Full Investigator Name (IR) (FIR) 27: ISBN (ISBN) 28: ISSN (IS) 29: Issue (IP) 30: Journal Title Abbreviation (TA) 31: Journal Title (JT) 32: Language (LA) 33: Location Identifier (LID) 34: Manuscript Identifier (MID) 35: MeSH Date (MHDA) 36: MeSH Terms (MH) 37: NLM Unique ID (JID) 38: Number of References (RF) 39: Other Abstract (OAB) 40: Other Copyright Information (OCI) 41: Other ID (OID) 42: Other Term (OT) 43: Other Term Owner (OTO) 44: Owner (OWN) 45: Pagination (PG) 46: Personal Name as Subject (PS) 47: Full Personal Name as Subject (FPS) 48: Place of Publication (PL) 49: Publication History Status (PHST) 50: Publication Status (PST) 51: Publication Type (PT) 52: Publishing Model (PUBM) 53: PubMed Central Identifier (PMC) 54: PubMed Central Release (PMCR) 55: PubMed Unique Identifier (PMID) 56: Registry Number/EC Number (RN) 57: Substance Name (NM) 58: Secondary Source ID (SI) 59: Source (SO) 60: Space Flight Mission (SFM) 61: Status (STAT) 62: Subset (SB) 63: Title (TI) 64: Transliterated Title (TT) 65: Volume (VI) 66: Volume Title (VTI) Fields abbrevs > > ## Get each individual abbreviation from the previously 
captured td > ## data. > td.each.field <- td.fields.dt[, { + f <- nc::capture_all_str( + Fields, + Field=".*?", + "(?:$| and )") + a <- nc::capture_all_str( + abbrevs, + "\\(", + Abbreviation="[^)]+", + "\\)") + if(nrow(a)==0)list() else cbind(f, a) + }, by=Fields] > str(td.each.field) Classes ‘data.table’ and 'data.frame': 67 obs. of 3 variables: $ Fields : chr "Abstract" "Copyright Information" "Affiliation" "Investigator Affiliation" ... $ Field : chr "Abstract" "Copyright Information" "Affiliation" "Investigator Affiliation" ... $ Abbreviation: chr "AB" "CI" "AD" "IRAD" ... - attr(*, ".internal.selfref")=<pointer: 0x56302a20f070> > td.each.field[td.fields.dt, .( + count=.N + ), on=.(Fields), by=.EACHI][order(count)] Fields count <char> <int> 1: Comments/Corrections 0 2: Abstract 1 3: Copyright Information 1 4: Affiliation 1 5: Investigator Affiliation 1 6: Article Identifier 1 7: Author 1 8: Author Identifier 1 9: Full Author 1 10: Book Title 1 11: Collection Title 1 12: Conflict of Interest Statement 1 13: Corporate Author 1 14: Create Date 1 15: Date Completed 1 16: Date Created 1 17: Date Last Revised 1 18: Date of Electronic Publication 1 19: Date of Publication 1 20: Edition 1 21: Entrez Date 1 22: Gene Symbol 1 23: General Note 1 24: Grant Number 1 25: ISBN 1 26: ISSN 1 27: Issue 1 28: Journal Title Abbreviation 1 29: Journal Title 1 30: Language 1 31: Location Identifier 1 32: Manuscript Identifier 1 33: MeSH Date 1 34: MeSH Terms 1 35: NLM Unique ID 1 36: Number of References 1 37: Other Abstract 1 38: Other Copyright Information 1 39: Other ID 1 40: Other Term 1 41: Other Term Owner 1 42: Owner 1 43: Pagination 1 44: Personal Name as Subject 1 45: Full Personal Name as Subject 1 46: Place of Publication 1 47: Publication History Status 1 48: Publication Status 1 49: Publication Type 1 50: Publishing Model 1 51: PubMed Central Identifier 1 52: PubMed Central Release 1 53: PubMed Unique Identifier 1 54: Registry Number/EC Number 1 55: Substance Name 1 
56: Secondary Source ID 1 57: Source 1 58: Space Flight Mission 1 59: Status 1 60: Subset 1 61: Title 1 62: Transliterated Title 1 63: Volume 1 64: Volume Title 1 65: Editor and Full Editor Name 2 66: Investigator Name and Full Investigator Name 2 Fields count > > ## There is a typo in the data captured from the h3 headings. > td.each.field[!Field %in% h3.each.field$Field] Fields Field Abbreviation <char> <char> <char> 1: PubMed Central Identifier PubMed Central Identifier PMC > h3.each.field[!Field %in% td.each.field$Field] fields.abbrevs <char> 1: Comments/Corrections (See fields and field tags listed below.) 2: PubMed Central Identifer (PMC) Field Abbreviation <char> <char> 1: Comments/Corrections See fields and field tags listed below. 2: PubMed Central Identifer PMC > > ## Abbreviations are consistent. > td.each.field[!Abbreviation %in% h3.each.field$Abbreviation] Empty data.table (0 rows and 3 cols): Fields,Field,Abbreviation > h3.each.field[!Abbreviation %in% td.each.field$Abbreviation] fields.abbrevs <char> 1: Comments/Corrections (See fields and field tags listed below.) Field Abbreviation <char> <char> 1: Comments/Corrections See fields and field tags listed below. > > ## There is a a table that provides a description of each comment > ## type. 
> (comment.vec <- fields.vec[840:860]) [1] "<tr>" [2] "<th><strong>Comment or Correction Type</strong></th>" [3] "<th><strong>MEDLINE Display Field Tag</strong></th>" [4] "<th><strong>Description</strong></th>" [5] "</tr>" [6] "<tr>" [7] "<td><strong>Comment in</strong></td>" [8] "<td><strong>(CIN)</strong></td>" [9] "<td>cites the reference containing a commentary about the article (appears on citation for original article); began use with journal issues published in 1989.</td>" [10] "</tr>" [11] "<tr>" [12] "<td><strong>Comment on</strong></td>" [13] "<td><strong>(CON)</strong></td>" [14] "<td>cites the reference upon which the article comments; began use with journal issues published in 1989.</td>" [15] "</tr>" [16] "<tr>" [17] "<td><strong>Erratum in</strong></td>" [18] "<td><strong>(EIN)</strong></td>" [19] "<td>cites a published erratum to the article (appears on citation for original article); began use in 1987.</td>" [20] "</tr>" [21] "<tr>" > comment.dt <- nc::capture_all_str( + fields.vec, + "<td><strong>", + Field="[^<]+", + "</strong></td>\n", + "<td><strong>\\(", + Abbreviation="[^)]+", + "\\)</strong></td>\n", + "<td>", + description=".*", + "</td>\n") > str(comment.dt) Classes ‘data.table’ and 'data.frame': 18 obs. of 3 variables: $ Field : chr "Comment in" "Comment on" "Erratum in" "Erratum for" ... $ Abbreviation: chr "CIN" "CON" "EIN" "EFR" ... $ description : chr "cites the reference containing a commentary about the article (appears on citation for original article); began"| __truncated__ "cites the reference upon which the article comments; began use with journal issues published in 1989." "cites a published erratum to the article (appears on citation for original article); began use in 1987." "cites the original article for which there is a published erratum. As of 2016, partial retractions are considered errata." ... 
- attr(*, ".internal.selfref")=<pointer: 0x56302a20f070> > > ## Join to original PMC citation file in order to see what the > ## abbreviations used in that file mean. > all.abbrevs <- rbind( + td.each.field[, .(Field, Abbreviation)], + comment.dt[, .(Field, Abbreviation)]) > all.abbrevs[pmc.dt, .( + Abbreviation, + Field, + value=substr(value, 1, 20) + ), on=.(Abbreviation)] Abbreviation Field value <char> <char> <char> 1: PMID PubMed Unique Identifier 21113027 2: OWN Owner NLM 3: STAT Status MEDLINE 4: DCOM Date Completed 20110512 5: LR Date Last Revised 20181113 6: IS ISSN 1362-4962 (Electroni 7: IS ISSN 0305-1048 (Print) 8: IS ISSN 0305-1048 (Linking) 9: VI Volume 39 10: IP Issue 4 11: DP Date of Publication 2011 Mar 12: TI Title A manually curated C 13: PG Pagination e25 14: LID Location Identifier 10.1093/nar/gkq1187 15: AB Abstract Chromatin immunoprec 16: FAU Full Author Rye, Morten Beck 17: AU Author Rye MB 18: AD Affiliation Department of Cancer 19: FAU Full Author Sætrom, Pål 20: AU Author Sætrom P 21: FAU Full Author Drabløs, Finn 22: AU Author Drabløs F 23: LA Language eng 24: PT Publication Type Evaluation Studies 25: PT Publication Type Journal Article 26: PT Publication Type Research Support, No 27: DEP Date of Electronic Publication 20101126 28: TA Journal Title Abbreviation Nucleic Acids Res 29: JT Journal Title Nucleic acids resear 30: JID NLM Unique ID 0411011 31: RN Registry Number/EC Number 0 (Transcription Fac 32: SB Subset IM 33: MH MeSH Terms Benchmarking 34: MH MeSH Terms Binding Sites 35: MH MeSH Terms *Chromatin Immunopre 36: MH MeSH Terms *High-Throughput Nuc 37: MH MeSH Terms *Software 38: MH MeSH Terms Transcription Factor 39: PMC PubMed Central Identifier PMC3045577 40: EDAT Entrez Date 2010/11/30 06:00 41: MHDA MeSH Date 2011/05/13 06:00 42: CRDT Create Date 2010/11/30 06:00 43: PHST Publication History Status 2010/11/30 06:00 [en 44: PHST Publication History Status 2010/11/30 06:00 [pu 45: PHST Publication History Status 2011/05/13 
06:00 [me 46: AID Article Identifier 10.1093/nar/gkq1187 47: AID Article Identifier gkq1187 [pii] 48: AID Article Identifier gkq1187 [pii] 49: PST Publication Status ppublish 50: SO Source Nucleic Acids Res. 2 Abbreviation Field value > > ## There is a listing of examples for each comment type. > (comment.ex.dt <- nc::capture_all_str( + fields.vec[938], + "br />\\s*", + Abbreviation="[A-Z]+", + "\\s*-\\s*", + citation="[^<]+?", + list( + "[.] ", + nc::field("PMID", ": ", "[0-9]+") + ), "?", + "<")) Abbreviation citation <char> <char> 1: CON Dev Cell. 2002 Jul;3(1):85-97 2: CIN N Engl J Med. 2003 Jul 17;349(3):211-2 3: CRI Orthop Nurs. 2003 May-Jun;22(3):232-9 4: CRF Biochemistry. 1994 May 10;33(18):5614-22 5: EIN Acta Obstet Gynecol Scand. 2003 Jan;82(1):102 6: EFR J Arthroplasty. 2002 Jun;17(4):524-6 7: RIN J Biochem Mol Biol. 2002 Nov 30;35(6):642 8: ROF Ware FE, Lehrman MA. J Biol Chem. 1996 Jun 14;271(24):13935-8 9: UIN Cochrane Database Syst Rev. 2002;(3):CD003688 10: UOF Cochrane Database Syst Rev. 2002;(2):CD003680 11: SPIN Ann Intern Med. 2003 Jun 3;138(11):I60 12: ORI Ann Intern Med. 2003 Jun 3;138(11):907-16 PMID <char> 1: 12110170 2: 12867604 3: 12872752 4: 8180186 5: 6: 12066289 7: 12476908 8: 8663248 9: 12137706 10: 12076500 11: 12779314 12: 12779301 > > ## Join abbreviations to see what kind of comments. > all.abbrevs[comment.ex.dt, on=.(Abbreviation)] Field Abbreviation <char> <char> 1: Comment on CON 2: Comment in CIN 3: Corrected and Republished in CRI 4: Corrected and Republished from CRF 5: Erratum in EIN 6: Erratum for EFR 7: Retraction in RIN 8: Retraction of ROF 9: Update in UIN 10: Update of UOF 11: Summary for patients in SPIN 12: Original report in ORI citation PMID <char> <char> 1: Dev Cell. 2002 Jul;3(1):85-97 12110170 2: N Engl J Med. 2003 Jul 17;349(3):211-2 12867604 3: Orthop Nurs. 2003 May-Jun;22(3):232-9 12872752 4: Biochemistry. 1994 May 10;33(18):5614-22 8180186 5: Acta Obstet Gynecol Scand. 2003 Jan;82(1):102 6: J Arthroplasty. 
2002 Jun;17(4):524-6 12066289 7: J Biochem Mol Biol. 2002 Nov 30;35(6):642 12476908 8: Ware FE, Lehrman MA. J Biol Chem. 1996 Jun 14;271(24):13935-8 8663248 9: Cochrane Database Syst Rev. 2002;(3):CD003688 12137706 10: Cochrane Database Syst Rev. 2002;(2):CD003680 12076500 11: Ann Intern Med. 2003 Jun 3;138(11):I60 12779314 12: Ann Intern Med. 2003 Jun 3;138(11):907-16 12779301 > > ## parsing bibtex file. > refs.bib <- system.file( + "extdata", "namedCapture-refs.bib", package="nc") > refs.vec <- readLines(refs.bib) > at.lines <- grep("@", refs.vec, value=TRUE) > str(at.lines) chr [1:24] " @Manual{namedCapture," " @Manual{TRE," " @Manual{re2r," ... > refs.dt <- nc::capture_all_str( + refs.vec, + "@", + type="[^{]+", + "[{]", + ref="[^,]+", + ",\n", + fields="(?:.*\n)+?.*", + "[}]\\s*(?:$|\n)") > str(refs.dt) Classes ‘data.table’ and 'data.frame': 24 obs. of 3 variables: $ type : chr "Manual" "Manual" "Manual" "Manual" ... $ ref : chr "namedCapture" "TRE" "re2r" "rematch2" ... $ fields: chr " title = {namedCapture: Named Capture Regular Expressions},\n author = {Toby Dylan Hocking},\n year = "| __truncated__ " title = {TRE: The free and portable approximate regex matching library},\n author = {Ville Laurikari},\n"| __truncated__ " title = {re2r: RE2 Regular Expression},\n author = {Qin Wenfeng},\n year = {2017},\n note = {R pac"| __truncated__ " title = {rematch2: Tidy Output from Regular Expression Matching},\n author = {Gábor Csárdi},\n year ="| __truncated__ ... - attr(*, ".internal.selfref")=<pointer: 0x56302a20f070> > > ## parsing each field of each entry. > eq.lines <- grep("=", refs.vec, value=TRUE) > str(eq.lines) chr [1:140] " title = {namedCapture: Named Capture Regular Expressions}," ... > strip <- function(x)sub("^\\s*\\{*", "", sub("\\}*,?$", "", x)) > refs.fields <- refs.dt[, nc::capture_all_str( + fields, + "\\s+", + variable="\\S+", + "\\s+=", + value=".*", strip), + by=.(type, ref)] > str(refs.fields) Classes ‘data.table’ and 'data.frame': 140 obs. 
of 4 variables: $ type : chr "Manual" "Manual" "Manual" "Manual" ... $ ref : chr "namedCapture" "namedCapture" "namedCapture" "namedCapture" ... $ variable: chr "title" "author" "year" "note" ... $ value : chr "namedCapture: Named Capture Regular Expressions" "Toby Dylan Hocking" "2019" "R package version 2019.01.14" ... - attr(*, ".internal.selfref")=<pointer: 0x56302a20f070> > with(refs.fields[ref=="HockingUseR2011"], structure( + as.list(value), names=variable)) $author [1] "Toby Dylan Hocking" $title [1] "Fast, named capture regular expressions in R 2.14" $year [1] "2011" $url [1] "http://web.warwick.ac.uk/statsdept/user-2011/TalkSlides/Lightening/2-StatisticsAndProg\\_3-Hocking.pdf" $booktitle [1] "useR 2011 conference proceedings" > ## the URL of my talk is now > ## https://user2011.r-project.org/TalkSlides/Lightening/2-StatisticsAndProg_3-Hocking.pdf > > if(!grepl("solaris", R.version$platform)){#To avoid CRAN check error on solaris + ## Parsing wikimedia tables: each begins with {| and ends with |}. + emoji.txt.gz <- system.file( + "extdata", "wikipedia-emoji-text.txt.gz", package="nc") + tables <- nc::capture_all_str( + emoji.txt.gz, + "\n[{][|]", + first=".*", + '\n[|][+] style="', + nc::field("font-size", ":", '.*?'), + '" [|] ', + title=".*", + lines="(?:\n.*)*?", + "\n[|][}]") + str(tables) + ## Rows are separated by |- + rows.dt <- tables[, { + row.vec <- strsplit(lines, "|-", fixed=TRUE)[[1]][-1] + .(row.i=seq_along(row.vec), row=row.vec) + }, by=title] + str(rows.dt) + ## Try to parse columns from each row. Doesn't work for second table + ## https://en.wikipedia.org/w/index.php?title=Emoji&oldid=920745513#Skin_color + ## because some entries have rowspan=2. 
+ contents.dt <- rows.dt[, nc::capture_all_str( + row, + "[|] ", + content=".*?", + "(?: [|]|\n|$)"), + by=.(title, row.i)] + contents.dt[, .(cols=.N), by=.(title, row.i)] + ## Make data table from + ## https://en.wikipedia.org/w/index.php?title=Emoji&oldid=920745513#Emoji_versus_text_presentation + contents.dt[, col.i := 1:.N, by=.(title, row.i)] + data.table::dcast( + contents.dt[title=="Sample emoji variation sequences"], + row.i ~ col.i, + value.var="content") + } Classes ‘data.table’ and 'data.frame': 2 obs. of 4 variables: $ first : chr " border=\"1\" cellspacing=\"0\" cellpadding=\"5\" class=\"wikitable nounderlines\" style=\"border-collapse:coll"| __truncated__ " border=\"1\" cellspacing=\"0\" cellpadding=\"5\" class=\"wikitable nounderlines\" style=\"border-collapse:coll"| __truncated__ $ font-size: chr " 67%" "small" $ title : chr "Sample emoji variation sequences" "Sample use of Fitzpatrick modifiers" $ lines : chr "\n|- style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:right\" | U+ || 2139 || 23"| __truncated__ "\n|-style=\"background:#F8F8F8;font-size:67%\"\n! scope=\"col\" colspan=\"2\" style=\"text-align:left\" | Code "| __truncated__ - attr(*, ".internal.selfref")=<pointer: 0x56302a20f070> Classes ‘data.table’ and 'data.frame': 19 obs. of 3 variables: $ title: chr "Sample emoji variation sequences" "Sample emoji variation sequences" "Sample emoji variation sequences" "Sample emoji variation sequences" ... $ row.i: int 1 2 3 4 5 6 1 2 3 4 ... $ row : chr " style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:right\" | U+ || 2139 || 231B |"| __truncated__ " style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:left\" | default&nbsp;presenta"| __truncated__ "\n! scope=\"col\" style=\"background:#F8F8F8;font-size: 67%;text-align:left\" | base&nbsp;code&nbsp;point\n| ℹ "| __truncated__ "\n! 
 scope=\"col\" style=\"background:#F8F8F8;font-size: 67%;text-align:left\" | base+VS15 (text)\n| {{emoji pre"| __truncated__ ...
 - attr(*, ".internal.selfref")=<pointer: 0x56302a20f070>
Error in `[.data.table`(contents.dt, , `:=`(col.i, 1:.N), by = .(title, :
  attempt access index 3/3 in VECTOR_ELT
Calls: [ -> [.data.table
Execution halted
Flavor: r-devel-linux-x86_64-debian-gcc

Version: 2025.3.24
Check: re-building of vignette outputs
Result: ERROR
    Error(s) in re-building vignettes:
     ...
    --- re-building ‘v0-overview.Rmd’ using rmarkdown
    [WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
    --- finished re-building ‘v0-overview.Rmd’
    --- re-building ‘v1-capture-first.Rmd’ using rmarkdown
    [WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
    --- finished re-building ‘v1-capture-first.Rmd’
    --- re-building ‘v2-capture-all.Rmd’ using rmarkdown
    [WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
    --- finished re-building ‘v2-capture-all.Rmd’
    --- re-building ‘v3-capture-melt.Rmd’ using rmarkdown
    [WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
    --- finished re-building ‘v3-capture-melt.Rmd’
    --- re-building ‘v4-comparisons.Rmd’ using rmarkdown
    [WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
    --- finished re-building ‘v4-comparisons.Rmd’
    --- re-building ‘v5-helpers.Rmd’ using rmarkdown
    [WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
    --- finished re-building ‘v5-helpers.Rmd’
    --- re-building ‘v6-engines.Rmd’ using rmarkdown
    [WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
    --- finished re-building ‘v6-engines.Rmd’
    --- re-building ‘v7-capture-glob.Rmd’ using rmarkdown
    Quitting from v7-capture-glob.Rmd:257-272 [unnamed-chunk-18]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    <error/rlang_error>
    Error in `[.data.table`:
    ! attempt access index 6/6 in VECTOR_ELT
    ---
    Backtrace:
        ▆
     1. ├─...[]
     2. └─data.table:::`[.data.table`(...)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Error: processing vignette 'v7-capture-glob.Rmd' failed with diagnostics:
    attempt access index 6/6 in VECTOR_ELT
    --- failed re-building ‘v7-capture-glob.Rmd’
    SUMMARY: processing the following file failed:
      ‘v7-capture-glob.Rmd’
    Error: Vignette re-building failed.
    Execution halted
Flavor: r-devel-linux-x86_64-debian-gcc

Version: 2025.3.24
Check: examples
Result: ERROR Running examples in ‘nc-Ex.R’ failed The error most likely occurred in: > ### Name: capture_all_str > ### Title: Capture all matches in a single subject string > ### Aliases: capture_all_str > > ### ** Examples > > > data.table::setDTthreads(1) > > chr.pos.vec <- c( + "chr10:213,054,000-213,055,000", + "chrM:111,000-222,000", + "this will not match", + NA, # neither will this. + "chr1:110-111 chr2:220-222") # two possible matches. > keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x)) > ## By default elements of subject are treated as separate lines (and > ## NAs are removed). Named arguments are used to create capture > ## groups, and conversion functions such as keep.digits are used to > ## convert the previously named group. > int.pattern <- list("[0-9,]+", keep.digits) > (match.dt <- nc::capture_all_str( + chr.pos.vec, + chrom="chr.*?", + ":", + chromStart=int.pattern, + "-", + chromEnd=int.pattern)) chrom chromStart chromEnd <char> <int> <int> 1: chr10 213054000 213055000 2: chrM 111000 222000 3: chr1 110 111 4: chr2 220 222 > str(match.dt) Classes ‘data.table’ and 'data.frame': 4 obs. of 3 variables: $ chrom : chr "chr10" "chrM" "chr1" "chr2" $ chromStart: int 213054000 111000 110 220 $ chromEnd : int 213055000 222000 111 222 - attr(*, ".internal.selfref")=<pointer: 0x2921e210> > > ## Extract all fields from each alignment block, using two regex > ## patterns, then dcast. 
> info.txt.gz <- system.file( + "extdata", "SweeD_Info.txt.gz", package="nc") > info.vec <- readLines(info.txt.gz) > info.vec[24:40] [1] " Alignment 1" "" [3] "\t\tChromosome:\t\tscaffold_0" "\t\tSequences:\t\t14" [5] "\t\tSites:\t\t\t1670366" "\t\tDiscarded sites:\t1264068" [7] "" "\t\tProcessing:\t\t155.53 seconds" [9] "" "\t\tPosition:\t\t8.936200e+07" [11] "\t\tLikelihood:\t\t4.105582e+02" "\t\tAlpha:\t\t\t6.616326e-06" [13] "" "" [15] " Alignment 2" "" [17] "\t\tChromosome:\t\tscaffold_1" > info.dt <- nc::capture_all_str( + sub("Alignment ", "//", info.vec), + "//", + alignment="[0-9]+", + fields="[^/]+") > (fields.dt <- info.dt[, nc::capture_all_str( + fields, + "\t+", + variable="[^:]+", + ":\t*", + value=".*"), + by=alignment]) alignment variable value <char> <char> <char> 1: 1 Chromosome scaffold_0 2: 1 Sequences 14 3: 1 Sites 1670366 4: 1 Discarded sites 1264068 5: 1 Processing 155.53 seconds 6: 1 Position 8.936200e+07 7: 1 Likelihood 4.105582e+02 8: 1 Alpha 6.616326e-06 9: 2 Chromosome scaffold_1 10: 2 Sequences 14 11: 2 Sites 1447008 12: 2 Discarded sites 1093595 13: 2 Processing 138.83 seconds 14: 2 Position 8.722482e+07 15: 2 Likelihood 2.531514e+02 16: 2 Alpha 1.031963e-05 17: 3 Chromosome scaffold_2 18: 3 Sequences 14 19: 3 Sites 1379975 20: 3 Discarded sites 1043204 21: 3 Processing 134.50 seconds 22: 3 Position 8.461182e+07 23: 3 Likelihood 2.945708e+02 24: 3 Alpha 8.684652e-06 25: 4 Chromosome scaffold_3 26: 4 Sequences 14 27: 4 Sites 1293978 28: 4 Discarded sites 988465 29: 4 Processing 120.76 seconds 30: 4 Position 4.182126e+07 31: 4 Likelihood 6.110444e+02 32: 4 Alpha 3.335514e-06 33: 5 Chromosome scaffold_4 34: 5 Sequences 14 35: 5 Sites 1319920 36: 5 Discarded sites 1011446 37: 5 Processing 126.99 seconds 38: 5 Position 6.978721e+07 39: 5 Likelihood 2.884914e+02 40: 5 Alpha 1.062780e-05 41: 6 Chromosome scaffold_5 42: 6 Sequences 14 43: 6 Sites 1295460 44: 6 Discarded sites 990655 45: 6 Processing 119.64 seconds 46: 6 Position 8.837822e+07 
47: 6 Likelihood 3.304343e+02 48: 6 Alpha 7.572795e-06 49: 7 Chromosome scaffold_6 50: 7 Sequences 14 51: 7 Sites 1197964 52: 7 Discarded sites 908454 53: 7 Processing 115.17 seconds 54: 7 Position 3.444713e+07 55: 7 Likelihood 3.261829e+02 56: 7 Alpha 3.427719e-06 57: 8 Chromosome scaffold_7 58: 8 Sequences 14 59: 8 Sites 1315248 60: 8 Discarded sites 998530 61: 8 Processing 125.20 seconds 62: 8 Position 2.337819e+07 63: 8 Likelihood 4.023517e+02 64: 8 Alpha 5.350802e-06 65: 9 Chromosome scaffold_8 66: 9 Sequences 14 67: 9 Sites 1110658 68: 9 Discarded sites 845039 69: 9 Processing 109.15 seconds 70: 9 Position 8.152571e+07 71: 9 Likelihood 3.114815e+02 72: 9 Alpha 3.899136e-06 73: 10 Chromosome scaffold_9 74: 10 Sequences 14 75: 10 Sites 1091036 76: 10 Discarded sites 833765 77: 10 Processing 104.91 seconds 78: 10 Position 2.669453e+07 79: 10 Likelihood 1.829336e+02 80: 10 Alpha 8.380941e-06 alignment variable value > (fields.wide <- data.table::dcast(fields.dt, alignment ~ variable)) Key: <alignment> alignment Alpha Chromosome Discarded sites Likelihood Position <char> <char> <char> <char> <char> <char> 1: 1 6.616326e-06 scaffold_0 1264068 4.105582e+02 8.936200e+07 2: 10 8.380941e-06 scaffold_9 833765 1.829336e+02 2.669453e+07 3: 2 1.031963e-05 scaffold_1 1093595 2.531514e+02 8.722482e+07 4: 3 8.684652e-06 scaffold_2 1043204 2.945708e+02 8.461182e+07 5: 4 3.335514e-06 scaffold_3 988465 6.110444e+02 4.182126e+07 6: 5 1.062780e-05 scaffold_4 1011446 2.884914e+02 6.978721e+07 7: 6 7.572795e-06 scaffold_5 990655 3.304343e+02 8.837822e+07 8: 7 3.427719e-06 scaffold_6 908454 3.261829e+02 3.444713e+07 9: 8 5.350802e-06 scaffold_7 998530 4.023517e+02 2.337819e+07 10: 9 3.899136e-06 scaffold_8 845039 3.114815e+02 8.152571e+07 Processing Sequences Sites <char> <char> <char> 1: 155.53 seconds 14 1670366 2: 104.91 seconds 14 1091036 3: 138.83 seconds 14 1447008 4: 134.50 seconds 14 1379975 5: 120.76 seconds 14 1293978 6: 126.99 seconds 14 1319920 7: 119.64 seconds 14 
1295460 8: 115.17 seconds 14 1197964 9: 125.20 seconds 14 1315248 10: 109.15 seconds 14 1110658 > > ## Capture all csv tables in report -- the file name can be given as > ## the subject to nc::capture_all_str, which calls readLines to get > ## data to parse. > (report.txt.gz <- system.file( + "extdata", "SweeD_Report.txt.gz", package="nc")) [1] "/data/gannet/ripley/R/packages/tests-devel/nc.Rcheck/nc/extdata/SweeD_Report.txt.gz" > (report.dt <- nc::capture_all_str( + report.txt.gz, + "//", + alignment="[0-9]+", + "\n", + csv="[^/]+" + )[, { + data.table::fread(text=csv) + }, by=alignment]) alignment Position Likelihood Alpha <char> <num> <num> <num> 1: 1 700.0 4.637328e-03 2.763840e+02 2: 1 130585.6 3.781283e-01 8.490200e-04 3: 1 260471.2 3.602315e-02 4.691340e-03 4: 1 390356.9 7.618749e-01 5.377668e-04 5: 1 520242.5 2.979971e-08 1.411765e-01 --- 9996: 10 82991564.8 8.051006e-03 1.357819e-03 9997: 10 83074967.8 7.048433e-03 1.825764e-03 9998: 10 83158370.8 1.012360e-07 7.999999e-03 9999: 10 83241773.8 3.977189e-08 9.999997e-01 10000: 10 83325174.0 3.980538e-08 1.200000e+03 > > ## Join report with info fields. 
> report.dt[fields.wide, on=.(alignment)] alignment Position Likelihood Alpha i.Alpha Chromosome <char> <num> <num> <num> <char> <char> 1: 1 700.0 4.637328e-03 2.763840e+02 6.616326e-06 scaffold_0 2: 1 130585.6 3.781283e-01 8.490200e-04 6.616326e-06 scaffold_0 3: 1 260471.2 3.602315e-02 4.691340e-03 6.616326e-06 scaffold_0 4: 1 390356.9 7.618749e-01 5.377668e-04 6.616326e-06 scaffold_0 5: 1 520242.5 2.979971e-08 1.411765e-01 6.616326e-06 scaffold_0 --- 9996: 9 85297670.3 1.078915e-01 1.730811e-02 3.899136e-06 scaffold_8 9997: 9 85383396.6 2.282976e-02 2.002634e-02 3.899136e-06 scaffold_8 9998: 9 85469122.8 1.573487e+00 1.169200e-03 3.899136e-06 scaffold_8 9999: 9 85554849.1 6.892966e-02 5.344763e-03 3.899136e-06 scaffold_8 10000: 9 85640578.0 0.000000e+00 1.200000e+03 3.899136e-06 scaffold_8 Discarded sites i.Likelihood i.Position Processing Sequences <char> <char> <char> <char> <char> 1: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 2: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 3: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 4: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 5: 1264068 4.105582e+02 8.936200e+07 155.53 seconds 14 --- 9996: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 9997: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 9998: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 9999: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 10000: 845039 3.114815e+02 8.152571e+07 109.15 seconds 14 Sites <char> 1: 1670366 2: 1670366 3: 1670366 4: 1670366 5: 1670366 --- 9996: 1110658 9997: 1110658 9998: 1110658 9999: 1110658 10000: 1110658 > > ## parsing nbib citation file. 
> (pmc.nbib <- system.file( + "extdata", "PMC3045577.nbib", package="nc")) [1] "/data/gannet/ripley/R/packages/tests-devel/nc.Rcheck/nc/extdata/PMC3045577.nbib" > blank <- "\n " > pmc.dt <- nc::capture_all_str( + pmc.nbib, + Abbreviation="[A-Z]+", + " *- ", + value=list( + ".*", + list(blank, ".*"), "*"), + function(x)sub(blank, "", x)) > str(pmc.dt) Classes ‘data.table’ and 'data.frame': 50 obs. of 2 variables: $ Abbreviation: chr "PMID" "OWN" "STAT" "DCOM" ... $ value : chr "21113027" "NLM" "MEDLINE" "20110512" ... - attr(*, ".internal.selfref")=<pointer: 0x2921e210> > > ## What do the variable fields mean? It is explained on > ## https://www.nlm.nih.gov/bsd/mms/medlineelements.html which has a > ## local copy in this package (downloaded 18 Sep 2019). > fields.html <- system.file( + "extdata", "MEDLINE_Fields.html", package="nc") > if(interactive())browseURL(fields.html) > fields.vec <- readLines(fields.html) > > ## It is pretty easy to capture fields and abbreviations if gsub > ## used to remove some tags first. > no.strong <- gsub("</?strong>", "", fields.vec) > no.comments <- gsub("<!--.*?-->", "", no.strong) > ## grep then capture_first_vec can be used if each desired row in > ## the output comes from a single line of the input file. 
> (h3.vec <- grep("<h3", no.comments, value=TRUE)) [1] "<h3><a id=\"ab\" name=\"ab\"></a>Abstract (AB)</h3>" [2] "<h3><a id=\"ci\" name=\"ci\"></a>Copyright Information (CI)</h3>" [3] "<h3><a id=\"ad\" name=\"ad\"></a>Affiliation (AD)</h3>" [4] "<h3><a id=\"irad\" name=\"irad\"></a>Investigator Affiliation (IRAD)</h3>" [5] "<h3><a id=\"aid\" name=\"aid\"></a>Article Identifier (AID)</h3>" [6] "<h3><a id=\"au\" name=\"au\"></a>Author (AU)</h3>" [7] "<h3><a id=\"auid\" name=\"auid\"></a>Author Identifier (AUID)</h3>" [8] "<h3><a id=\"fau\" name=\"fau\"></a>Full Author (FAU)</h3>" [9] "<h3><a id=\"cc2\" name=\"bti\"></a>Book Title (BTI)</h3>" [10] "<h3><a id=\"cc4\" name=\"cti\"></a>Collection Title (CTI)</h3>" [11] "<h3><a id=\"cc\" name=\"cc\"></a>Comments/Corrections (See fields and field tags listed below.)</h3>" [12] "<h3><a id=\"coi\" name=\"coi\"></a>Conflict of Interest Statement (COIS)</h3>" [13] "<h3><a id=\"cn\" name=\"cn\"></a>Corporate Author (CN)</h3>" [14] "<h3><a id=\"dcom2\" name=\"crdt\"></a>Create Date (CRDT)</h3>" [15] "<h3><a id=\"dcom\" name=\"dcom\"></a>Date Completed (DCOM)</h3>" [16] "<h3><a id=\"da\" name=\"da\"></a>Date Created (DA)</h3>" [17] "<h3><a id=\"lr\" name=\"lr\"></a>Date Last Revised (LR)</h3>" [18] "<h3><a id=\"dep\" name=\"dep\"></a>Date of Electronic Publication (DEP)</h3>" [19] "<h3><a id=\"dp\" name=\"dp\"></a>Date of Publication (DP)</h3>" [20] "<h3><a id=\"edat2\" name=\"ed\"></a>Editor (ED) and Full Editor Name (FED)</h3>" [21] "<h3><a id=\"edat3\" name=\"en\"></a>Edition (EN)</h3>" [22] "<h3><a id=\"edat\" name=\"edat\"></a>Entrez Date (EDAT)</h3>" [23] "<h3><a id=\"gs\" name=\"gs\"></a>Gene Symbol (GS): not currently input</h3>" [24] "<h3><a id=\"gn\" name=\"gn\"></a>General Note (GN)</h3>" [25] "<h3><a id=\"gr\" name=\"gr\"></a>Grant Number (GR)</h3>" [26] "<h3><a id=\"ir\" name=\"ir\"></a>Investigator Name (IR) and Full Investigator Name (FIR)</h3>" [27] "<h3><a id=\"is2\" name=\"isbn\"></a>ISBN (ISBN)</h3>" [28] 
"<h3><a id=\"is\" name=\"is\"></a>ISSN (IS)</h3>" [29] "<h3><a id=\"ip\" name=\"ip\"></a>Issue (IP)</h3>" [30] "<h3><a id=\"ta\" name=\"ta\"></a>Journal Title Abbreviation (TA)</h3>" [31] "<h3><a id=\"jt\" name=\"jt\"></a>Journal Title (JT)</h3>" [32] "<h3><a id=\"la\" name=\"la\"></a>Language (LA)</h3>" [33] "<h3><a id=\"la3\" name=\"lid\"></a>Location Identifier (LID)</h3>" [34] "<h3><a id=\"la2\" name=\"mid\"></a>Manuscript Identifier (MID)</h3>" [35] "<h3><a id=\"mhda\" name=\"mhda\"></a>MeSH Date (MHDA)</h3>" [36] "<h3><a id=\"mh\" name=\"mh\"></a>MeSH Terms (MH)</h3>" [37] "<h3><a id=\"jid\" name=\"jid\"></a>NLM Unique ID (JID)</h3>" [38] "<h3><a id=\"rf\" name=\"rf\"></a>Number of References (RF)</h3>" [39] "<h3><a id=\"oab\" name=\"oab\"></a>Other Abstract (OAB)</h3>" [40] "<h3><a id=\"oci\" name=\"oci\"></a>Other Copyright Information (OCI)</h3>" [41] "<h3><a id=\"oid\" name=\"oid\"></a>Other ID (OID)</h3>" [42] "<h3><a id=\"ot\" name=\"ot\"></a>Other Term (OT)</h3>" [43] "<h3><a id=\"oto\" name=\"oto\"></a>Other Term Owner (OTO)</h3>" [44] "<h3><a id=\"own\" name=\"own\"></a>Owner (OWN)</h3>" [45] "<h3><a id=\"pg\" name=\"pg\"></a>Pagination (PG)</h3>" [46] "<h3><a id=\"ps\" name=\"ps\"></a>Personal Name as Subject (PS)</h3>" [47] "<h3><a id=\"fps\" name=\"fps\"></a>Full Personal Name as Subject (FPS)</h3>" [48] "<h3><a id=\"pl\" name=\"pl\"></a>Place of Publication (PL)</h3>" [49] "<h3><a id=\"phst\" name=\"phst\"></a>Publication History Status (PHST)</h3>" [50] "<h3><a id=\"pst\" name=\"pst\"></a>Publication Status (PST)</h3>" [51] "<h3><a id=\"pt\" name=\"pt\"></a>Publication Type (PT)</h3>" [52] "<h3><a id=\"pubm\" name=\"pubm\"></a>Publishing Model (PUBM)</h3>" [53] "<h3><a id=\"pmid2\" name=\"pmc\"></a>PubMed Central Identifer (PMC)</h3>" [54] "<h3><a id=\"pmid3\" name=\"pmcr\"></a>PubMed Central Release (PMCR)</h3>" [55] "<h3><a id=\"pmid\" name=\"pmid\"></a>PubMed Unique Identifier (PMID)</h3>" [56] "<h3><a id=\"rn\" name=\"rn\"></a>Registry 
Number/EC Number (RN)</h3>" [57] "<h3><a id=\"nm\" name=\"nm\"></a>Substance Name (NM)</h3>" [58] "<h3><a id=\"si\" name=\"si\"></a>Secondary Source ID (SI)</h3>" [59] "<h3><a id=\"so\" name=\"so\"></a>Source (SO)</h3>" [60] "<h3><a id=\"sfm\" name=\"sfm\"></a>Space Flight Mission (SFM)</h3>" [61] "<h3><a id=\"stat\" name=\"stat\"></a>Status (STAT)</h3>" [62] "<h3><a id=\"sb\" name=\"sb\"></a>Subset (SB)</h3>" [63] "<h3><a id=\"ti\" name=\"ti\"></a>Title (TI)</h3>" [64] "<h3><a id=\"tt\" name=\"tt\"></a>Transliterated Title (TT)</h3>" [65] "<h3><a id=\"vi\" name=\"vi\"></a>Volume (VI)</h3>" [66] "<h3><a id=\"cc3\" name=\"vti\"></a>Volume Title (VTI)</h3>" > h3.pattern <- list( + nc::field("name", '="', '[^"]+'), + '"></a>', + fields.abbrevs="[^<]+") > first.fields.dt <- nc::capture_first_vec( + h3.vec, h3.pattern) > field.abbrev.pattern <- list( + Field=".*?", + " \\(", + Abbreviation="[^)]+", + "\\)", + "(?: and |$)?") > (first.each.field <- first.fields.dt[, nc::capture_all_str( + fields.abbrevs, field.abbrev.pattern), + by=fields.abbrevs]) fields.abbrevs <char> 1: Abstract (AB) 2: Copyright Information (CI) 3: Affiliation (AD) 4: Investigator Affiliation (IRAD) 5: Article Identifier (AID) 6: Author (AU) 7: Author Identifier (AUID) 8: Full Author (FAU) 9: Book Title (BTI) 10: Collection Title (CTI) 11: Comments/Corrections (See fields and field tags listed below.) 
12: Conflict of Interest Statement (COIS) 13: Corporate Author (CN) 14: Create Date (CRDT) 15: Date Completed (DCOM) 16: Date Created (DA) 17: Date Last Revised (LR) 18: Date of Electronic Publication (DEP) 19: Date of Publication (DP) 20: Editor (ED) and Full Editor Name (FED) 21: Editor (ED) and Full Editor Name (FED) 22: Edition (EN) 23: Entrez Date (EDAT) 24: Gene Symbol (GS): not currently input 25: General Note (GN) 26: Grant Number (GR) 27: Investigator Name (IR) and Full Investigator Name (FIR) 28: Investigator Name (IR) and Full Investigator Name (FIR) 29: ISBN (ISBN) 30: ISSN (IS) 31: Issue (IP) 32: Journal Title Abbreviation (TA) 33: Journal Title (JT) 34: Language (LA) 35: Location Identifier (LID) 36: Manuscript Identifier (MID) 37: MeSH Date (MHDA) 38: MeSH Terms (MH) 39: NLM Unique ID (JID) 40: Number of References (RF) 41: Other Abstract (OAB) 42: Other Copyright Information (OCI) 43: Other ID (OID) 44: Other Term (OT) 45: Other Term Owner (OTO) 46: Owner (OWN) 47: Pagination (PG) 48: Personal Name as Subject (PS) 49: Full Personal Name as Subject (FPS) 50: Place of Publication (PL) 51: Publication History Status (PHST) 52: Publication Status (PST) 53: Publication Type (PT) 54: Publishing Model (PUBM) 55: PubMed Central Identifer (PMC) 56: PubMed Central Release (PMCR) 57: PubMed Unique Identifier (PMID) 58: Registry Number/EC Number (RN) 59: Substance Name (NM) 60: Secondary Source ID (SI) 61: Source (SO) 62: Space Flight Mission (SFM) 63: Status (STAT) 64: Subset (SB) 65: Title (TI) 66: Transliterated Title (TT) 67: Volume (VI) 68: Volume Title (VTI) fields.abbrevs Field Abbreviation <char> <char> 1: Abstract AB 2: Copyright Information CI 3: Affiliation AD 4: Investigator Affiliation IRAD 5: Article Identifier AID 6: Author AU 7: Author Identifier AUID 8: Full Author FAU 9: Book Title BTI 10: Collection Title CTI 11: Comments/Corrections See fields and field tags listed below. 
12: Conflict of Interest Statement COIS 13: Corporate Author CN 14: Create Date CRDT 15: Date Completed DCOM 16: Date Created DA 17: Date Last Revised LR 18: Date of Electronic Publication DEP 19: Date of Publication DP 20: Editor ED 21: Full Editor Name FED 22: Edition EN 23: Entrez Date EDAT 24: Gene Symbol GS 25: General Note GN 26: Grant Number GR 27: Investigator Name IR 28: Full Investigator Name FIR 29: ISBN ISBN 30: ISSN IS 31: Issue IP 32: Journal Title Abbreviation TA 33: Journal Title JT 34: Language LA 35: Location Identifier LID 36: Manuscript Identifier MID 37: MeSH Date MHDA 38: MeSH Terms MH 39: NLM Unique ID JID 40: Number of References RF 41: Other Abstract OAB 42: Other Copyright Information OCI 43: Other ID OID 44: Other Term OT 45: Other Term Owner OTO 46: Owner OWN 47: Pagination PG 48: Personal Name as Subject PS 49: Full Personal Name as Subject FPS 50: Place of Publication PL 51: Publication History Status PHST 52: Publication Status PST 53: Publication Type PT 54: Publishing Model PUBM 55: PubMed Central Identifer PMC 56: PubMed Central Release PMCR 57: PubMed Unique Identifier PMID 58: Registry Number/EC Number RN 59: Substance Name NM 60: Secondary Source ID SI 61: Source SO 62: Space Flight Mission SFM 63: Status STAT 64: Subset SB 65: Title TI 66: Transliterated Title TT 67: Volume VI 68: Volume Title VTI Field Abbreviation > > ## If we want to capture the information after the initial h3 line > ## of the input, e.g. the rest column below which contains a > ## description/example for each field, then capture_all_str can be > ## used on the full input file. > h3.fields.dt <- nc::capture_all_str( + no.comments, + h3.pattern, + '</h3>\n', + rest="(?:.*\n)+?", #exercise: get the examples. 
+ "<hr />\n") > (h3.each.field <- h3.fields.dt[, nc::capture_all_str( + fields.abbrevs, field.abbrev.pattern), + by=fields.abbrevs]) fields.abbrevs <char> 1: Abstract (AB) 2: Copyright Information (CI) 3: Affiliation (AD) 4: Investigator Affiliation (IRAD) 5: Article Identifier (AID) 6: Author (AU) 7: Author Identifier (AUID) 8: Full Author (FAU) 9: Book Title (BTI) 10: Collection Title (CTI) 11: Comments/Corrections (See fields and field tags listed below.) 12: Conflict of Interest Statement (COIS) 13: Corporate Author (CN) 14: Create Date (CRDT) 15: Date Completed (DCOM) 16: Date Created (DA) 17: Date Last Revised (LR) 18: Date of Electronic Publication (DEP) 19: Date of Publication (DP) 20: Editor (ED) and Full Editor Name (FED) 21: Editor (ED) and Full Editor Name (FED) 22: Edition (EN) 23: Entrez Date (EDAT) 24: Gene Symbol (GS): not currently input 25: General Note (GN) 26: Grant Number (GR) 27: Investigator Name (IR) and Full Investigator Name (FIR) 28: Investigator Name (IR) and Full Investigator Name (FIR) 29: ISBN (ISBN) 30: ISSN (IS) 31: Issue (IP) 32: Journal Title Abbreviation (TA) 33: Journal Title (JT) 34: Language (LA) 35: Location Identifier (LID) 36: Manuscript Identifier (MID) 37: MeSH Date (MHDA) 38: MeSH Terms (MH) 39: NLM Unique ID (JID) 40: Number of References (RF) 41: Other Abstract (OAB) 42: Other Copyright Information (OCI) 43: Other ID (OID) 44: Other Term (OT) 45: Other Term Owner (OTO) 46: Owner (OWN) 47: Pagination (PG) 48: Personal Name as Subject (PS) 49: Full Personal Name as Subject (FPS) 50: Place of Publication (PL) 51: Publication History Status (PHST) 52: Publication Status (PST) 53: Publication Type (PT) 54: Publishing Model (PUBM) 55: PubMed Central Identifer (PMC) 56: PubMed Central Release (PMCR) 57: PubMed Unique Identifier (PMID) 58: Registry Number/EC Number (RN) 59: Substance Name (NM) 60: Secondary Source ID (SI) 61: Source (SO) 62: Space Flight Mission (SFM) 63: Status (STAT) 64: Subset (SB) 65: Title (TI) 66: 
Transliterated Title (TT) 67: Volume (VI) 68: Volume Title (VTI) fields.abbrevs Field Abbreviation <char> <char> 1: Abstract AB 2: Copyright Information CI 3: Affiliation AD 4: Investigator Affiliation IRAD 5: Article Identifier AID 6: Author AU 7: Author Identifier AUID 8: Full Author FAU 9: Book Title BTI 10: Collection Title CTI 11: Comments/Corrections See fields and field tags listed below. 12: Conflict of Interest Statement COIS 13: Corporate Author CN 14: Create Date CRDT 15: Date Completed DCOM 16: Date Created DA 17: Date Last Revised LR 18: Date of Electronic Publication DEP 19: Date of Publication DP 20: Editor ED 21: Full Editor Name FED 22: Edition EN 23: Entrez Date EDAT 24: Gene Symbol GS 25: General Note GN 26: Grant Number GR 27: Investigator Name IR 28: Full Investigator Name FIR 29: ISBN ISBN 30: ISSN IS 31: Issue IP 32: Journal Title Abbreviation TA 33: Journal Title JT 34: Language LA 35: Location Identifier LID 36: Manuscript Identifier MID 37: MeSH Date MHDA 38: MeSH Terms MH 39: NLM Unique ID JID 40: Number of References RF 41: Other Abstract OAB 42: Other Copyright Information OCI 43: Other ID OID 44: Other Term OT 45: Other Term Owner OTO 46: Owner OWN 47: Pagination PG 48: Personal Name as Subject PS 49: Full Personal Name as Subject FPS 50: Place of Publication PL 51: Publication History Status PHST 52: Publication Status PST 53: Publication Type PT 54: Publishing Model PUBM 55: PubMed Central Identifer PMC 56: PubMed Central Release PMCR 57: PubMed Unique Identifier PMID 58: Registry Number/EC Number RN 59: Substance Name NM 60: Secondary Source ID SI 61: Source SO 62: Space Flight Mission SFM 63: Status STAT 64: Subset SB 65: Title TI 66: Transliterated Title TT 67: Volume VI 68: Volume Title VTI Field Abbreviation > > ## Either method of capturing abbreviations gives the same result. 
> identical(first.each.field, h3.each.field) [1] TRUE > > ## but the capture_all_str method returns the additional rest column > ## which contains data after the initial h3 line. > names(first.fields.dt) [1] "name" "fields.abbrevs" > names(h3.fields.dt) [1] "name" "fields.abbrevs" "rest" > cat(h3.fields.dt[fields.abbrevs=="Volume (VI)", rest]) <p>The volume number of the journal in which the article was published is recorded here.</p> <p class="examplekm">Examples:<br />VI - 7<br />VI - 5 Spec No<br />VI - 49 Suppl 20</p> <p>Some records (especially records from <a href="/databases/databases_oldmedline.html">OLDMEDLINE</a>) contain the Issue field but lack the Volume field; some contain the Volume field but lack the Issue field; and some records contain Volume and Issue data in the Volume element.</p> > > ## There are 66 Field rows across three tables. > a.href <- list('<a href=[^>]+>') > (td.vec <- fields.vec[240:280]) [1] "<td><a href=\"#ab\">Abstract</a></td>" [2] "<td><a href=\"#ab\">(AB)</a></td>" [3] "</tr>" [4] "<tr style=\"background-color: #cccccc;\">" [5] "<td><a href=\"#ci\">Copyright Information</a></td>" [6] "<td>" [7] "<div><a href=\"#ci\">(CI)</a></div>" [8] "</td>" [9] "</tr>" [10] "<tr>" [11] "<td><a href=\"#ad\">Affiliation</a></td>" [12] "<td>" [13] "<div><a href=\"#ad\">(AD)</a></div>" [14] "</td>" [15] "</tr>" [16] "<tr style=\"background-color: #cccccc;\">" [17] "<td><a href=\"#irad\">Investigator Affiliation</a></td>" [18] "<td>" [19] "<div><a href=\"#irad\">(IRAD)</a></div>" [20] "</td>" [21] "</tr>" [22] "<tr>" [23] "<td><a href=\"#aid\">Article Identifier</a></td>" [24] "<td>" [25] "<div><a href=\"#aid\">(AID)</a></div>" [26] "</td>" [27] "</tr>" [28] "<tr style=\"background-color: #cccccc;\">" [29] "<td><a href=\"#au\">Author</a></td>" [30] "<td>" [31] "<div><a href=\"#au\">(AU)</a></div>" [32] "</td>" [33] "</tr>" [34] "<tr>" [35] "<td><a href=\"#auid\">Author Identifier</a></td>" [36] "<td><a href=\"#auid\">(AUID)</a></td>" [37] "</tr>" 
[38] "<tr>" [39] "<td style=\"background-color: #cccccc;\"><a href=\"#fau\">Full Author</a></td>" [40] "<td style=\"background-color: #cccccc;\">" [41] "<div><a href=\"#fau\">(FAU)</a></div>" > fields.pattern <- list( + "<td.*?>", + a.href, + Fields="[^()<]+", + "</a></td>\n") > (td.only.Fields <- nc::capture_all_str(fields.vec, fields.pattern)) Fields <char> 1: Abstract 2: Copyright Information 3: Affiliation 4: Investigator Affiliation 5: Article Identifier 6: Author 7: Author Identifier 8: Full Author 9: Book Title 10: Collection Title 11: Comments/Corrections 12: Conflict of Interest Statement 13: Corporate Author 14: Create Date 15: Date Completed 16: Date Created 17: Date Last Revised 18: Date of Electronic Publication 19: Date of Publication 20: Edition 21: Editor and Full Editor Name 22: Entrez Date 23: Gene Symbol 24: General Note 25: Grant Number 26: Investigator Name and Full Investigator Name 27: ISBN 28: ISSN 29: Issue 30: Journal Title Abbreviation 31: Journal Title 32: Language 33: Location Identifier 34: Manuscript Identifier 35: MeSH Date 36: MeSH Terms 37: NLM Unique ID 38: Number of References 39: Other Abstract 40: Other Copyright Information 41: Other ID 42: Other Term 43: Other Term Owner 44: Owner 45: Pagination 46: Personal Name as Subject 47: Full Personal Name as Subject 48: Place of Publication 49: Publication History Status 50: Publication Status 51: Publication Type 52: Publishing Model 53: PubMed Central Identifier 54: PubMed Central Release 55: PubMed Unique Identifier 56: Registry Number/EC Number 57: Substance Name 58: Secondary Source ID 59: Source 60: Space Flight Mission 61: Status 62: Subset 63: Title 64: Transliterated Title 65: Volume 66: Volume Title Fields > > ## Extract Fields and Abbreviations. Careful: most fields have one > ## abbreviation, but one field has none, and two fields have two. 
> (td.fields.dt <- nc::capture_all_str( + fields.vec, + fields.pattern, + "<td[^>]*>", + "(?:\n<div>)?", + a.href, "?", + abbrevs=".*?", + "</")) Fields abbrevs <char> <char> 1: Abstract (AB) 2: Copyright Information (CI) 3: Affiliation (AD) 4: Investigator Affiliation (IRAD) 5: Article Identifier (AID) 6: Author (AU) 7: Author Identifier (AUID) 8: Full Author (FAU) 9: Book Title (BTI) 10: Collection Title (CTI) 11: Comments/Corrections &nbsp; 12: Conflict of Interest Statement (COIS) 13: Corporate Author (CN) 14: Create Date (CRDT) 15: Date Completed (DCOM) 16: Date Created (DA) 17: Date Last Revised (LR) 18: Date of Electronic Publication (DEP) 19: Date of Publication (DP) 20: Edition (EN) 21: Editor and Full Editor Name (ED)<br />(FED) 22: Entrez Date (EDAT) 23: Gene Symbol (GS) 24: General Note (GN) 25: Grant Number (GR) 26: Investigator Name and Full Investigator Name (IR) (FIR) 27: ISBN (ISBN) 28: ISSN (IS) 29: Issue (IP) 30: Journal Title Abbreviation (TA) 31: Journal Title (JT) 32: Language (LA) 33: Location Identifier (LID) 34: Manuscript Identifier (MID) 35: MeSH Date (MHDA) 36: MeSH Terms (MH) 37: NLM Unique ID (JID) 38: Number of References (RF) 39: Other Abstract (OAB) 40: Other Copyright Information (OCI) 41: Other ID (OID) 42: Other Term (OT) 43: Other Term Owner (OTO) 44: Owner (OWN) 45: Pagination (PG) 46: Personal Name as Subject (PS) 47: Full Personal Name as Subject (FPS) 48: Place of Publication (PL) 49: Publication History Status (PHST) 50: Publication Status (PST) 51: Publication Type (PT) 52: Publishing Model (PUBM) 53: PubMed Central Identifier (PMC) 54: PubMed Central Release (PMCR) 55: PubMed Unique Identifier (PMID) 56: Registry Number/EC Number (RN) 57: Substance Name (NM) 58: Secondary Source ID (SI) 59: Source (SO) 60: Space Flight Mission (SFM) 61: Status (STAT) 62: Subset (SB) 63: Title (TI) 64: Transliterated Title (TT) 65: Volume (VI) 66: Volume Title (VTI) Fields abbrevs > > ## Get each individual abbreviation from the previously 
captured td > ## data. > td.each.field <- td.fields.dt[, { + f <- nc::capture_all_str( + Fields, + Field=".*?", + "(?:$| and )") + a <- nc::capture_all_str( + abbrevs, + "\\(", + Abbreviation="[^)]+", + "\\)") + if(nrow(a)==0)list() else cbind(f, a) + }, by=Fields] > str(td.each.field) Classes ‘data.table’ and 'data.frame': 67 obs. of 3 variables: $ Fields : chr "Abstract" "Copyright Information" "Affiliation" "Investigator Affiliation" ... $ Field : chr "Abstract" "Copyright Information" "Affiliation" "Investigator Affiliation" ... $ Abbreviation: chr "AB" "CI" "AD" "IRAD" ... - attr(*, ".internal.selfref")=<pointer: 0x2921e210> > td.each.field[td.fields.dt, .( + count=.N + ), on=.(Fields), by=.EACHI][order(count)] Fields count <char> <int> 1: Comments/Corrections 0 2: Abstract 1 3: Copyright Information 1 4: Affiliation 1 5: Investigator Affiliation 1 6: Article Identifier 1 7: Author 1 8: Author Identifier 1 9: Full Author 1 10: Book Title 1 11: Collection Title 1 12: Conflict of Interest Statement 1 13: Corporate Author 1 14: Create Date 1 15: Date Completed 1 16: Date Created 1 17: Date Last Revised 1 18: Date of Electronic Publication 1 19: Date of Publication 1 20: Edition 1 21: Entrez Date 1 22: Gene Symbol 1 23: General Note 1 24: Grant Number 1 25: ISBN 1 26: ISSN 1 27: Issue 1 28: Journal Title Abbreviation 1 29: Journal Title 1 30: Language 1 31: Location Identifier 1 32: Manuscript Identifier 1 33: MeSH Date 1 34: MeSH Terms 1 35: NLM Unique ID 1 36: Number of References 1 37: Other Abstract 1 38: Other Copyright Information 1 39: Other ID 1 40: Other Term 1 41: Other Term Owner 1 42: Owner 1 43: Pagination 1 44: Personal Name as Subject 1 45: Full Personal Name as Subject 1 46: Place of Publication 1 47: Publication History Status 1 48: Publication Status 1 49: Publication Type 1 50: Publishing Model 1 51: PubMed Central Identifier 1 52: PubMed Central Release 1 53: PubMed Unique Identifier 1 54: Registry Number/EC Number 1 55: Substance Name 1 56: 
Secondary Source ID 1 57: Source 1 58: Space Flight Mission 1 59: Status 1 60: Subset 1 61: Title 1 62: Transliterated Title 1 63: Volume 1 64: Volume Title 1 65: Editor and Full Editor Name 2 66: Investigator Name and Full Investigator Name 2 Fields count > > ## There is a typo in the data captured from the h3 headings. > td.each.field[!Field %in% h3.each.field$Field] Fields Field Abbreviation <char> <char> <char> 1: PubMed Central Identifier PubMed Central Identifier PMC > h3.each.field[!Field %in% td.each.field$Field] fields.abbrevs <char> 1: Comments/Corrections (See fields and field tags listed below.) 2: PubMed Central Identifer (PMC) Field Abbreviation <char> <char> 1: Comments/Corrections See fields and field tags listed below. 2: PubMed Central Identifer PMC > > ## Abbreviations are consistent. > td.each.field[!Abbreviation %in% h3.each.field$Abbreviation] Empty data.table (0 rows and 3 cols): Fields,Field,Abbreviation > h3.each.field[!Abbreviation %in% td.each.field$Abbreviation] fields.abbrevs <char> 1: Comments/Corrections (See fields and field tags listed below.) Field Abbreviation <char> <char> 1: Comments/Corrections See fields and field tags listed below. > > ## There is a a table that provides a description of each comment > ## type. 
> (comment.vec <- fields.vec[840:860]) [1] "<tr>" [2] "<th><strong>Comment or Correction Type</strong></th>" [3] "<th><strong>MEDLINE Display Field Tag</strong></th>" [4] "<th><strong>Description</strong></th>" [5] "</tr>" [6] "<tr>" [7] "<td><strong>Comment in</strong></td>" [8] "<td><strong>(CIN)</strong></td>" [9] "<td>cites the reference containing a commentary about the article (appears on citation for original article); began use with journal issues published in 1989.</td>" [10] "</tr>" [11] "<tr>" [12] "<td><strong>Comment on</strong></td>" [13] "<td><strong>(CON)</strong></td>" [14] "<td>cites the reference upon which the article comments; began use with journal issues published in 1989.</td>" [15] "</tr>" [16] "<tr>" [17] "<td><strong>Erratum in</strong></td>" [18] "<td><strong>(EIN)</strong></td>" [19] "<td>cites a published erratum to the article (appears on citation for original article); began use in 1987.</td>" [20] "</tr>" [21] "<tr>" > comment.dt <- nc::capture_all_str( + fields.vec, + "<td><strong>", + Field="[^<]+", + "</strong></td>\n", + "<td><strong>\\(", + Abbreviation="[^)]+", + "\\)</strong></td>\n", + "<td>", + description=".*", + "</td>\n") > str(comment.dt) Classes ‘data.table’ and 'data.frame': 18 obs. of 3 variables: $ Field : chr "Comment in" "Comment on" "Erratum in" "Erratum for" ... $ Abbreviation: chr "CIN" "CON" "EIN" "EFR" ... $ description : chr "cites the reference containing a commentary about the article (appears on citation for original article); began"| __truncated__ "cites the reference upon which the article comments; began use with journal issues published in 1989." "cites a published erratum to the article (appears on citation for original article); began use in 1987." "cites the original article for which there is a published erratum. As of 2016, partial retractions are considered errata." ... 
- attr(*, ".internal.selfref")=<pointer: 0x2921e210> > > ## Join to original PMC citation file in order to see what the > ## abbreviations used in that file mean. > all.abbrevs <- rbind( + td.each.field[, .(Field, Abbreviation)], + comment.dt[, .(Field, Abbreviation)]) > all.abbrevs[pmc.dt, .( + Abbreviation, + Field, + value=substr(value, 1, 20) + ), on=.(Abbreviation)] Abbreviation Field value <char> <char> <char> 1: PMID PubMed Unique Identifier 21113027 2: OWN Owner NLM 3: STAT Status MEDLINE 4: DCOM Date Completed 20110512 5: LR Date Last Revised 20181113 6: IS ISSN 1362-4962 (Electroni 7: IS ISSN 0305-1048 (Print) 8: IS ISSN 0305-1048 (Linking) 9: VI Volume 39 10: IP Issue 4 11: DP Date of Publication 2011 Mar 12: TI Title A manually curated C 13: PG Pagination e25 14: LID Location Identifier 10.1093/nar/gkq1187 15: AB Abstract Chromatin immunoprec 16: FAU Full Author Rye, Morten Beck 17: AU Author Rye MB 18: AD Affiliation Department of Cancer 19: FAU Full Author Sætrom, Pål 20: AU Author Sætrom P 21: FAU Full Author Drabløs, Finn 22: AU Author Drabløs F 23: LA Language eng 24: PT Publication Type Evaluation Studies 25: PT Publication Type Journal Article 26: PT Publication Type Research Support, No 27: DEP Date of Electronic Publication 20101126 28: TA Journal Title Abbreviation Nucleic Acids Res 29: JT Journal Title Nucleic acids resear 30: JID NLM Unique ID 0411011 31: RN Registry Number/EC Number 0 (Transcription Fac 32: SB Subset IM 33: MH MeSH Terms Benchmarking 34: MH MeSH Terms Binding Sites 35: MH MeSH Terms *Chromatin Immunopre 36: MH MeSH Terms *High-Throughput Nuc 37: MH MeSH Terms *Software 38: MH MeSH Terms Transcription Factor 39: PMC PubMed Central Identifier PMC3045577 40: EDAT Entrez Date 2010/11/30 06:00 41: MHDA MeSH Date 2011/05/13 06:00 42: CRDT Create Date 2010/11/30 06:00 43: PHST Publication History Status 2010/11/30 06:00 [en 44: PHST Publication History Status 2010/11/30 06:00 [pu 45: PHST Publication History Status 2011/05/13 
06:00 [me 46: AID Article Identifier 10.1093/nar/gkq1187 47: AID Article Identifier gkq1187 [pii] 48: AID Article Identifier gkq1187 [pii] 49: PST Publication Status ppublish 50: SO Source Nucleic Acids Res. 2 Abbreviation Field value > > ## There is a listing of examples for each comment type. > (comment.ex.dt <- nc::capture_all_str( + fields.vec[938], + "br />\\s*", + Abbreviation="[A-Z]+", + "\\s*-\\s*", + citation="[^<]+?", + list( + "[.] ", + nc::field("PMID", ": ", "[0-9]+") + ), "?", + "<")) Abbreviation citation <char> <char> 1: CON Dev Cell. 2002 Jul;3(1):85-97 2: CIN N Engl J Med. 2003 Jul 17;349(3):211-2 3: CRI Orthop Nurs. 2003 May-Jun;22(3):232-9 4: CRF Biochemistry. 1994 May 10;33(18):5614-22 5: EIN Acta Obstet Gynecol Scand. 2003 Jan;82(1):102 6: EFR J Arthroplasty. 2002 Jun;17(4):524-6 7: RIN J Biochem Mol Biol. 2002 Nov 30;35(6):642 8: ROF Ware FE, Lehrman MA. J Biol Chem. 1996 Jun 14;271(24):13935-8 9: UIN Cochrane Database Syst Rev. 2002;(3):CD003688 10: UOF Cochrane Database Syst Rev. 2002;(2):CD003680 11: SPIN Ann Intern Med. 2003 Jun 3;138(11):I60 12: ORI Ann Intern Med. 2003 Jun 3;138(11):907-16 PMID <char> 1: 12110170 2: 12867604 3: 12872752 4: 8180186 5: 6: 12066289 7: 12476908 8: 8663248 9: 12137706 10: 12076500 11: 12779314 12: 12779301 > > ## Join abbreviations to see what kind of comments. > all.abbrevs[comment.ex.dt, on=.(Abbreviation)] Field Abbreviation <char> <char> 1: Comment on CON 2: Comment in CIN 3: Corrected and Republished in CRI 4: Corrected and Republished from CRF 5: Erratum in EIN 6: Erratum for EFR 7: Retraction in RIN 8: Retraction of ROF 9: Update in UIN 10: Update of UOF 11: Summary for patients in SPIN 12: Original report in ORI citation PMID <char> <char> 1: Dev Cell. 2002 Jul;3(1):85-97 12110170 2: N Engl J Med. 2003 Jul 17;349(3):211-2 12867604 3: Orthop Nurs. 2003 May-Jun;22(3):232-9 12872752 4: Biochemistry. 1994 May 10;33(18):5614-22 8180186 5: Acta Obstet Gynecol Scand. 2003 Jan;82(1):102 6: J Arthroplasty. 
2002 Jun;17(4):524-6 12066289 7: J Biochem Mol Biol. 2002 Nov 30;35(6):642 12476908 8: Ware FE, Lehrman MA. J Biol Chem. 1996 Jun 14;271(24):13935-8 8663248 9: Cochrane Database Syst Rev. 2002;(3):CD003688 12137706 10: Cochrane Database Syst Rev. 2002;(2):CD003680 12076500 11: Ann Intern Med. 2003 Jun 3;138(11):I60 12779314 12: Ann Intern Med. 2003 Jun 3;138(11):907-16 12779301 > > ## parsing bibtex file. > refs.bib <- system.file( + "extdata", "namedCapture-refs.bib", package="nc") > refs.vec <- readLines(refs.bib) > at.lines <- grep("@", refs.vec, value=TRUE) > str(at.lines) chr [1:24] " @Manual{namedCapture," " @Manual{TRE," " @Manual{re2r," ... > refs.dt <- nc::capture_all_str( + refs.vec, + "@", + type="[^{]+", + "[{]", + ref="[^,]+", + ",\n", + fields="(?:.*\n)+?.*", + "[}]\\s*(?:$|\n)") > str(refs.dt) Classes ‘data.table’ and 'data.frame': 24 obs. of 3 variables: $ type : chr "Manual" "Manual" "Manual" "Manual" ... $ ref : chr "namedCapture" "TRE" "re2r" "rematch2" ... $ fields: chr " title = {namedCapture: Named Capture Regular Expressions},\n author = {Toby Dylan Hocking},\n year = "| __truncated__ " title = {TRE: The free and portable approximate regex matching library},\n author = {Ville Laurikari},\n"| __truncated__ " title = {re2r: RE2 Regular Expression},\n author = {Qin Wenfeng},\n year = {2017},\n note = {R pac"| __truncated__ " title = {rematch2: Tidy Output from Regular Expression Matching},\n author = {Gábor Csárdi},\n year ="| __truncated__ ... - attr(*, ".internal.selfref")=<pointer: 0x2921e210> > > ## parsing each field of each entry. > eq.lines <- grep("=", refs.vec, value=TRUE) > str(eq.lines) chr [1:140] " title = {namedCapture: Named Capture Regular Expressions}," ... > strip <- function(x)sub("^\\s*\\{*", "", sub("\\}*,?$", "", x)) > refs.fields <- refs.dt[, nc::capture_all_str( + fields, + "\\s+", + variable="\\S+", + "\\s+=", + value=".*", strip), + by=.(type, ref)] > str(refs.fields) Classes ‘data.table’ and 'data.frame': 140 obs. 
of 4 variables: $ type : chr "Manual" "Manual" "Manual" "Manual" ... $ ref : chr "namedCapture" "namedCapture" "namedCapture" "namedCapture" ... $ variable: chr "title" "author" "year" "note" ... $ value : chr "namedCapture: Named Capture Regular Expressions" "Toby Dylan Hocking" "2019" "R package version 2019.01.14" ... - attr(*, ".internal.selfref")=<pointer: 0x2921e210> > with(refs.fields[ref=="HockingUseR2011"], structure( + as.list(value), names=variable)) $author [1] "Toby Dylan Hocking" $title [1] "Fast, named capture regular expressions in R 2.14" $year [1] "2011" $url [1] "http://web.warwick.ac.uk/statsdept/user-2011/TalkSlides/Lightening/2-StatisticsAndProg\\_3-Hocking.pdf" $booktitle [1] "useR 2011 conference proceedings" > ## the URL of my talk is now > ## https://user2011.r-project.org/TalkSlides/Lightening/2-StatisticsAndProg_3-Hocking.pdf > > if(!grepl("solaris", R.version$platform)){#To avoid CRAN check error on solaris + ## Parsing wikimedia tables: each begins with {| and ends with |}. + emoji.txt.gz <- system.file( + "extdata", "wikipedia-emoji-text.txt.gz", package="nc") + tables <- nc::capture_all_str( + emoji.txt.gz, + "\n[{][|]", + first=".*", + '\n[|][+] style="', + nc::field("font-size", ":", '.*?'), + '" [|] ', + title=".*", + lines="(?:\n.*)*?", + "\n[|][}]") + str(tables) + ## Rows are separated by |- + rows.dt <- tables[, { + row.vec <- strsplit(lines, "|-", fixed=TRUE)[[1]][-1] + .(row.i=seq_along(row.vec), row=row.vec) + }, by=title] + str(rows.dt) + ## Try to parse columns from each row. Doesn't work for second table + ## https://en.wikipedia.org/w/index.php?title=Emoji&oldid=920745513#Skin_color + ## because some entries have rowspan=2. 
+ contents.dt <- rows.dt[, nc::capture_all_str( + row, + "[|] ", + content=".*?", + "(?: [|]|\n|$)"), + by=.(title, row.i)] + contents.dt[, .(cols=.N), by=.(title, row.i)] + ## Make data table from + ## https://en.wikipedia.org/w/index.php?title=Emoji&oldid=920745513#Emoji_versus_text_presentation + contents.dt[, col.i := 1:.N, by=.(title, row.i)] + data.table::dcast( + contents.dt[title=="Sample emoji variation sequences"], + row.i ~ col.i, + value.var="content") + } Classes ‘data.table’ and 'data.frame': 2 obs. of 4 variables: $ first : chr " border=\"1\" cellspacing=\"0\" cellpadding=\"5\" class=\"wikitable nounderlines\" style=\"border-collapse:coll"| __truncated__ " border=\"1\" cellspacing=\"0\" cellpadding=\"5\" class=\"wikitable nounderlines\" style=\"border-collapse:coll"| __truncated__ $ font-size: chr " 67%" "small" $ title : chr "Sample emoji variation sequences" "Sample use of Fitzpatrick modifiers" $ lines : chr "\n|- style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:right\" | U+ || 2139 || 23"| __truncated__ "\n|-style=\"background:#F8F8F8;font-size:67%\"\n! scope=\"col\" colspan=\"2\" style=\"text-align:left\" | Code "| __truncated__ - attr(*, ".internal.selfref")=<pointer: 0x2921e210> Classes ‘data.table’ and 'data.frame': 19 obs. of 3 variables: $ title: chr "Sample emoji variation sequences" "Sample emoji variation sequences" "Sample emoji variation sequences" "Sample emoji variation sequences" ... $ row.i: int 1 2 3 4 5 6 1 2 3 4 ... $ row : chr " style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:right\" | U+ || 2139 || 231B |"| __truncated__ " style=\"background:#F8F8F8;font-size: 67%\"\n! scope=\"col\" style=\"text-align:left\" | default&nbsp;presenta"| __truncated__ "\n! scope=\"col\" style=\"background:#F8F8F8;font-size: 67%;text-align:left\" | base&nbsp;code&nbsp;point\n| ℹ "| __truncated__ "\n! 
scope=\"col\" style=\"background:#F8F8F8;font-size: 67%;text-align:left\" | base+VS15 (text)\n| {{emoji pre"| __truncated__ ...
 - attr(*, ".internal.selfref")=<pointer: 0x2921e210>
Error in `[.data.table`(contents.dt, , `:=`(col.i, 1:.N), by = .(title, :
  attempt access index 3/3 in VECTOR_ELT
Calls: [ -> [.data.table
Execution halted
Flavor: r-devel-linux-x86_64-fedora-gcc
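
Note: the examples check above stops at the grouped assignment `contents.dt[, col.i := 1:.N, by=.(title, row.i)]`. The following is a minimal sketch for isolating that step, not a confirmed fix. The toy table and its values are assumptions made up for illustration (they are not the package's data), and `by.result` and `rowid.result` are hypothetical names. It compares the grouped `:=` with `data.table::rowid()`, which computes the same within-group counter without `by=`; whether either form avoids the r-devel error has not been tested here.

```r
library(data.table)

## Toy stand-in for contents.dt (assumed values): one row per captured cell.
contents.dt <- data.table(
  title = c("A", "A", "A", "A", "B", "B"),
  row.i = c(1L, 1L, 2L, 2L, 1L, 1L),
  content = c("a", "b", "c", "d", "e", "f"))

## The step reported in the check log: number the cells within each
## (title, row.i) group by reference.
by.result <- copy(contents.dt)[, col.i := 1:.N, by = .(title, row.i)]

## The same numbering without a by= group, using data.table::rowid().
rowid.result <- copy(contents.dt)[, col.i := rowid(title, row.i)]

## Both approaches should give the same column index on this toy input.
stopifnot(identical(by.result$col.i, rowid.result$col.i))
```

If the grouped assignment fails on a given R build while the `rowid()` form succeeds, that would suggest the problem lies in the `by=` code path of `[.data.table` rather than in the captured data.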

Version: 2025.3.24
Check: re-building of vignette outputs
Result: ERROR
Error(s) in re-building vignettes:
--- re-building ‘v0-overview.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v0-overview.Rmd’
--- re-building ‘v1-capture-first.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v1-capture-first.Rmd’
--- re-building ‘v2-capture-all.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v2-capture-all.Rmd’
--- re-building ‘v3-capture-melt.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v3-capture-melt.Rmd’
--- re-building ‘v4-comparisons.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v4-comparisons.Rmd’
--- re-building ‘v5-helpers.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v5-helpers.Rmd’
--- re-building ‘v6-engines.Rmd’ using rmarkdown
[WARNING] Deprecated: --highlight-style. Use --syntax-highlighting instead.
--- finished re-building ‘v6-engines.Rmd’
--- re-building ‘v7-capture-glob.Rmd’ using rmarkdown
Quitting from v7-capture-glob.Rmd:257-272 [unnamed-chunk-18]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<error/rlang_error>
Error in `[.data.table`:
! attempt access index 6/6 in VECTOR_ELT
---
Backtrace:
    ▆
 1. ├─...[]
 2. └─data.table:::`[.data.table`(...)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error: processing vignette 'v7-capture-glob.Rmd' failed with diagnostics:
attempt access index 6/6 in VECTOR_ELT
--- failed re-building ‘v7-capture-glob.Rmd’
SUMMARY: processing the following file failed:
  ‘v7-capture-glob.Rmd’
Error: Vignette re-building failed.
Execution halted
Flavor: r-devel-linux-x86_64-fedora-gcc