{"id":2002,"date":"2015-05-18T14:36:09","date_gmt":"2015-05-18T21:36:09","guid":{"rendered":"http:\/\/lukemiller.org\/?p=2002"},"modified":"2015-05-18T14:36:09","modified_gmt":"2015-05-18T21:36:09","slug":"basic-text-string-functions-in-r","status":"publish","type":"post","link":"https:\/\/lukemiller.org\/index.php\/2015\/05\/basic-text-string-functions-in-r\/","title":{"rendered":"Basic text string functions in R"},"content":{"rendered":"<p>To get the length of a text string (i.e. the number of characters in the string):<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;] nchar()[\/code]<br \/>\nUsing <code>length()<\/code> would just give you the length of the vector containing the string, which is 1 when the vector holds a single string.<\/p>\n<p>To get the position of one or more regular expression matches in a text string x:<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;]<br \/>\npos = regexpr('pattern', x) # Returns position of 1st match in a string<br \/>\npos = gregexpr('pattern', x) # Returns positions of every match in a string<br \/>\n[\/code]<\/p>\n<p>To find which elements of a <b>vector<\/b> x of text strings contain a regular expression match (this returns the <b>index<\/b> of each matching string in the vector, <i>not<\/i> the position of the match in the text string itself):<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;]<br \/>\npos = grep('pattern', x)<br \/>\n[\/code]<\/p>\n<p>To extract part of a text string based on position in the text string, where <i>first<\/i> and <i>last<\/i> are the locations in the text string, usually found by the <code>regexpr()<\/code> function:<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;]<br \/>\nkeep = substr(x, first, last)<br \/>\n[\/code]<\/p>\n<p>To replace part of a text string with some other text:<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;]<br \/>\nsub('pattern', replacement, input) # Changes 
only the 1st pattern match per string<br \/>\ngsub('pattern', replacement, input) # Changes every occurrence of a pattern match<br \/>\n[\/code]<\/p>\n<p>The pattern argument in the various regular expression functions can include character classes enclosed in square brackets. See <code>?regex<\/code> for the explanation of regular expressions. For example, to make a pattern that matches any numerical digit, you could use <code>'[0-9]'<\/code> as the pattern argument. You may also use several predefined character classes such as <code>'[[:digit:]]'<\/code>, which also matches any numerical digit in the string, same as the <code>'[0-9]'<\/code> pattern. Note the double brackets: <code>[:digit:]<\/code> is only valid inside a bracket expression.<\/p>\n<h2>File name stuff<\/h2>\n<p>To get a list of file names (and paths) in a directory:<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;]<br \/>\nfnames = dir(&quot;.\/path\/to\/my\/data&quot;, full.names=TRUE)<br \/>\n[\/code]<\/p>\n<p>To extract just the filename from a full path:<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;]<br \/>\nfname = basename(path)<br \/>\n[\/code]<\/p>\n<p>To extract the directory path from a file path:<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;]<br \/>\ndirectory = dirname(path)<br \/>\n[\/code]<\/p>\n<p>If you have a text string assigned to a variable in the R workspace, and you want to parse it using various other functions, you can use the <code>textConnection()<\/code> function to feed your string to the other function.<br \/>\n[code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;]<br \/>\nmydataframe = read.csv(textConnection(myString)) # If myString contained comma-separated values, this would convert them to a data frame.<br \/>\n[\/code]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>To get the length of a text string (i.e. 
the number of characters in the string): [code lang=&#8221;R&#8221; gutter=&#8221;false&#8221;] nchar()[\/code] Using length() would just give you the length of the vector containing the string, which will be 1 if the string is just a single string. To get the position of a regular expression match(es) [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,218],"tags":[219],"class_list":["post-2002","post","type-post","status-publish","format-standard","hentry","category-journal","category-r-project","tag-text-string"],"_links":{"self":[{"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/posts\/2002","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/comments?post=2002"}],"version-history":[{"count":17,"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/posts\/2002\/revisions"}],"predecessor-version":[{"id":2019,"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/posts\/2002\/revisions\/2019"}],"wp:attachment":[{"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/media?parent=2002"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/categories?post=2002"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lukemiller.org\/index.php\/wp-json\/wp\/v2\/tags?post=2002"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}