String Parsing in C++: Methods and Examples
In C++, string parsing is essential for many tasks such as command line applications, search applications, and network applications. There are various methods like using functions and algorithms, string class functions, sscanf functions, or regular expressions (regex). Reading input lines can be done with functions like getline, and regex can be a powerful tool for complex pattern matching. This guide provides insights into different techniques for string parsing in C++.
Presentation Transcript
Cosc 2030 String Parsing in c++ Regular Expressions
String parsing. One of the main tasks a program may need to do is take a string and parse it to determine the next step in the program. Examples: command line applications; search applications (like Bing and Google); most network applications, which send and receive data as strings; and too many more to even begin to name.
How to parse. As with all things in C/C++, you can do it any number of ways. Develop your own functions and algorithms to parse a string. Use the member functions of the string class. Use the sscanf functions, which are more like regular expressions. Use the regex library from the standard library, which is regular expressions; it requires Visual Studio 2013+ or a GCC release with a working <regex> (4.9+; earlier releases shipped the header largely unimplemented).
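As a minimal sketch of the "string class functions" route, the helper below splits a line on a single delimiter using only std::string::find and std::string::substr. The name split_on is hypothetical and chosen just for this illustration; it is one of many ways to hand-roll a parser.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the slides): split `text` on one delimiter
// character using only std::string::find and std::string::substr.
std::vector<std::string> split_on(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = text.find(delim, start);
        if (pos == std::string::npos) {
            // No more delimiters: the rest of the string is the last piece.
            parts.push_back(text.substr(start));
            break;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;  // continue just past the delimiter
    }
    return parts;
}
```

For example, split_on("add R1 10", ' ') yields the three pieces "add", "R1", and "10".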
Reading a line of input. The standard cin >> reads up to whitespace and then stops. This is not always the functionality we want. The getline function comes in two forms. The first, cin.getline(c_str, 256), reads to the end-of-line mark or until the buffer limit is reached, whichever comes first. It requires a C-string (a char array) instead of a string. Example: char stuff[256]; cin.getline(stuff, 256); But this is still not the method we want, since it requires C-strings.
Reading a line of input (2). The second getline form is the one we want to use, since it fills in a string. It is a free function declared in <string>: getline(cin, string). Example: string stuff; getline(cin, stuff);
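A short sketch of the string-based getline in a loop follows. The function names read_lines and lines_of are hypothetical; the stream is passed in as a parameter so the same loop works with std::cin or with an in-memory string.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read every line from any input stream.
// std::getline returns the stream, which converts to false at end of input.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);  // the newline itself is not stored
    }
    return lines;
}

// Same loop, fed from an in-memory string instead of the keyboard.
std::vector<std::string> lines_of(const std::string& text) {
    std::istringstream in(text);
    return read_lines(in);
}
```

Calling read_lines(std::cin) would collect keyboard input line by line, spaces included, until end of input.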
Regular Expressions. Regex for short. Likely the most powerful way to do any string processing. Use: create a pattern that you want to match with, then run the match. If it returns true, then the string matched the pattern. You can also get all the matches into an array to use as well. Problem: regex patterns can be very complex and we don't have the time (about 6 lectures) to cover the entire regex set. This will only cover the very basics.
Code for regex. Include the regex header: #include <regex>. Define the pattern (note that pattern is a variable!): std::regex pattern("string pattern"); Optionally, declare the object that will contain the sequence of sub-matches: std::match_results<std::string::const_iterator> result; (std::smatch is a shorthand for this type.)
Code for regex (2). Use regex_match to match the full string: if (std::regex_match(string, result, pattern)). If true, there was a match; if the pattern captures, result holds the matches, which you can check with result.size() > 0 or !result.empty(). Use regex_search to match any part of a string: if (std::regex_search(string, result, pattern)). It works the same as regex_match with respect to result.
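The difference between the two calls can be sketched with two small wrappers. The names matches_whole and matches_anywhere are hypothetical and exist only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical wrapper: true only if the ENTIRE text matches the pattern.
bool matches_whole(const std::string& text, const std::string& pat) {
    std::regex pattern(pat);
    std::smatch result;  // shorthand for match_results<string::const_iterator>
    return std::regex_match(text, result, pattern);
}

// Hypothetical wrapper: true if ANY part of the text matches the pattern.
bool matches_anywhere(const std::string& text, const std::string& pat) {
    std::regex pattern(pat);
    std::smatch result;
    return std::regex_search(text, result, pattern);
}
```

With the pattern "ello" and the text "Hello", matches_anywhere returns true while matches_whole returns false, because the leading H is not covered by the pattern.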
Simple matching. regex pattern("ello"); string line = "Hello"; if (std::regex_search(line, result, pattern)) cout << "Match " << line << endl; Output: Match Hello
Simple matching (2). regex pattern("ello"); string line = "Hi"; if (std::regex_search(line, result, pattern)) cout << "Match " << line << endl; This would not match. Neither would line = "eLLo";
Case insensitive matching. Use the constant icase: character matching is then performed without regard to case. regex pattern("ello", std::regex_constants::icase); string line = "heLLo"; if (std::regex_search(line, result, pattern)) cout << "Match " << line << endl; Output: Match heLLo
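A compact sketch of the icase flag, wrapped in a hypothetical helper named contains_ignoring_case so the behavior is easy to try on several inputs:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: search `text` for `pat`, ignoring letter case.
bool contains_ignoring_case(const std::string& text, const std::string& pat) {
    // The second constructor argument switches on case-insensitive matching.
    std::regex pattern(pat, std::regex_constants::icase);
    return std::regex_search(text, pattern);
}
```

Here contains_ignoring_case("heLLo", "ello") and contains_ignoring_case("eLLo", "ello") are both true, while "Hi" still does not match.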
Alternation matching. Assume we are using regex_search unless otherwise noted. | allows matching with an or. regex pattern("Fred|Wilma|Pebbles"); is true if the string contains Fred, Wilma, or Pebbles. regex pattern("Fred|Wilma|Pebbles Flintstone"); matches Fred, Wilma, or Pebbles Flintstone.
Grouping and alternation. ( ) allow you to group matching and to end the alternation as well. regex pattern("(Fred|Wilma|Pebbles) Flintstone") matches Fred Flintstone, Wilma Flintstone, or Pebbles Flintstone. regex pattern("(Blue|Song)bird") matches Bluebird or Songbird. Remember we are using regex_search, so the line could be "There is a Songbird singing in the tree".
Grouping and alternation (2). regex ex3("(p|g|m|s|b)et"); is true if the string contains pet, get, met, set, or bet. Note that () are also used to capture the match, so the result variable will tell you which letter it matched. regex ex4("th(is|at)"); is true if the string contains this or that.
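To see how the result variable reports which alternative matched, here is a small sketch. The helper name first_capture is hypothetical; it returns the text of the first parenthesized group, or an empty string when nothing matched.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: return what the first ( ) group captured,
// or "" if the pattern was not found or has no group.
std::string first_capture(const std::string& text, const std::string& pat) {
    std::regex pattern(pat);
    std::smatch result;
    if (std::regex_search(text, result, pattern) && result.size() > 1) {
        return result[1].str();  // result[0] is the whole match
    }
    return "";
}
```

For instance, first_capture("I will get it", "(p|g|m|s|b)et") gives "g", and first_capture("not that one", "th(is|at)") gives "at".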
Single character matching. Use [] for single character matching. regex pattern("[abc]") is true if the string contains a and/or b and/or c. regex pattern("[pgmsb]et") is true if it contains pet, get, met, set, or bet. regex pattern("[Fred]") is true if it contains F and/or r and/or e and/or d. A leading ^ inside [] means "characters not listed": regex pattern("[^abc]") is true if the string contains at least one character other than a, b, or c. regex pattern("[a-z]") is true if it contains any lowercase letter a through z.
Single character or'd matching (2). regex pattern("[0-9]") is true if the string contains any digit 0 through 9. regex pattern("[0-9\\-]") matches 0 through 9 or the minus sign. regex pattern("[a-z0-9\\^]") matches any single lowercase letter, digit, or ^. regex pattern("[a-zA-Z0-9_]") matches any single letter, digit, or underscore. regex pattern("[^aeiouAEIOU]") matches any character in the string that is not a vowel. Note that inside a C++ string literal the backslash itself must be doubled, so the regex \- is written \\-.
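Because a character class matches one character at a time, counting the matches shows exactly which characters a class accepts. The sketch below uses std::sregex_iterator for that; the helper name count_matches is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: count the non-overlapping matches of `pat` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pat) {
    std::regex pattern(pat);
    // sregex_iterator walks through every successive match in the string;
    // a default-constructed iterator marks the end of the sequence.
    std::sregex_iterator first(text.begin(), text.end(), pattern);
    std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

With the text "Room 42-B", the class "[0-9]" is found 2 times, and with "Hello" the negated class "[^aeiouAEIOU]" is found 3 times (H, l, l).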
Matching quantifiers. For multiple uses, write {min,max}. regex pattern("a{3}") is true if the string contains aaa. regex pattern("a{3,}") matches aaa, aaaa, aaaaa, aaaaaa, etc. regex pattern("a{3,5}b") matches aaab, aaaab, aaaaab. A common mistake: regex pattern("Fred{3}") matches Freddd, not FredFredFred. How to actually do it: regex pattern("(Fred){3}") matches FredFredFred.
Matching quantifiers (2). regex pattern("a{0,5}") matches a, aa, aaa, aaaa, aaaaa, and also when there are no a's. regex pattern("a*"): * matches 0 or more times (max match). regex pattern("a*?"): *? matches 0 or more times (min match). All three patterns above match "aaaa". The difference between min and max matching: against "aaaa", * matches "aaaa" while *? matches the empty string, since zero a's already satisfies it. A max match takes as many characters as it can, while a min match takes as few characters as it can. This becomes important later on.
Matching quantifiers (3). + matches 1 or more times (max match); +? matches 1 or more times (min match). regex pattern("a+") is true if there are 1 or more a's. ? matches 0 or 1 time (max match); ?? matches 0 or 1 time (min match). regex pattern("a?") is true if there is 1 a or no a's. Also, {3,5}? is a min match that tries to match only 3 where possible, and {3,5} is a max match that tries to match 5 where possible.
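The max versus min behavior is easiest to see by looking at the text that was actually matched. The sketch below returns result[0], the whole match; the helper name first_match is hypothetical, and note that it returns an empty string both for "no match" and for a successful empty match.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: return the text of the first match of `pat` in `text`.
// Returns "" when nothing matched OR when the match itself is empty.
std::string first_match(const std::string& text, const std::string& pat) {
    std::regex pattern(pat);
    std::smatch result;
    if (std::regex_search(text, result, pattern)) {
        return result[0].str();  // result[0] is the entire matched text
    }
    return "";
}
```

Against "aaaa", the pattern "a*" yields "aaaa" and "a+?" yields "a", while "a{2,3}" yields "aaa" and "a{2,3}?" yields "aa".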
Matching quantifiers (4). regex pattern("fo+ba?r") matches f, 1 or more o's, b, 0 or 1 a, then an r. Match: fobar, foobar, foobr. Non-match: fbar (missing o), foobaar (too many a's). regex pattern("fo*ba?r") matches f, 0 or more o's, b, 0 or 1 a, then an r. Match: fobar, fbr, fooobr, etc. Inside [], quantifier symbols are "normal" characters: regex pattern("[.?!+]*") matches zero or more of ., ?, !, or +.
Trying out what we have learned. What will the following match? 1. regex pattern("a+[bc]") 2. regex pattern("(a|be)", std::regex_constants::icase) 3. regex pattern("Hi{1,3} There!?") 4. regex pattern("(Foo)?Bar", std::regex_constants::icase) 5. regex pattern("[1-9][1-9][a-z]*") 6. regex pattern("[a-zA-Z]+, [A-Z]{2} [0-9]{5}") Write a regular expression for these: 1. Match a social security number (with or without dashes). 2. A street address: a number, then a name, followed by either St, Ln, Rd, or nothing; also case insensitive.
Metasymbols. . matches one character (except newline). regex pattern(".") is always true, except when the string is empty or contains only newlines. regex pattern("d.g") is true for d, any character, and g, so dog, dbg, dag, dcg, d g, etc. regex pattern("d.*g") is true for d, 0 or more characters, and g, so dg, dog, dasdfg, d g, etc. regex pattern("d.+g") is true for d, 1 or more characters, and g, so NOT dg, but the rest: dog, dasdfg, d g, etc.
Metasymbols (2). regex pattern("d.?g") is true for d, any single character, and g, AND for dg. regex pattern("d.{0,1}g") is true for d, any single character, and g, AND for dg; it is the same as the pattern above. regex pattern("d.{2}g") is true for d, 2 characters, and g, so doog, dafg, dghg, etc. regex pattern("d.{2,5}g") is true for d, 2 to 5 characters, and g, so dooog, doog, dXXXXXg, dXobgg, etc.
Metasymbols (3): anchoring. ^ matches the beginning of the string (inside [] it means "not" instead). $ matches the end of the string. regex pattern("^dog$") is true only for "dog", not "ddogg". Note that we could just use regex_match(string, result, pattern) instead of anchoring for this one, since we just want to find "dog". regex pattern("^dog") is true only when the string starts with "dog", so "dog", "doga", etc.
Metasymbols (4). regex pattern("dog$") is true when the string ends with "dog": "dog", "asdfadfdog", "ddddooodog". regex pattern("^.$") is true when the string is one character long and that character is not the newline symbol. regex pattern("^[abc]+") is true when the string starts with "a", "aa", "aaa", etc., with any characters following; or "b", "bb", "bbb", etc., with any characters following; or "c", "cc", "ccc", etc., with any characters following; as well as any combination of a's, b's, and c's, such as "abcabc".
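The note about regex_match standing in for ^...$ can be checked with a small sketch. Both helper names are hypothetical; the anchored version wraps the pattern in a non-capturing group (?: ) so that an alternation inside it stays anchored on both sides.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: full-string match via regex_match.
bool whole_string_is(const std::string& text, const std::string& pat) {
    return std::regex_match(text, std::regex(pat));
}

// Hypothetical helper: the same test expressed with regex_search and anchors.
bool anchored_search(const std::string& text, const std::string& pat) {
    // (?: ) groups without capturing, so "a|b" becomes ^(?:a|b)$ and not ^a|b$.
    return std::regex_search(text, std::regex("^(?:" + pat + ")$"));
}
```

Both helpers accept "dog" and reject "ddogg" for the pattern "dog", which is the equivalence the slide describes.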
Metasymbols (5). \d matches a digit, same as [0-9]. \D matches a non-digit, [^0-9]. \s matches whitespace, [ \t\n\r\f]. \S matches a non-whitespace character, [^ \t\n\r\f]. \w matches a word character, [a-zA-Z0-9_]. \W matches a non-word character, [^a-zA-Z0-9_]. Note, we have an issue in C++ strings: the \ means something else there. So while the regex is \d, we have to write \\d in the string literal for the regex to receive it correctly.
Metasymbols (6). regex pattern("\\d") is true when the string contains a digit. regex pattern("\\d+") is true when the string contains 1 or more digits. regex pattern("\\w\\d") is true when it contains a word character followed by 1 digit. regex pattern("\\w+\\d") is true when it contains 1 or more word characters followed by 1 digit, so these match: "abc1", "a1", "11", "_9", "Z8", and "a1a1".
Metasymbols (7). regex pattern("^\\s\\w\\d") is true when the string starts with a whitespace character, then a word character, and then a digit: " 11", "\ta1", "\n11", etc. regex pattern("^\\s*\\w\\d") is true when it starts with 0 or more whitespace characters, then a word character, and then a digit: " 11", "11", " \t a1", etc.
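One way to sidestep the doubled backslashes is a C++11 raw string literal, R"(...)", in which a backslash is taken literally. This is an alternative the slides do not use; the sketch below shows both spellings of the same pattern, with hypothetical helper names.

```cpp
#include <regex>
#include <string>

// Hypothetical helper using the escaped spelling from the slides.
bool word_then_digit_escaped(const std::string& text) {
    std::regex pattern("\\w\\d");  // the regex engine receives \w\d
    return std::regex_search(text, pattern);
}

// Hypothetical helper using a raw string literal for the same pattern.
bool word_then_digit_raw(const std::string& text) {
    std::regex pattern(R"(\w\d)");  // no doubling needed inside R"( )"
    return std::regex_search(text, pattern);
}

// Leading optional whitespace, then a word character, then a digit.
bool starts_with_word_digit(const std::string& text) {
    std::regex pattern(R"(^\s*\w\d)");
    return std::regex_search(text, pattern);
}
```

Both spellings behave identically: "abc1", "a1", "11", and "_9" match, while "abc" does not.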
Trying them out again. What will the following match? 1. regex pattern("a+\\w*?") 2. regex pattern("\\w\\s*\\w+") 3. regex pattern("^\\d+[a-z]*") 4. regex pattern("\\w+,\\s\\w{2}\\s{2}\\d{5}") Write a regular expression for these: Rewrite #4 so the city can be two or more words. Match a string that must start with a letter, then have any number of letters and/or numbers (or none at all), but end with a number.
Capturing. To capture the matches, use () around the part you want to capture. regex ex13("(\\w+)"); finds 1 or more word characters and captures the resulting match. regex ex14("(\\w+)\\s+(\\w+)"); finds 1 or more word characters, then whitespace, then 1 or more word characters, and captures the two word-character matches. Example: for "hi there", result[1]="hi" and result[2]="there". regex ex15("(\\d+) (.*)"); What does this capture?
Capturing (2). line = "Hi There"; regex pattern2("(((\\w+) )(\\w+))"); regex_search(line, result, pattern2); result[1]="Hi There", result[2]="Hi ", result[3]="Hi", result[4]="There"
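Nested groups are numbered by the position of their opening parenthesis, left to right. A small sketch that collects every entry of the result object makes the numbering visible; the helper name captures is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return result[0], result[1], ... as strings.
// The vector is empty when the pattern is not found.
std::vector<std::string> captures(const std::string& text,
                                  const std::string& pat) {
    std::vector<std::string> out;
    std::regex pattern(pat);
    std::smatch result;
    if (std::regex_search(text, result, pattern)) {
        for (std::size_t i = 0; i < result.size(); ++i) {
            out.push_back(result[i].str());  // index 0 is the whole match
        }
    }
    return out;
}
```

For "Hi There" with the pattern from the slide, the returned vector has five entries: the whole match followed by "Hi There", "Hi ", "Hi", and "There".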
Capturing (3). line = "a xxx c xxxxxc xxx d"; regex pattern3("(.+)x(.+)c"); regex_search(line, result, pattern3); result[1] = ? result[2] = ? Hint: these are max matches.
Regex and Python. You will find regex exists in most languages (at least most useful languages). In Python it's part of the re module: import re; line = "a xxx c xxxxxc xxx d"; x = re.search("(.+)x(.+)c", line). x is a match object for the first match (or None if there is none), and x.group() is the string that matched. re.findall returns a list of all the matches.
Java: regex in the String class. Java uses regex in many different String methods. Remember that String.split() can take a regex to split a string. For matching, we will use the String.matches() method, which returns true or false depending on whether the whole string matches the pattern. Example: s = "10 add R1"; s.matches("(\\d+) (.*)"); // is true. line = "a xxx c xxxxx c xxx d"; line.matches(".+x.+c"); // is false. line.matches("(.*).+x.+c(.*)"); // is true, since it matches the whole line.
Java using the regex library. Using java.util.regex, we can do substring matches as well as full-string matches. Pattern.matches(regex, string) returns true or false depending on whether the whole string matches the regex pattern, just like String.matches(). find() allows substring matches, but is more complex: first compile the pattern, then get a matcher, which can use find(). Pattern pattern = Pattern.compile("(.+)x(.+)c"); Matcher matcher = pattern.matcher(s); if (matcher.find()) { System.out.println(matcher.group()); } // prints the matched text
Java using the regex library (2). Example: String line2 = "a xxx c xxxxx c xxx d"; Pattern pattern = Pattern.compile("(.+)x(.+)c"); Matcher matcher = pattern.matcher(line2); if (matcher.find()) { System.out.println(matcher.group()); } Output: a xxx c xxxxx c
Regex reference. http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c15339 http://www.codeproject.com/KB/string/TR1Regex.aspx Patterns: http://msdn.microsoft.com/en-us/library/bb982727.aspx https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html
Q & A