Thursday, January 1, 2009

Parsing words

Suppose, a user inputs the line in Code 1 (without the quotations).

Code 1. Sample input by user.
“  My name           is   Earl ”

And you need to parse the words “My”, “name”, “is” and “Earl”. How do you do that? Note that the spaces and/or tabs between words, or before or after each word may be zero or more spaces and/or tabs. This just makes parsing more complicated, doesn’t it?

People have different ways of doing things. Let me present my way in Code 2.

Code 2. Sample code for parsing words.
001 #include "stdio.h"
002
003 int main() {
004 __char line[100], ctmp[100];
005 __int i, j;
006
007 __fgets(line, 100, stdin);
008
009 __for (i = 0; 1; i += 1) {
010 ____if (line[i] == '\n' line[i] == 0) {
011 ______break;
012 ____}
013
014 ____if (line[i] == ' ' line[i] == '\t') {
015 ______continue;
016 ____}
017
018 ____for (j = 0; 1 ; i += 1, j += 1) {
019 ______if ( line[i] == ' '
020 _________ line[i] == '\t'
021 _________ line[i] == '\n'
022 _________ line[i] == 0
023 ______) {
024 ________break;
025 ______}
026 ______ctmp[j] = line[i];
027 ____}
028 ____ctmp[j] = 0;
029
030 ____printf("B%sE\n", ctmp);
031 __}
032 }

Of course, there are easier ways like I could use strtok or other more sophisticated methods available. But somehow, I never learned how to use them, had a hard time trying to configure them, and eventually, I ended up just using the code in Code 2 after tiring frustrations one after the other.

The code in Code 2 uses the fgets function in line 007 to get an input line from the user. The for loop in line 009 is then used to process the input character by character.

The for loop starts by checking if the current character it is processing is an end character which can either be a carriage return or a null character (line 010). If the current character is an end character, the loop breaks and processing of the input line is finished.

If the loop determines that the current character is not an end character, it does a another check to see if the current character being processed is a white space which could be a literal space character or a tab character (line 014). If it is a white space character, the loop restarts with the next character.

So, what checks 010 and 014 do is to find a character that is not an end character and at the same time, not a white space also. If a character passes these two initial checks, then we know that we have come across a character that is the beginning of a word because the character is neither an end character or a white space.

The loop then enters into a second for loop (line 018). The second for loop first checks if the character is an end character or a white space (line 019). If it is, it means that we have reached the end of a word and so we break out of the second for loop to break further processing of the current word being parsed out.

In the second for loop, if the current character passes the check of line 019, meaning that the current character is a valid character for a word – valid meaning that it could be alphanumeric or any other character as long as it is not an end character or a white space, then the current character is added into the current word being formed (ctmp).

So the second for loop keeps looping, concatenating one character after another until the end of the word is detected through a character that is either an end character or a white space. A null character is then added to the end of the word to signify the end of the character string (line 028).

After we have formed a word, we can pretty much do anything we want to do with the word parsed out from the input line. In Code 2, we simply print it (line 030).

So we go back to the first for loop and repeat the process all over again until all characters in the input line are processed.

Figure 1 shows a sample output of Code 2 when it is ran with the input of Code 1.


Figure 1. Output when Code 2 is inputted with the input of Code 1.

No comments:

Post a Comment