1 year ago

Introduction to Python Programming - A Word-counter Program

Introduction

In the last article, we looked at controlling program execution with if-elif-else statements, match statements, for and while loops, pass statements, and try-except statements. In this article, we will put them to use by writing a program.

A Word Counter Program

Building a word counter will allow us to put some of the concepts we have learned into practice.

Suppose you have a string, either written literally in your code or read from a file, and you want to find its unique words and count how often each one occurs. Let us create a variable called text.

text = "A very long text declared or read from a file. \
    The text may contain very long lines but you should not worry because you have python skills. " \
    "Let us begin the conquest"

If you have a text file in the same directory/folder as your Python file, you can instead read the text from it:

    file = open("your_file_name.txt")
    text = file.read()
    file.close()
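A with statement is a common alternative to calling close() yourself, because it closes the file even if reading fails. The sketch below first writes a small sample file into a temporary folder so that it can run on its own; the file name sample_words.txt and its content are made up for illustration.

```python
import os
import tempfile

# Hypothetical file, created only so that this sketch is runnable.
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "sample_words.txt")

with open(file_name, "w", encoding="utf-8") as file:
    file.write("Let us begin the conquest")

# The file is closed automatically when the with block ends.
with open(file_name, encoding="utf-8") as file:
    text = file.read()

print(text)
print(file.closed)
```

With your own file, only the second with block is needed.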

Since string comparison in Python is case-sensitive, "The" and "the" would be counted as two different words. So, convert the text to lowercase.

text = text.lower()

To get the individual words in the text, use the split() method. Optionally, you can sort the text_split list using the list's sort() method.

text_split = text.split()

text_split.sort()
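Called without arguments, split() breaks the string on any run of whitespace (spaces, tabs, newlines), which is why the extra spaces inside text cause no trouble. A small sketch with a made-up sample string:

```python
# Made-up sample with uneven spacing and a newline.
sample = "The  quick fox\n  jumps over the   lazy dog"

words = sample.lower().split()
print(words)

# sort() rearranges the list in place and returns None,
# so do not write words = words.sort().
words.sort()
print(words)
```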

Get the unique words by passing the list to the set() function. A set has no defined order, so your output may list the words in a different order.


unique_words = set(text_split)

print(unique_words)

{'read', 'from', 'file.', 'because', 'lines', 'very', 'declared', 'you', 'should', 'text', 'a', 'conquest', 'worry', 'may', 'us', 'not', 'have', 'long', 'python', 'let', 'contain', 'skills.', 'the', 'begin', 'but', 'or'}
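If you want the number of unique words, or a display order that stays the same between runs, len() and sorted() can help. This is a minimal sketch with a made-up word list:

```python
# Made-up list with repeated words.
words = ["the", "text", "a", "the", "a", "very"]

unique_words = set(words)

# len() gives the number of distinct words.
unique_count = len(unique_words)
print(unique_count)

# sorted() returns a new list in alphabetical order.
ordered = sorted(unique_words)
print(ordered)
```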

Notice that some words, such as file. and skills., end with a period (.). Let us remove the periods using the map() function.

# A map returns a "map object"
text_split = map(lambda x: x[: -1] if x.endswith(".") else x,  text_split)

# convert back to list
text_split = list(text_split)

map() takes a function and an iterable and applies the function to every element of the iterable. It returns a lazy map object, which is why the code converts the result back to a list. The lambda keyword creates a small anonymous function that is typically used only once. We will see more of map and lambda in my next article on functions :)

The expression inside the lambda, x[: -1] if x.endswith(".") else x, is an if expression (also called a conditional expression), which was covered in the previous article. If the word ends with a period, the slice x[: -1] returns the word without its last character. Otherwise, the word is returned unchanged.

print(text_split)

['a', 'a', 'because', 'begin', 'but', 'conquest', 'contain', 'declared', 'file', 'from', 'have', 'let', 'lines', 'long', 'long', 'may', 'not', 'or', 'python', 'read', 'should', 'skills', 'text', 'text', 'the', 'the', 'us', 'very', 'very', 'worry', 'you', 'you']
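The map() call above handles only a trailing period. If your text also contains commas, question marks, or brackets, one possible extension (not part of the original program) is to strip every punctuation character from both ends of each word. The sketch assumes a made-up word list and uses string.punctuation from the standard library together with a list comprehension, which is a compact way of building a list with a loop.

```python
import string

# Made-up sample words for illustration.
words = ["file.", "conquest,", "(python)", "don't", "why?"]

# strip() removes any of the given characters from both ends of a word
# but leaves characters inside the word alone.
cleaned = [word.strip(string.punctuation) for word in words]
print(cleaned)
```

Note that the apostrophe in "don't" survives, because strip() never touches the middle of a string.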

Next, create an empty dictionary to hold the words and their count. Let us call the variable summary.

summary = {}

Using a for loop, we will go through the text_split list. If a word is already in the dictionary, we increase its count by 1; otherwise, we add the word to the dictionary and set its count to 1:

for word in text_split:
    if word in summary:
        summary[word] = summary[word] + 1
    else:
        summary[word] = 1

The expression word in summary checks whether the word is already a key in the dictionary.
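It may help to see that in looks only at the keys of a dictionary, not at its values. A short sketch with a made-up dictionary:

```python
# Made-up counts for illustration.
counts = {"a": 2, "text": 1}

key_found = "a" in counts             # "a" is a key
value_as_key = 2 in counts            # 2 is a value, not a key
value_found = 2 in counts.values()    # search the values explicitly

print(key_found, value_as_key, value_found)
```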

Here is a shorter way to re-write the code above using the dictionary's get() method.

summary2 = {}

for word in text_split:
    summary2[word] = summary2.get(word, 0) + 1

The summary and summary2 dictionaries have the same content.

print(summary == summary2)

True

The expression summary2.get(word, 0) + 1 gets the count associated with word. If the word is not yet a key in the dictionary, get() returns the default, zero (0), instead. Adding 1 then either sets summary2[word] to 1 or increments the count already stored in summary2[word].
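To see get() on its own, here is a small sketch with a made-up dictionary. Note that get() only reads from the dictionary; the key is added by the assignment, not by get().

```python
# Made-up counts for illustration.
counts = {"long": 2}

existing = counts.get("long", 0)    # key present: its value is returned
missing = counts.get("short", 0)    # key absent: the default is returned
print(existing, missing)

# get() did not add "short" to the dictionary.
added_by_get = "short" in counts
print(added_by_get)

# The assignment is what creates the key.
counts["short"] = counts.get("short", 0) + 1
print(counts)
```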

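Finally, print the content of summary. Dictionaries keep their keys in insertion order (guaranteed since Python 3.7), and text_split was sorted, so the words appear alphabetically. If you skipped the optional sort, the order will differ from the one shown here, but the words and their counts will be the same.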

print(summary)

{'a': 2, 'because': 1, 'begin': 1, 'but': 1, 'conquest': 1, 'contain': 1, 'declared': 1, 'file': 1, 'from': 1, 'have': 1, 'let': 1, 'lines': 1, 'long': 2, 'may': 1, 'not': 1, 'or': 1, 'python': 1, 'read': 1, 'should': 1, 'skills': 1, 'text': 2, 'the': 2, 'us': 1, 'very': 2, 'worry': 1, 'you': 2}

To obtain a much prettier print, use the pp() function from the pprint module (available since Python 3.8). We have not discussed modules yet, so do not worry about the import line. Just type it for now.

from pprint import pp

# remember that summary and summary2 have the same content
pp(summary2)

{'a': 2,
 'because': 1,
 'begin': 1,
 'but': 1,
 'conquest': 1,
 'contain': 1,
 'declared': 1,
 'file': 1,
 'from': 1,
 'have': 1,
 'let': 1,
 'lines': 1,
 'long': 2,
 'may': 1,
 'not': 1,
 'or': 1,
 'python': 1,
 'read': 1,
 'should': 1,
 'skills': 1,
 'text': 2,
 'the': 2,
 'us': 1,
 'very': 2,
 'worry': 1,
 'you': 2}
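As an aside, the standard library's collections module has a Counter class that does the same counting as the loops above in a single call. This is an alternative to the article's approach, not a replacement for understanding it; the word list below is made up for illustration.

```python
from collections import Counter

# Made-up word list for illustration.
words = ["a", "very", "long", "a", "very", "a"]

# Counter builds a dictionary-like object mapping each word to its count.
counts = Counter(words)
print(counts["a"])

# most_common(n) returns the n most frequent words with their counts.
top_two = counts.most_common(2)
print(top_two)

# A Counter compares equal to a plain dictionary with the same content.
same = counts == {"a": 3, "very": 2, "long": 1}
print(same)
```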

Now, here is the full program.

# This could come from a file
text = "A very long text declared or read from a file. \
    The text may contain very long lines but you should not worry because you have python skills. " \
    "Let us begin the conquest"

# Convert to lower case
text = text.lower()

# Create a list of words
text_split = text.split()

# Sort in alphabetical order
text_split.sort()

# Get a set of unique words
unique_words = set(text_split)

print(unique_words)

# Go through each word and remove any period (.)
text_split = map(lambda x: x[: -1] if x.endswith(".") else x,  text_split)

# Convert the map output back to a list
text_split = list(text_split)

print(text_split)

summary = {}

for word in text_split:
    if word in summary:
        summary[word] = summary[word] + 1
    else:
        summary[word] = 1


summary2 = {}

for word in text_split:
    summary2[word] = summary2.get(word, 0) + 1


print(summary == summary2)

# print(summary)

from pprint import pp

pp(summary2)

Conclusion

In this article, we created a word-counter program. Had we read the text from a file, we could also report the number of lines and the size of the file in bytes, just like the wc command in a Linux shell. In the next article, we will deal with functions. Thanks for reading.
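For the curious, here is a rough sketch of what such wc-style totals could look like. It writes a small made-up file into a temporary folder so that it can run on its own; the file name and content are assumptions for illustration, and the line count follows wc in counting newline characters.

```python
import os
import tempfile

# Hypothetical file, created only so that this sketch is runnable.
folder = tempfile.mkdtemp()
file_name = os.path.join(folder, "sample_words.txt")

# newline="" writes "\n" unchanged on every platform,
# so the byte count below is the same everywhere.
with open(file_name, "w", encoding="utf-8", newline="") as file:
    file.write("A very long text.\nLet us begin the conquest\n")

with open(file_name, encoding="utf-8") as file:
    text = file.read()

line_count = text.count("\n")             # like wc -l
word_count = len(text.split())            # like wc -w
byte_count = os.path.getsize(file_name)   # like wc -c

print(line_count, word_count, byte_count)
```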
