Introduction to Python Programming - A Word-counter Program
Introduction
In the last article, we looked at controlling program execution with if-elif-else statements, match statements, for and while loops, pass statements, and try-except statements. In this article, we will put those concepts to work in a small program.
A Word Counter Program
Building a word counter will allow us to put some of the concepts we have learned into practice.
Suppose you have a string, either written as a literal or read from a file, and you want to find its unique words and count how often each one occurs. Let us create a variable called text.
text = "A very long text declared or read from a file. \
The text may contain very long lines but you should not worry because you have python skills. " \
"Let us begin the conquest"
If you have a text file in the same directory/folder as your Python file, you can read the text from it instead:
file = open("your_file_name.txt")
text = file.read()
file.close()
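A common alternative is to open the file inside a with block, which closes the file automatically even if reading fails. The sketch below is only an illustration: it first writes a small sample file (the name sample.txt and its content are made up) into a temporary folder so that the example can run on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical sample file, created only so the example is self-contained.
    path = os.path.join(folder, "sample.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write("Let us begin the conquest")

    # Read the whole file; the with block closes it when the block ends.
    with open(path, encoding="utf-8") as file:
        text = file.read()

print(text)         # Let us begin the conquest
print(file.closed)  # True
```

With your own file, only the second with block is needed, with path replaced by your file name.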
Since Python string comparisons are case-sensitive, "The" and "the" would be counted as different words. So, convert the text to lowercase.
text = text.lower()
To get the individual words in the text, use the split() method, which splits on whitespace by default. Optionally, you can sort the resulting text_split list using the list's sort() method.
text_split = text.split()
text_split.sort()
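To see what these two calls do, here is a small sketch on a made-up string (the sample variable is hypothetical). It shows that split() with no arguments treats any run of spaces or newlines as a single separator, and that sort() changes the list in place and returns None.

```python
# Hypothetical sample with a double space and a newline in it.
sample = "the  quick brown\nfox jumps over the dog"

words = sample.split()  # splits on any run of whitespace
print(words)
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'dog']

result = words.sort()   # sorts the list in place
print(result)           # None
print(words)
# ['brown', 'dog', 'fox', 'jumps', 'over', 'quick', 'the', 'the']
```

Because sort() returns None, writing words = words.sort() would lose the list.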
View the unique words in the list using the set() function. A set is unordered, so your output may list the words in a different order.
unique_words = set(text_split)
print(unique_words)
{'read', 'from', 'file.', 'because', 'lines', 'very', 'declared', 'you', 'should', 'text', 'a', 'conquest', 'worry', 'may', 'us', 'not', 'have', 'long', 'python', 'let', 'contain', 'skills.', 'the', 'begin', 'but', 'or'}
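If you also want the number of unique words, or want to view them in a predictable order, len() and sorted() can be applied to the set. A minimal sketch on a made-up list of words:

```python
# Hypothetical sample list with repeated words.
words = ["the", "text", "a", "the", "a", "file"]

unique_words = set(words)    # duplicates are dropped
print(len(unique_words))     # 4
print(sorted(unique_words))  # ['a', 'file', 'text', 'the']
```

Note that sorted() returns a new list and leaves the set itself unchanged.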
Notice that some words, like file. and skills., end with a period (.). Let us remove the trailing periods using the map() function.
# A map returns a "map object"
text_split = map(lambda x: x[: -1] if x.endswith(".") else x, text_split)
# convert back to list
text_split = list(text_split)
map() takes a function and an iterable/sequence and applies the function to every element of the sequence. The lambda you see above is used to create a one-time-use function. We will see more of map and lambda in my next article on functions :)
The expression within the lambda function, x[: -1] if x.endswith(".") else x, is an example of an if expression. You will find it in the previous article. If the word ends with a period, take a slice that excludes the last character ([: -1]). If not, give me back the word.
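For comparison, the same if expression can be placed in a list comprehension, which builds the list directly without a separate list() call. This is just an alternative sketch on a made-up list, not a replacement for the map() version used in this article.

```python
# Hypothetical sample list of words.
words = ["file.", "skills.", "text", "a"]

# The map() and lambda version used in the article.
with_map = list(map(lambda x: x[:-1] if x.endswith(".") else x, words))

# The same if expression inside a list comprehension.
with_comprehension = [x[:-1] if x.endswith(".") else x for x in words]

print(with_comprehension)              # ['file', 'skills', 'text', 'a']
print(with_map == with_comprehension)  # True

# rstrip(".") is another option, but it removes every trailing period.
print("wait...".rstrip("."))           # wait
```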
print(text_split)
['a', 'a', 'because', 'begin', 'but', 'conquest', 'contain', 'declared', 'file', 'from', 'have', 'let', 'lines', 'long', 'long', 'may', 'not', 'or', 'python', 'read', 'should', 'skills', 'text', 'text', 'the', 'the', 'us', 'very', 'very', 'worry', 'you', 'you']
Next, create an empty dictionary to hold the words and their counts. Let us call the variable summary.
summary = {}
Using a for loop, we will go through the text_split list. If a word is already in the dictionary, we increase its count by 1; otherwise, we add the word to the dictionary and set its count to 1. For example:
for word in text_split:
    if word in summary:
        summary[word] = summary[word] + 1
    else:
        summary[word] = 1
The expression word in summary checks if the word is already a key in the dictionary.
Here is a shorter way to rewrite the code above using the dictionary's get() method.
summary2 = {}
for word in text_split:
    summary2[word] = summary2.get(word, 0) + 1
summary and summary2 have the same content.
print(summary == summary2)
True
The expression summary2.get(word, 0) + 1 gets the value/count associated with word. If there is nothing there, that is, the word is not yet a key in the dictionary, get() gives back zero (0) as the count. Adding 1 then either sets summary2[word] to 1 or increments the value already contained in summary2[word].
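The standard library also has a ready-made tool for this kind of counting: Counter in the collections module. The sketch below, on a made-up list of words, compares it with the get() loop. It is shown only as an alternative; the rest of this article keeps using plain dictionaries.

```python
from collections import Counter

# Hypothetical sample list of words.
words = ["a", "a", "long", "text", "long", "the"]

# The get() approach from the article.
summary2 = {}
for word in words:
    summary2[word] = summary2.get(word, 0) + 1

# Counter does the same counting in one call.
counts = Counter(words)

print(dict(counts) == summary2)  # True
print(counts["a"])               # 2
print(counts["missing"])         # 0 (a missing word counts as zero)
```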
Finally, print the content of summary. The order of your output may differ from the one shown here, but the words and their counts will be the same.
print(summary)
{'a': 2, 'because': 1, 'begin': 1, 'but': 1, 'conquest': 1, 'contain': 1, 'declared': 1, 'file': 1, 'from': 1, 'have': 1, 'let': 1, 'lines': 1, 'long': 2, 'may': 1, 'not': 1, 'or': 1, 'python': 1, 'read': 1, 'should': 1, 'skills': 1, 'text': 2, 'the': 2, 'us': 1, 'very': 2, 'worry': 1, 'you': 2}
To obtain a much prettier print, use the pp() function in the pprint module (available since Python 3.8). We have not discussed modules, so do not worry about it. Just type it for now.
from pprint import pp
# remember that summary and summary2 have the same content
pp(summary2)
{'a': 2,
'because': 1,
'begin': 1,
'but': 1,
'conquest': 1,
'contain': 1,
'declared': 1,
'file': 1,
'from': 1,
'have': 1,
'let': 1,
'lines': 1,
'long': 2,
'may': 1,
'not': 1,
'or': 1,
'python': 1,
'read': 1,
'should': 1,
'skills': 1,
'text': 2,
'the': 2,
'us': 1,
'very': 2,
'worry': 1,
'you': 2}
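Once the counts are in a dictionary, you may want to see the most frequent words first. One possible sketch uses sorted() with a lambda as the key, applied here to a small made-up dictionary rather than the full summary.

```python
# Hypothetical subset of the word counts.
summary = {"a": 2, "because": 1, "long": 2, "python": 1}

# Sort the (word, count) pairs by count, highest first.
by_count = sorted(summary.items(), key=lambda item: item[1], reverse=True)

print(by_count)
# [('a', 2), ('long', 2), ('because', 1), ('python', 1)]
```

Words with the same count keep the order they had in the dictionary.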
Now, here is the full program.
# This could come from a file
text = "A very long text declared or read from a file. \
The text may contain very long lines but you should not worry because you have python skills. " \
"Let us begin the conquest"
# Convert to lower case
text = text.lower()
# Create a list of words
text_split = text.split()
# Sort in alphabetical order
text_split.sort()
# Get a set of unique words
unique_words = set(text_split)
print(unique_words)
# Go through each word and remove any period (.)
text_split = map(lambda x: x[: -1] if x.endswith(".") else x, text_split)
# Convert the map output back to a list
text_split = list(text_split)
print(text_split)
summary = {}
for word in text_split:
    if word in summary:
        summary[word] = summary[word] + 1
    else:
        summary[word] = 1
summary2 = {}
for word in text_split:
    summary2[word] = summary2.get(word, 0) + 1
print(summary == summary2)
# print(summary)
from pprint import pp
pp(summary2)
Conclusion
In this article, we created a word-counter program. If we had read the text from a file, we could also report the number of lines and the size of the file in bytes, just like the wc command in a Linux shell. In the next article, we will deal with functions. Thanks for reading.
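As a rough idea of what those wc-style numbers could look like, here is a minimal sketch. The text below stands in for the content of a file, and the byte count assumes the file is encoded as UTF-8; both are assumptions made for illustration.

```python
# Hypothetical file content, ending each line with a newline.
text = "Let us begin\nthe conquest\n"

line_count = text.count("\n")           # wc -l counts newline characters
word_count = len(text.split())          # words separated by whitespace
byte_count = len(text.encode("utf-8"))  # size in bytes, assuming UTF-8

print(line_count, word_count, byte_count)  # 2 5 26
```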