Introduction to File Processing in Python Programming

 
Python Programming, 4/e
 
 
Python Programming:
An Introduction To
Computer Science
 
Chapter 10
Persistent Data
Objectives
 
To understand basic file-processing concepts and
techniques for opening, reading, and writing files in
Python.
To understand the structure of text files and be able
to write programs that use them.
To become familiar with the basic organization of file
systems, including the role that absolute and relative paths
play in locating files, and be able to write Python
programs that process collections of files.
Objectives
 
To understand binary data and the bytes data type
and be able to create programs that store and load
Python objects from files using the pickle module.
To recognize the similarity between working with local
files and working with network resources.
Text Files
 
In all of the examples so far, data has either been
embedded in the program code or entered by the user
when the program runs.
We lack a mechanism for entering data and having
that data persist from one run of the program to the
next.
Text Files
 
Persistent data is a critical component of any modern
computing system.
Your word processor needs to save the paper you’re
working on.
Your programming environment needs to be able to save
and reload your Python code.
Typically, such information is stored in files.
Text Files
 
A file is a sequence of data that is stored in secondary
memory (usually on a disk drive of some sort).
Files can contain any data type, but the easiest files to
work with are those that contain text.
Files of text have the advantage that they can be read and
understood by humans, and they are easily created and
edited using general purpose text editors, like IDLE.
Multi-line Strings
 
You can think of a text file as a (possibly long) string
that happens to be stored on disk.
A special character or sequence of characters is used
to mark the end of each line.
While this convention varies by operating system, Python
takes care of these different conventions for us and just
uses the regular newline character (\n).
Multi-line Strings
 
Hello
World
 
Goodbye 32
When stored to a file, you get this:
Hello\nWorld\n\nGoodbye 32\n
Notice that the blank line becomes a bare newline.
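The correspondence between the displayed lines and the stored string can be checked with a short sketch. The variable names text and lines are only illustrative; repr is used because it produces the same representation the shell echoes.

```python
text = "Hello\nWorld\n\nGoodbye 32\n"

# print interprets each \n as a line break
print(text)

# repr leaves the newline escapes visible, as the shell would show them
print(repr(text))

# splitting on line boundaries recovers the individual lines;
# the blank line shows up as an empty string
lines = text.splitlines()
print(lines)
```

The empty string in the middle of the list is the blank line, which was stored as a bare newline.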
Multi-line Strings
 
This is no different than when we embed newline
characters into output strings to produce multiple lines
of output with a single print statement.
print("Hello\nWorld\n\nGoodbye 32\n")
Remember, if you simply evaluate a string containing
newline characters in the shell, you will just get the
embedded newline representation back.
"Hello\nWorld\n\nGoodbye 32\n"
File Processing Outline
 
Virtually all programming languages share certain
underlying file manipulation concepts.
We need some way to associate a file on disk with an
object in a program – this is called opening a file.
We need a set of operations that can manipulate the file
object.
At the very least, we need to be able to read the information from
a file and to write new information to a file.
Lastly, when we are done, we need to close the file.
File Processing Outline
 
This idea of opening and closing files is closely related
to how you might work with files in an application
program such as IDLE.
When you open a file for editing in IDLE, the file is actually
read from disk and stored in RAM.
At this point, the file is closed (in the programming sense).
As you edit the file, you are really making changes to the
data in memory, not the file itself.
Changes will not show up on disk until you “save” it.
File Processing Outline
 
The process of saving a file in IDLE is also a multi-step
process.
The original file on the disk is opened, this time in a mode
that allows it to store information (opened for writing).
Doing this actually erases the old contents of the file!
File writing operations are then used to copy the current
contents of the in-memory file into the new file on disk.
File Processing Outline
 
Working with text files in Python is easy!
Create a file object that corresponds to a file on disk:
<variable> = open(<path>, <mode>)
Here, path is a string that provides the location of the file
on disk.
For a text file, mode is either "r" or "w" depending on
whether the file is intended to be read from or written to.
If the mode is omitted, the file is opened for reading.
File Processing Outline
 
# printfile.py
#   Prints a file to the screen.
 
def main():
    fname = input("Enter a filename: ")
    infile = open(fname, "r")
    data = infile.read()
    infile.close()
    print(data)
File Processing Outline
 
The program first prompts the user for a file name
and then opens the file for reading through the
variable infile.
While any identifier works, here the name serves to remind
us that the object is a file and it is being used for input.
The entire contents of the file are then read as one
multi-line string and stored in the variable data.
Printing data causes the file contents to be displayed.
File Processing Outline
 
This process illustrates the basic three-step process
for working with a file:
1. Open the file.
2. Use file operations to read or write data.
3. Close the file.
Any file that is opened should be closed when the
program is done using it. Technically, all files get
closed when the program terminates, but doing it
explicitly is good programming style.
File Processing Outline
 
In order to make sure that necessary actions such as
closing a file occur, Python has a powerful feature
called a context manager.
# printfile2.py
#   Prints a file to the screen.
 
def main():
    fname = input("Enter a filename: ")
    with open(fname, "r") as infile:
        data = infile.read()
    print(data)
File Processing Outline
 
The with statement associates the variable with the
file object created by open.
The file object acts as a context manager for
executing the instructions in the indented body of the
with.
When the body has completed, the file will be closed
automatically, even if control leaves the body due to
an exception or return statement.
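The automatic closing can be observed through the closed attribute of a file object. The following is a minimal sketch, assuming a throwaway file named sample.txt created in a temporary directory; the file name and the simulated error are illustrative only.

```python
import os
import tempfile

# Create a small throwaway file to read (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as outfile:
    print("Hello", file=outfile)

# Normal exit from the body: the file is closed afterwards.
with open(path, "r") as infile:
    data = infile.read()
closed_after_normal_exit = infile.closed

# Exit via an exception: the file is still closed.
try:
    with open(path, "r") as infile2:
        raise ValueError("simulated failure inside the with body")
except ValueError:
    pass
closed_after_exception = infile2.closed

print(closed_after_normal_exit, closed_after_exception)
```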
Reading from a File
 
read is just one of several options that can be used to
access the contents of a file.
<file>.read() – Returns the entire remaining contents of
the file as a single (potentially large, multi-line) string.
<file>.readline() – Returns the next line of the file, i.e.
all text up to and including the newline character.
<file>.readlines() – Returns a list of the remaining lines
in the file. Each list item is a string of a single line including
the newline character at the end.
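The three operations can be compared side by side in a short sketch. It assumes a small temporary file (lines.txt is a made-up name) and shows that each call picks up where the previous one stopped.

```python
import os
import tempfile

# Build a small sample file to read from (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as outfile:
    outfile.write("Hello\nWorld\n\nGoodbye 32\n")

with open(path, "r") as infile:
    first = infile.readline()     # the next line, newline included
    rest = infile.readlines()     # a list of the remaining lines
    leftover = infile.read()      # nothing remains, so this is ""

with open(path, "r") as infile:
    everything = infile.read()    # the whole file as one string

print(repr(first), rest, repr(leftover))
```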
Reading from a File
 
Text files are read sequentially – the system keeps
track of what has been read since a file has been
opened, so that a later read will pick up where the
previous one left off.
If you want to read a previous line, the simplest approach
is to close and reopen the file.
Reading from a File
 
Successive calls to readline() read successive lines
from the file.
The string returned by readline() normally ends
with a newline character (the last line of a file may lack
one, and readline() returns the empty string at the end of the file).
Use slicing to strip off the newline character at the
end of the line; otherwise the printed output will look double-spaced.
Or, you could tell print not to add its own
newline, e.g. print(line, end="").
Reading from a File
 
with open(someFile, "r") as infile:
    for _ in range(5):
        line = infile.readline()
        print(line[:-1])
Reading from a File
 
One way to loop through the entire contents of a file
is to read in all of the file using readlines, then loop
through the resulting list.
with open(someFile, "r") as infile:
    for line in infile.readlines():
        pass  # process the line here
What happens if the file is too large to fit in your
computer’s memory?
Reading from a File
 
Python treats a file as a sequence of lines, so looping
through the lines can be done directly:
 
with open(someFile, "r") as infile:
    for line in infile:
        pass  # process the line here
Reading from a File
 
Let’s improve our statistics library from last chapter.
One disadvantage of the previous version is that
getNumbers() gets numbers from the user
interactively.
What if you are trying to average one hundred
numbers and you make a mistake on number 98?
Doh! You’d need to start over again.
Reading from a File
 
A better approach – type all the numbers into a file. We
can then edit the data before sending it to the program.
This file-oriented approach is typically used for data-processing
applications.
We can improve the usefulness of our library by adding a
getNumbersFromFile function that takes the name of a file
as a parameter and returns a list of numbers read from the
file.
Reading from a File
 
Suppose our numbers are in a text file, with each line
containing a single number.
def getNumbersFromFile(fname):
    nums = []
    with open(fname, "r") as infile:
        for line in infile:
            nums.append(float(line))
    return nums
Reading from a File
 
We could also do this more succinctly with a list
comprehension:
def getNumbersFromFile(fname):
    with open(fname, "r") as infile:
        nums = [float(line) for line in infile]
    return nums
Reading from a File
 
Using this approach, we need to be very careful with the
format of the input file – there must be exactly one
number on each line.
A common error is to introduce an extra blank line at the
bottom that may go unnoticed. This would cause an error like:
in <listcomp>
nums = [float(line) for line in infile]
ValueError: could not convert string to float: ''
Reading from a File
 
We could make our function more flexible by having it
accept multiple numbers on the same line.
A single line can easily be turned into a list of numbers
using split in the list comprehension, similar to what we
did when we had multiple numbers on a single line of
interactive input:
nums = [float(num) for num in line.split()]
Reading from a File
 
To get all the numbers across multiple lines, we simply wrap
this up in an accumulator loop that processes the lines of the
input file:
def getNumbersFromFile(fname):
    nums = []
    with open(fname, "r") as infile:
        for line in infile:
            newnums = [float(num) for num in line.split()]
            nums.extend(newnums)
    return nums
Reading from a File
 
Here the accumulator is called nums and the list created
from each line is called newnums.
The final line in the loop body appends the numbers from
the current line to the end of the accumulator using the
list extend method introduced in Chapter 9.
This version of the stats program appears in stats3.py.
Reading from a File
 
Using this approach has several benefits:
It allows you to create a data file with as many numbers on each
line as you want.
The program will also be more robust by handling accidental
blank lines (Do you see how?).
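One way to check the blank-line behavior is to run the function on a file that contains some. The sketch below assumes a temporary data file (nums.txt is a made-up name) and relies on the fact that split returns an empty list for a line holding only whitespace.

```python
import os
import tempfile

def getNumbersFromFile(fname):
    nums = []
    with open(fname, "r") as infile:
        for line in infile:
            # split() on a blank line returns [], so nothing is added
            nums.extend(float(num) for num in line.split())
    return nums

# Sample data with several numbers on one line and two blank lines.
path = os.path.join(tempfile.mkdtemp(), "nums.txt")
with open(path, "w") as outfile:
    outfile.write("1 2 3\n\n4.5\n\n")

nums = getNumbersFromFile(path)
blank_pieces = "\n".split()

print(nums)
print(blank_pieces)
```

Because a blank line contributes an empty list, extending the accumulator with it changes nothing, and float is never called on an empty string.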
Writing to a File
 
Opening a file for writing prepares that file to receive data.
If no file with the given name exists, a new file will be
created.
If a file with the given name does exist, Python will
delete it and create a new, empty file.
with open("mydata.out", "w") as outfile:
    pass  # do things with outfile here
Writing to a File
 
The easiest way to write information into a text file is to
use the print function.
To do this, simply add an extra keyword parameter that
specifies the file:
print(..., file=<outputfile>)
This behaves exactly like a normal print, except the result
is sent to outputfile rather than the screen.
Writing to a File
 
Here’s a program to create a text file with a haiku about
programming:
# haiku.py
def main():
    haiku = ["White space and syntax",
             "Python code flows like water",
             "Solutions emerge"]
    print("I have a haiku for you.")
    fname = input("Enter a file name to receive the haiku: ")
    with open(fname, "w") as haikufile:
        for line in haiku:
            print(line, file=haikufile)
    print(f"Look in {fname} to see your haiku")
Batch Processing
 
To see how these pieces fit together in a larger example,
let’s redo the username generation program from Chapter
8.
Our previous version created usernames interactively by
having the user type in his or her name.
If we were setting up accounts for a large number of users,
this process would probably not be done interactively, but
in batch mode, where program input and output are done
through files.
Batch Processing
 
Each line of the input file will contain the first and last
names of a new user separated by one or more spaces.
The program produces an output file containing a line for
each generated username.
Batch Processing
 
# userfile.py
# Program to create a file of usernames in batch mode.
def main():
    print("This program creates a file of usernames from a")
    print("file of names.")
    # get the file names
    infileName = input("What file are the names in? ")
    outfileName = input("What file should the usernames go in? ")
    # open the files
    with open(infileName, "r") as infile, open(outfileName, "w") as outfile:
        # process each line of the input file
        for line in infile:
            # get the first and last names from line
            first, last = line.split()
            # create the username
            uname = (first[0]+last[:7]).lower()
            # write it to the output file
            print(uname, file=outfile)
    print("Usernames have been written to", outfileName)
Batch Processing
 
A couple of things are worth noticing:
Two files are open at the same time, one for input (infile) and
one for output (outfile). This is accomplished in the with by
including two open(…) as <variable> clauses separated by a
comma. It’s not unusual for a program to act on multiple files
simultaneously.
When creating the username, the lower string method was used
to ensure that the username is all lowercase, even if the input
names are mixed case.
File Names and Paths
 
So far in our examples we’ve indicated the file to be
opened by supplying the name of the file as a string.
Using this approach, files end up in the folder where the
programs live.
This might be OK for assignments, but in the real world
we’d like users to be able to select files from anywhere in
secondary memory.
Absolute and Relative Paths
 
Way back in Chapter 1 we looked at how a computer’s
operating system generally organizes secondary memory
as a hierarchical collection of directories (also called
folders) that can contain files as well as other directories.
The directory at the top of this hierarchy is called the root
directory.
A file is located by specifying a path from the root
directory down through the hierarchy of directories.
Absolute and Relative Paths
 
E.g., the text of this chapter is in a file having the path
/home/zelle/Books/cs1book/cs1book4e/textbook/chapter10.tex
The top-level directory on Dr. Zelle’s computer is designated
with a /. His computer’s root directory contains around 20
subdirectories, including one called home.
A slash (/) is also used to separate the directory names along
the path.
Absolute and Relative Paths
 
You can think of the path from the root as representing
the “full name” of any given file.
The name has to be so complex because a typical
computer contains millions of files; there must be a way to
uniquely identify each of these files.
This complete path to a given directory or file is called the
absolute path.
Anywhere in Python where a file path is needed, an
absolute path can be used.
Absolute and Relative Paths
 
Working with absolute paths can be a pain!
They’re long.
Moving a file or folder changes the absolute paths of files and
folders!
Any path that begins with something other than the root
directory is considered a relative path.
Absolute and Relative Paths
 
When we just used the name of a file in our examples,
those were relative paths.
A running program always has an associated working
directory, which is the directory that it is currently working
in.
Typically, this is the directory where your program file is
located.
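A program can ask what its working directory is. The sketch below uses the standard os module, which these slides have not introduced, and a made-up file name nums.txt; it only builds the path that a relative name would refer to and does not open anything.

```python
import os

cwd = os.getcwd()                          # the current working directory
absolute = os.path.join(cwd, "nums.txt")   # where open("nums.txt") would look

print("Working directory:", cwd)
print("nums.txt refers to:", absolute)
```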
Absolute and Relative Paths
 
Suppose we have a program data_analyzer.py stored in
/home/zelle/python.
When this program is run, its working directory will be
/home/zelle/python.
path = input("What file should I analyze? ")
with open(path, "r") as infile:
    pass  # process the file
If the user enters nums.txt, the program will look for
/home/zelle/python/nums.txt.
Absolute and Relative Paths
 
Suppose the user instead enters data/nums.txt.
Python will treat this as a path starting at the current
working directory: /home/zelle/python/data/nums.txt.
The characters “.” and “..” have special meanings for
relative paths.
“.” indicates the current working directory.
“..” indicates the parent of the current working directory.
In our previous example, an equivalent would be
./data/nums.txt, whereas ../data/nums.txt would refer to
/home/zelle/data/nums.txt.
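The effect of “.” and “..” can be worked out with the standard posixpath module, which handles Linux-style paths on any operating system. This is a minimal sketch: the helper locate is hypothetical, and normpath simplifies the path purely as text without consulting the file system.

```python
import posixpath

cwd = "/home/zelle/python"   # working directory from the example

def locate(relative):
    # Join onto the working directory, then collapse "." and ".."
    return posixpath.normpath(posixpath.join(cwd, relative))

plain = locate("data/nums.txt")
dot = locate("./data/nums.txt")
dotdot = locate("../data/nums.txt")

print(plain)
print(dot)
print(dotdot)
```

The first two results are the same file inside the working directory, while the “..” version climbs to the parent directory before descending into data.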
Absolute and Relative Paths
 
Dr. Zelle’s laptop is running Linux. While the ideas are the
same, the details differ among operating systems.
On macOS, a user’s home directory is in /Users.
/Users/zelle/data/nums.txt
On Windows, the path notation is a little different.
C:\Users\zelle\data\nums.txt
Each hard drive (C:, D:) has its own file system with its own root
directory.
Windows uses \ rather than / in paths.
Absolute and Relative Paths
 
Python always allows paths to be separated using a
regular slash (/) on any OS for interoperability.
It’s best practice to avoid “\” in Windows paths in Python
since the backslash is used in string literals to indicate
special characters, e.g. \t, \n. To use an actual backslash
in a literal, you’d need to escape it (\\) or prefix the string
with r to indicate it is a “raw” string (one in which backslashes
are not interpreted).
Absolute and Relative Paths
 
Three ways to open the same file in Windows
with open("data/nums.txt") as infile:  # generic Python
                                       # notation
with open("data\\nums.txt") as infile: # Windows notation
                                       # using special char
with open(r"data\nums.txt") as infile: # Windows notation
                                       # using raw string
The best one? Number one – it will work on other operating
systems besides Windows.
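The difference between the notations is easy to verify by comparing the strings themselves. In this sketch the variable names are illustrative, and the third string shows what happens when the backslash is neither escaped nor placed in a raw string.

```python
escaped = "data\\nums.txt"    # backslash escaped
raw = r"data\nums.txt"        # raw string: backslash kept as-is
mistake = "data\nums.txt"     # \n here is a newline, not a separator

print(escaped == raw)         # the two spellings give the same string
print(len(raw), len(mistake))
print(repr(mistake))
```

The mistaken version is one character shorter because the backslash and the n collapse into a single newline character.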
Using pathlib
 
Files are a ubiquitous part of the computing landscape, and
just about every program has to manipulate them in one
way or another.
Python provides a library called pathlib to help with some
of the common but tedious tasks.
The main tool is the Path object. Path is a sort of
“wrapper” around a path string that gives it some
convenient superpowers.
Using pathlib
 
Let’s improve our batch-oriented username program so
that it checks if the intended output file exists. If it does,
create a backup of that file so that the contents aren’t lost
when the new usernames are written.
Using pathlib
 
# userfile2.py
from pathlib import Path
 
def main():
    print("This program creates a file of usernames from a")
    print("file of names.")
    # get the file names
    inPath = Path(input("What file are the names in? "))
    outPath = Path(input("What file should the usernames go in? "))
    # backup the output file if it already exists
    if outPath.exists():
        backupPath = outPath.with_suffix(".bak")
        print(f"Renaming existing {outPath.name} to {backupPath.name}")
        outPath.rename(backupPath)
 
    # open the files
    with open(inPath, "r") as infile, open(outPath, "w") as outfile:
        # process each line of the input file
        for line in infile:
            # get the first and last names from line
            first, last = line.split()
            # create the username
            uname = (first[0]+last[:7]).lower()
            # write it to the output file
            print(uname, file=outfile)
    print("Usernames have been written to", outPath)
Using pathlib
 
You can extract different parts of a path using simple
attributes from a Path object.
>>> path = Path("/home/zelle/python/data.txt")
>>> path.name
    'data.txt'
>>> path.stem
    'data'
>>> path.suffix
    '.txt'
Using pathlib
 
We can create a slightly modified path by using
with_<part> methods to replace specific parts in an
existing path.
backupPath = outPath.with_suffix(".bak")
This creates a new Path that is just like outPath, except it has
the extension (suffix) “.bak” instead of its original extension.
Our program’s output will look something like
Renaming existing usernames.txt to usernames.bak
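The with_<part> methods only compute new paths and do not touch any files, so they can be explored safely. This sketch assumes PurePosixPath so that the results look the same on every operating system; the directory and the name accounts.txt are made up for illustration.

```python
from pathlib import PurePosixPath

# A hypothetical location for the output file.
outPath = PurePosixPath("/home/zelle/python/usernames.txt")

backupPath = outPath.with_suffix(".bak")      # swap the extension
renamed = outPath.with_name("accounts.txt")   # swap the whole file name

print(backupPath)
print(renamed)
print(outPath)   # the original path object is unchanged
```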
Using pathlib
 
The actual renaming of the file happens with
outPath.rename(backupPath)
The rename method is one of a number of Path object
methods that can be used to make changes in the
underlying file system.
The necessary commands differ by operating system, but
the Path object handles the differences in a transparent
way!
Iterating over Directories
 
Another task that programs often need to do is to process
a whole batch of files at a time.
For example, a photo management app might allow the
user to load all the images in a given directory.
If you have a Path object that points to a directory on
your hard disk, there are a couple of methods that allow you
to loop over the contents of that directory.
Iterating over Directories
 
The simplest of these methods is iterdir.
It produces a sequence of Path objects, one for each file
or directory contained in the original directory.
>>> path = Path(".")
>>> for p in path.iterdir():
            print(p)
names.txt
stats3.py
Iterating over Directories
 
>>> list(path.iterdir())
    [PosixPath('names.txt'), PosixPath('test.txt'),
     PosixPath('stats3.py'), PosixPath('data'),
     PosixPath('nums1.txt'), PosixPath('usernames.bak'),
     PosixPath('nums2.txt'), PosixPath('usernames.txt'),
     PosixPath('userfile2.py'), PosixPath('userfile.py'),
     PosixPath('haiku.py')]
Iterating over Directories
 
Notice that each item in the sequence produced by
iterdir() is itself a Path object.
This means we can make use of the various Path methods
on these items.
The is_file method returns True if the path is a file (as
opposed to a directory).
files = [p for p in path.iterdir() if p.is_file()]
Iterating over Directories
 
If we wanted just the Python program files, we could grab just
the items that had a .py suffix.
python_files = [p for p in path.iterdir() if p.suffix == ".py"]
This last example could have been handled more simply using a
technique known as file globbing.
You can select a subset of files that match a pattern using the
glob method:
path.glob(pattern)
Iterating over Directories
 
The pattern looks like a regular path string except that it
can contain certain “wildcard” characters.
“?” matches any single character
“*” matches any sequence of characters
python_files = list(path.glob("*.py"))
The glob "*.py" will match any file that ends with .py.
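Both wildcards can be tried out on a scratch directory. The sketch below creates a few empty files in a temporary directory (the names echo the listing shown earlier, but the directory itself is an assumption) and sorts the matches, since glob makes no promise about order.

```python
import tempfile
from pathlib import Path

# Build a scratch directory holding a few empty files.
base = Path(tempfile.mkdtemp())
for name in ["haiku.py", "stats3.py", "nums1.txt", "nums2.txt", "names.txt"]:
    (base / name).touch()

# "*" matches any sequence of characters
python_files = sorted(p.name for p in base.glob("*.py"))

# "?" matches exactly one character
numbered = sorted(p.name for p in base.glob("nums?.txt"))

print(python_files)
print(numbered)
```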
Iterating over Directories
 
Our last addition was a getNumbersFromFile(path)
function that can be used to get a data set from a specific
file.
Suppose we have a number of data sets, each stored in a
separate file in our data directory.
It would be handy to have a getNumbersFromFiles
function making use of file globbing to accumulate all the
data across the set of files.
Iterating over Directories
 
Let’s write a function with two parameters.
basedir gives the directory containing the data.
pattern is a pattern for which files to look in.
To get the numbers from all the files in a data directory, we could
do
data = getNumbersFromFiles("data", "*")
To get data from all files having “exam” in the name,
data = getNumbersFromFiles("data", "*exam*")
To write this you need an accumulator to build a list of all the
numbers.
Iterating over Directories
 
def getNumbersFromFiles(basedir, pattern):
   path = Path(basedir)
   nums = []
   for filepath in path.glob(pattern):
      newnums = getNumbersFromFile(filepath)
      nums.extend(newnums)
   return nums
Iterating over Directories
 
Notice how basedir was turned into a Path object at the
start – that ensures that you can call glob in the loop heading.
This function will work when basedir is passed as either a
string or a Path object.
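The two functions can be exercised together on a scratch directory. This is a sketch under stated assumptions: the data files and their contents are invented, and the call to sorted is an addition made here so that the order of the result is predictable (glob itself does not guarantee an order).

```python
import tempfile
from pathlib import Path

def getNumbersFromFile(fname):
    nums = []
    with open(fname, "r") as infile:
        for line in infile:
            nums.extend(float(num) for num in line.split())
    return nums

def getNumbersFromFiles(basedir, pattern):
    path = Path(basedir)   # accepts either a string or a Path
    nums = []
    # sorted() is an addition here so the result order is predictable
    for filepath in sorted(path.glob(pattern)):
        nums.extend(getNumbersFromFile(filepath))
    return nums

# Hypothetical data sets in a temporary directory.
datadir = Path(tempfile.mkdtemp())
(datadir / "exam1.txt").write_text("90 85\n")
(datadir / "exam2.txt").write_text("70\n")
(datadir / "quiz1.txt").write_text("10\n")

from_string = getNumbersFromFiles(str(datadir), "*exam*")
from_path = getNumbersFromFiles(datadir, "*exam*")
all_nums = getNumbersFromFiles(datadir, "*")

print(from_string)
print(all_nums)
```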
File Dialogs
 
Some operating systems (e.g. Windows and macOS) will, by
default, show only the main stem of the filename and
not the type suffix, making it hard to know the full
filename for performing file operations.
This situation is even more complicated when the file
exists somewhere other than the current working
directory. In order to operate on these far-flung files, we
need the complete path to them! Do you know how to find
the complete path to an arbitrary file on your computer?
File Dialogs
 
One solution to this problem is to allow users to browse
the file system visually and navigate their way to a particular
file or directory.
The usual technique incorporates a dialog box that allows
a user to click around in the file system and either select
or type in the name of a file.
Fortunately for us, the tkinter GUI library included with
(most) standard Pythons has these kinds of functions!
File Dialogs
 
To ask the user for the name of a file to open, you can use
the askopenfilename function found in the
tkinter.filedialog module.
from tkinter.filedialog import askopenfilename
The reason for the dot notation is that tkinter is a package
composed of multiple modules.
To get the name of the user names file:
infileName = askopenfilename()
 
File Dialogs
 
The dialog box allows the user to either type in the name
of the file or to simply select it with the mouse.
When the user clicks the “Open” button, the complete path
name of the file is returned as a string and saved into the
variable infileName.
If the user clicks the “Cancel” button, the function will
simply return the empty string, "".
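A program should usually check for the cancelled case before trying to open anything. The sketch below is one possible arrangement, not part of tkinter: chooseInputFile and its ask parameter are hypothetical, and the parameter exists only so the logic can be tried without opening a real dialog.

```python
def chooseInputFile(ask=None):
    """Return the chosen path, or None if the user cancels the dialog."""
    if ask is None:
        # Import lazily so this module loads even where tkinter is missing.
        from tkinter.filedialog import askopenfilename
        ask = askopenfilename
    name = ask()
    # A cancelled dialog yields an empty (falsy) result.
    return name if name else None

# Simulated dialogs standing in for askopenfilename (illustrative only).
cancelled = chooseInputFile(lambda: "")
chosen = chooseInputFile(lambda: "/home/zelle/python/names.txt")

print(cancelled)
print(chosen)
```

Calling chooseInputFile() with no argument would display the real dialog.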
File Dialogs
 
from tkinter.filedialog import asksaveasfilename
...
outfileName = asksaveasfilename()
You could, of course, import both at once:
from tkinter.filedialog import askopenfilename, asksaveasfilename
 
File Dialogs
 
If you need to get a directory path from the user, there’s
also an askdirectory function.
All these functions have numerous optional parameters
that allow a program to customize the resulting dialogs.
Binary Files and Pickling
 
Files can store any kind of data, even though we’ve
focused on string data so far.
Files on disk are really just a sequence of bytes, so
arbitrary data can be encoded into the bytes stored in a
particular file.
You undoubtedly have files on your computer that store
images, audio, video, etc.
Strings and Bytes
 
There is a close correspondence between characters of a
string and bytes.
Before Unicode, each character in a string was treated as a
single byte of data.
When a string that contains only characters from the
original ASCII alphabet is encoded as bytes, each
character is stored as a single byte.
Strings and Bytes
 
>>> s = "Hello, Bytes!"
>>> b = s.encode()
>>> type(b)
    <class 'bytes'>
Here, we created a string, s, then encoded it into bytes,
storing it into variable b.
Strings and Bytes
 
A byte is 8 bits, which means there are 256 different byte
values.
Typically, bytes are stored as unsigned integers in the
range 0-255, inclusive.
>>> b[0]
    72
>>> b[1]
    101
Strings and Bytes
 
The first byte of b is 72, because that is the Unicode value
of “H”. In other words, it is ord("H").
>>> len(s)
    13
>>> len(b)
    13
>>> b
    b'Hello, Bytes!'
Strings and Bytes
 
s has 13 characters, b has 13 bytes.
The last line shows a string literal prefaced with b (for
bytes), which is a compact way of showing the byte
sequence, exploiting the standard ASCII mapping of byte
values to characters.
What if our string contains non-ASCII characters?
Let’s concatenate some Unicode characters with values
greater than 127 to our string.
Strings and Bytes
 
>>> sx = s + chr(128) + chr(256) + chr(512) + chr(1024)
>>> bx = sx.encode()
>>> len(sx)
    17
>>> len(bx)
    21
>>> bx
    b'Hello, Bytes!\xc2\x80\xc4\x80\xc8\x80\xd0\x80'
Strings and Bytes
 
We added four characters, so the length of the string is
now 17 (characters).
The encoding of the string, though, is now 21 bytes. The
non-ASCII characters were each encoded into a pair of bytes,
and are displayed in hexadecimal (base 16) notation.
We can also convert a bytes object back into a string.
>>> b.decode()
    'Hello, Bytes!'
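The round trip between strings and bytes can be gathered into one short script. The accented example string is an addition chosen for illustration; encode and decode are called without arguments, which uses UTF-8.

```python
s = "Hello, Bytes!"
b = s.encode()            # str -> bytes (UTF-8 by default)
round_trip = b.decode()   # bytes -> str

# Hypothetical example with one non-ASCII character (e with acute accent).
accented = "caf\u00e9"
encoded = accented.encode()

print(b[0], ord("H"))               # the first byte matches ord("H")
print(len(accented), len(encoded))  # 4 characters, but 5 bytes
print(encoded.decode() == accented)
```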
Strings and Bytes
 
In fact, when we work with a text file in Python, this is
exactly what’s happening behind the scenes!
When reading from a file, Python reads in a sequence of
bytes from the file and decodes them into a string.
To write to a text file, Python encodes the string as a
sequence of bytes and streams the bytes into the file.
Binary Mode and Pickling
 
Python also allows byte-level access to files.
We can read and write data as sequences of bytes rather
than strings.
Let’s assume the haiku we wrote earlier is stored in the file
haiku_out.txt.
Binary Mode and Pickling
 
>>> with open("haiku_out.txt", "r") as infile:
       data = infile.read()
       print(data)
White space and syntax
Python code flows like water
Solutions emerge
Binary Mode and Pickling
 
To treat the file as a sequence of bytes instead of text, we
just append a "b" (for binary) to the mode string when
opening the file.
Notice the difference in our next interaction!
Using the mode "rb" opens the file for reading in binary
mode.
Reading the file in this mode gets back a bytes object
instead of a string.
Binary Mode and Pickling
 
>>> with open("haiku_out.txt", "rb") as infile:
    data = infile.read()
    print(data)
b'White space and syntax\nPython code flows like water\nSolutions emerge\n'
Binary Mode and Pickling
 
If we want a string back, we must explicitly decode it.
>>> with open("haiku_out.txt", "rb") as infile:
    data = infile.read()
    print(data.decode())
White space and syntax
Python code flows like water
Solutions emerge
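To see that text mode and binary mode are two views of the same bytes, here is a small self-contained sketch. It writes the haiku into a temporary folder first so that it runs anywhere; the temporary path, the explicit encoding="utf-8", and the names as_text and as_bytes are assumptions made for the example, not part of the original program.

```python
import os
import tempfile

haiku = "White space and syntax\nPython code flows like water\nSolutions emerge\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "haiku_out.txt")

    # newline="" turns off newline translation so the bytes match on every OS.
    with open(path, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(haiku)

    # Text mode: Python decodes the bytes for us.
    with open(path, "r", encoding="utf-8") as infile:
        as_text = infile.read()

    # Binary mode: we get the raw bytes and decode them ourselves.
    with open(path, "rb") as infile:
        as_bytes = infile.read()

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes.decode("utf-8") == as_text)
```

Decoding the bytes by hand gives the same string that text mode returns, which is the "behind the scenes" step made visible.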
Binary Mode and Pickling
 
We can also open a file for binary writing using the mode
‘wb’.
To write to a file in this mode, we must write bytes, not
strings.
with open("bytes.out", "wb") as outfile:
    outfile.write(b"Hello, Bytes!")
Notice we didn’t use print, since print turns its
arguments into strings.
Binary Mode and Pickling
 
To output bytes to a file, use the file method write.
The binary mode is really for manipulating non-text data.
Doing so requires some sort of binary encoding to
represent the data as a raw sequence of bytes.
Usually, we can use existing libraries that handle whatever
specialized data format we need.
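As a rough illustration of the "bytes, not strings" rule, the sketch below writes to a file opened in ‘wb’ mode and then deliberately tries to write a plain string. The temporary folder and the names contents and wrote_str are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "bytes.out")

    with open(path, "wb") as outfile:
        # A str must be encoded before it can be written in binary mode.
        outfile.write("Hello, Bytes!".encode("utf-8"))
        # bytes([...]) builds raw bytes from integer values in the range 0-255.
        outfile.write(bytes([0, 255, 10]))

        try:
            outfile.write("plain string")   # wrong type, on purpose
            wrote_str = True
        except TypeError:
            wrote_str = False

    with open(path, "rb") as infile:
        contents = infile.read()

print(contents)
print(wrote_str)
```

The attempt to write a str raises TypeError, so only the two bytes values end up in the file.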
Binary Mode and Pickling
 
One standard library module that’s handy for storing binary data is
pickle. The purpose of the module is to preserve your
arbitrary Python objects as a sequence of bytes in a file.
The process of turning an object into a sequence of bytes
is called serialization.
Binary Mode and Pickling
 
Suppose we have created a data set and would like to
save it so that it can be loaded back up again later.
If we quit our program, our list of numbers will be lost
unless we somehow write it to a file!
We could do this with a text file, e.g.
writeNumbersToFile() (left as an exercise for you).
But what’s the fun of that?
Binary Mode and Pickling
 
Let’s have two functions – one that serializes the list into a
binary file and another that reads it back in again.
import pickle
def storeData(nums, path):
    with open(path, "wb") as outfile:
        pickle.dump(nums, outfile)
Binary Mode and Pickling
 
In this function, nums is the list of numbers that we want
to save and path is the path string (or Path object) for the
file to save the list into.
Our list is pickled for storage and later consumption with
no loops or futzing around, thanks to the dump function.
Binary Mode and Pickling
 
Python uses its own binary format to do the serialization.
To load the list back in again will require another use of
pickle.
The inverse of dump is load.
All we need to do is open up the file for reading in binary
mode and call pickle.load.
Python will read in the bytes and decode them back into
whatever was pickled in the first place.
Binary Mode and Pickling
 
def loadData(path):
    with open(path, "rb") as infile:
        nums = pickle.load(infile)
    return nums
 
>>> storeData([3, 1, 4, 1, 5, 9], "test.pkl")
>>> nums = loadData("test.pkl")
>>> nums
[3, 1, 4, 1, 5, 9]
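The same pair of functions is not limited to lists of numbers. The sketch below repeats storeData and loadData with a generic parameter name and round-trips a small dictionary; the dictionary contents, the file name state.pkl, and the temporary folder are all hypothetical.

```python
import os
import pickle
import tempfile


def storeData(obj, path):
    """Pickle obj into the file at path (same shape as the slide's storeData)."""
    with open(path, "wb") as outfile:
        pickle.dump(obj, outfile)


def loadData(path):
    """Read back whatever object was pickled into the file at path."""
    with open(path, "rb") as infile:
        return pickle.load(infile)


# Hypothetical game state mixing several built-in types.
state = {"player": "Ada", "level": 3, "position": (10, 20), "items": ["key", "map"]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "state.pkl")
    storeData(state, path)
    restored = loadData(path)

print(restored == state)           # same contents
print(restored is state)           # a separate object rebuilt from the bytes
print(type(restored["position"]))  # the tuple comes back as a tuple
```

Loading builds a new object with equal contents, so nested structure and types such as the tuple survive the trip through the file.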
Binary Mode and Pickling
 
You can use pickle to save the state of a game so that
users can pick up where they left off, or your AI
application might serialize a trained neural network so that
you can distribute it to thousands of users.
Binary Mode and Pickling
 
But there are some downsides:
The resulting file is binary and so it is not in a human readable
format. In many cases (like configuration files) it would be a
better idea to keep it human readable.
While pickle works for lots of objects, including Python’s built-in
data types such as numbers, strings, lists, and dictionaries, it
won’t work for all object types.
The process of loading a pickle file could cause the execution of
arbitrary (and potentially nefarious) Python code. Never load a pickle
from an untrusted source!
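The second downside is easy to see in a short experiment. The helper try_pickle below is a made-up name; it uses pickle.dumps, which returns the pickled bytes instead of writing them to a file, and a generator stands in as one example of an object that pickle refuses.

```python
import pickle


def try_pickle(obj):
    """Return True if obj can be pickled, False if pickle refuses it."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


numbers = [3, 1, 4, 1, 5, 9]
squares = (n * n for n in numbers)   # a generator: a paused computation, not plain data

print(try_pickle(numbers))
print(try_pickle(squares))
print(try_pickle(list(squares)))     # converting to a list makes it plain data again
```

When an object cannot be pickled, one common workaround is to convert it into plain data (here, a list) before storing it.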
Remote Files
 
A lot of the data that our programs might use is not stored
on the local computer, but is accessed over the Internet.
Sometimes this is referred to as storing data “in the
cloud.”
The supporting web site for this textbook has all the code
and data files from the book. You can locate those files by
typing the Uniform Resource Locator (URL) into your favorite
browser.
Remote Files
 
https://mcsp.wartburg.edu/zelle/python/ppics4/code/chapter10/nums2.txt
Assuming you have an Internet connection, this will direct
your OS to send a request to another computer asking for
the specified data.
You’ll notice that this looks like a path…
Remote Files
 
You could use your browser to save this data to your
computer, but wouldn’t it be more convenient if we had a
program fetch the data directly off the web for us?
Let’s add one more data fetching function to our statistics
library.
Python provides a function that allows us to open a remote
file in a fashion analogous to opening a file on the local
computer.
Remote Files
 
from urllib.request import urlopen
def getNumbersFromURL(url):
    nums = []
    with urlopen(url) as infile:
        for line in infile:
            line = line.decode()
            newnums = [float(x) for x in line.split()]
            nums.extend(newnums)
    return nums
Remote Files
 
There are really only two slight changes from getNumbersFromFile.
Instead of using the standard open function, it uses urlopen,
which is imported from the module urllib.request.
The urlopen function sends out a network request for the given URL and
provides a file-like object from which we can read the data coming back
over the network.
This object acts like a file that has been opened in ‘rb’ mode since the
URL may not point to textual data.
Remote Files
 
After opening the URL, we loop over the resulting data line by line.
Since this is binary data, each line is initially a bytes object.
The first line in the loop body decodes it into a string so that we can
then turn the string into a list of numbers, newnums, and accumulate those
numbers into the complete list, nums.
>>> data = getNumbersFromURL("https://mcsp.wartburg.edu/zelle/python ... ")
>>> data
[26.0, 53.0, 5.0, 89.0, 79.0, 32.0, 38.0, 46.0]
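The decode step can be tried without a network connection. In the sketch below, io.BytesIO stands in for the object that urlopen returns, since both yield lines as bytes when looped over; the function name getNumbersFromBinary and the sample payload are hypothetical.

```python
import io


def getNumbersFromBinary(infile):
    """Collect floats from any file-like object that yields lines of bytes."""
    nums = []
    for line in infile:
        line = line.decode()                      # bytes -> str, as in getNumbersFromURL
        nums.extend(float(x) for x in line.split())
    return nums


# Hypothetical payload standing in for data arriving over the network.
fake_response = io.BytesIO(b"26 53 5\n\n89 79\n")

# Each line read from a binary source is a bytes object, not a str.
first_line = io.BytesIO(b"26 53 5\n").readline()
print(type(first_line).__name__)

data = getNumbersFromBinary(fake_response)
print(data)
```

Because the function only relies on looping over lines of bytes, the same body would work with a file opened in ‘rb’ mode or with the object returned by urlopen. The blank line in the payload contributes nothing, since split returns an empty list for it.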
Uploaded on Oct 07, 2024 | 1 Views

