Text files
Posted on March 11, 2015 in Uncategorized by Eric Lease Morgan
While a rose is a rose is a rose, a text file is not a text file is not a text file.
For better or for worse, we here in our text analysis workshop are dealing with three different computer operating systems: Windows, Macintosh, and Linux. Text mining requires the subject of its analysis to be in the form of plain text files. [1] But there is a subtle difference between the ways each of our operating systems expects to deal with “lines” in that text. Let me explain.
Imagine a classic typewriter. A cylinder (called a “platen”) fit into a “carriage” designed to move back & forth across a box while “keys” were slapped against a piece of inked ribbon, ultimately imprinting a character on a piece of paper rolled around the platen. As each key was pressed the platen moved a tiny bit from right to left. When the platen got to the left-most position, the operator was expected to manually move the platen back to the right-most position and continue typing. This movement was really two movements in one. First, the carriage was “returned” to the right-most position, and second, the platen was rolled one line up. (The paper was “fed” around the platen by one line.) If one or the other of these two movements was not performed, then the typing would either run off the right-hand side of the paper, or the letters would be imprinted on top of the previously typed characters. These two movements are called “carriage returns” and “line feeds”, respectively.
Enter computers. Digital representations of characters were saved to files. These files were then sent to printers, but there was no person there to manually move the platen from left to right nor to roll the paper further into the printer. Instead, invisible characters were created. There are many invisible characters, and the two of most interest to us are carriage return (ASCII character 13) and line feed (sometimes called “new line” and ASCII character 10). [2] When the printer received these characters the platen moved accordingly.
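To make these two invisible characters concrete, here is a minimal sketch in Python. It is only an illustration, not part of the workshop materials, and the constant names are my own:

```python
# The two invisible characters, written with Python's escape sequences.
CARRIAGE_RETURN = "\r"  # ASCII character 13
LINE_FEED = "\n"        # ASCII character 10

print(ord(CARRIAGE_RETURN))  # 13
print(ord(LINE_FEED))        # 10

# repr() shows the escape sequences instead of acting on them,
# which makes the otherwise invisible characters visible.
print(repr("first line" + CARRIAGE_RETURN + LINE_FEED + "second line"))
```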
Enter our operating systems. For better or for worse, traditionally each of our operating systems treats the definition of lines differently:
- in a traditional Macintosh file lines are delimited by a single carriage return (ASCII 13)
- on Unix/Linux lines are delimited by line feeds (ASCII 10)
- Windows computers expect lines to be delimited by a combination of both (ASCII 13 and ASCII 10)
Go figure?
Macintosh is much more like Unix nowadays, so most Macintosh text files use the Unix convention.
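If you are curious which convention a given file uses, something like the following Python sketch can tell you. The function name and the sample data are hypothetical, made up for illustration:

```python
def count_line_endings(data: bytes) -> dict:
    """Count how often each line-ending convention appears in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CR+LF pair also contains one CR and one LF, so subtract
    # the pairs to count only the characters that stand alone.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {
        "CR+LF (Windows)": crlf,
        "LF (Unix/Linux)": lf,
        "CR (old Macintosh)": cr,
    }

# Made-up samples, one per convention.
samples = {
    "windows": b"one\r\ntwo\r\n",
    "unix": b"one\ntwo\n",
    "old mac": b"one\rtwo\r",
}
for name, data in samples.items():
    print(name, count_line_endings(data))
```

To check a real file, read it in binary mode, for example open(path, "rb").read(). By default Python's text mode translates all three conventions into a single line feed, which hides exactly the difference we are looking for.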
Windows folks, remember how your text files looked funny when initially displayed? This is because the original text files contained only ASCII 10 and not ASCII 13. Notepad, your default text editor, did not “see” the line feed characters as line breaks, and consequently everything looked funny. Years ago, if a Macintosh computer read a Unix/Linux text file, then all the letters would be displayed on top of each other, even messier.
If you create a text file on your Windows or (older) Macintosh computer, and then you use that file as input to other programs (e.g., wget -i ./urls.txt), then the operation may fail because the programs may not know how a line is denoted in the input.
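One way around this, sketched here in Python under the assumption that the receiving program expects Unix-style lines, is to convert every line ending to a single line feed before handing the file over. The function name is hypothetical:

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Return a copy of data in which every line ends with a single LF."""
    # Replace CR+LF pairs first; otherwise the second step would turn
    # each pair into two line feeds and introduce blank lines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# A made-up list of URLs saved with Windows line endings.
windows_text = b"http://example.com/a\r\nhttp://example.com/b\r\n"
print(to_unix_line_endings(windows_text))
```

To convert a file such as urls.txt, read it with open("urls.txt", "rb"), pass the bytes through the function, and write the result back with open("urls.txt", "wb"). Binary mode matters here, because text mode may translate line endings on its own.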
Confused yet? In any event, text files are not text files are not text files. And the solution to this problem is to use a full-featured text editor, the subject of another essay.
[1] plain text files explained – http://en.wikipedia.org/wiki/Plain_text
[2] intro’ to ASCII – http://www.theasciicode.com.ar