11.4. Text manipulation tools

	Also see
	See tac and cat in Section 11.2, as they can perform text modification too.

sort

With no options, sort orders lines alphabetically. It can be run on text files to sort them (note that it also concatenates multiple files before sorting), or it can be used with a pipe '|' to sort the output of a command.

Use sort -r to reverse the sort order. Use the -g option to sort 'numerically', so that numbers are compared by their value rather than character by character as text (the -n option does the same for plain integers and decimals).

Examples:

cat shoppinglist.txt | sort

The above command would run cat on the shopping list then sort the results and display them in alphabetical order.

sort -r shoppinglist.txt

The above command would sort the file and display it in reverse alphabetical order.
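The following is a small sketch of the difference between plain, reversed and numeric sorting. The file names and their contents are made up for illustration, and LC_ALL=C is set only so that the ordering does not depend on your locale settings:

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' milk apples bread > "$tmpdir/shoppinglist.txt"   # hypothetical sample file

# LC_ALL=C forces plain byte ordering (an assumption made for repeatable output).
sorted=$(LC_ALL=C sort "$tmpdir/shoppinglist.txt")
reversed=$(LC_ALL=C sort -r "$tmpdir/shoppinglist.txt")

# As text, 4000 sorts before 50; with -g the numeric values are compared.
printf '%s\n' 50 4000 7 > "$tmpdir/numbers.txt"
as_text=$(LC_ALL=C sort "$tmpdir/numbers.txt")
as_numbers=$(LC_ALL=C sort -g "$tmpdir/numbers.txt")

printf '%s\n' "$sorted" "$reversed" "$as_text" "$as_numbers"
rm -r "$tmpdir"
```

The third list comes out as 4000, 50, 7 and the fourth as 7, 50, 4000.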

Advanced sort commands:

sort is a powerful utility; here are some of its harder to learn (and lesser used) options. Use the -t option to set a particular symbol as the separator, then use the -k option to specify which column you would like to sort by, where column 1 is the first column before the separator. Also use the -g option if numeric sorting is not working correctly (without it, sort compares numbers character by character, as text). Here is a complex example:

sort -t : -k 4,4g -k 1,1 /etc/passwd | more

This will sort the "/etc/passwd" file, using the colon ':' as the separator. It will sort by the 4th column (the GID field in this file) and then, if there are any ties, by the first column (the name). Writing a key as 4,4 restricts it to that one column; a bare -k 4 would run from column 4 to the end of the line. The g attached to the first key makes it compare full numbers, otherwise 4000 would come before 50 (the text comparison sees '4' before '5').
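To try a two-key sort without touching the real "/etc/passwd", here is a sketch that uses a made-up sample file in the same format. The key notation 4,4g (limit the key to column 4 and compare it numerically) is one way of writing it; the account names are hypothetical:

```shell
tmpdir=$(mktemp -d)
# Hypothetical sample in /etc/passwd format: name:password:UID:GID:comment:home:shell
cat > "$tmpdir/passwd.sample" <<'EOF'
zoe:x:1002:4000:Zoe:/home/zoe:/bin/bash
bob:x:1001:50:Bob:/home/bob:/bin/bash
amy:x:1000:50:Amy:/home/amy:/bin/sh
EOF

# -k 4,4g: compare column 4 only, numerically; -k 1,1: break ties on column 1.
# cut then keeps just the name and GID so the result is easy to read.
by_gid=$(LC_ALL=C sort -t : -k 4,4g -k 1,1 "$tmpdir/passwd.sample" | cut -d : -f 1,4)
printf '%s\n' "$by_gid"
rm -r "$tmpdir"
```

The two accounts with GID 50 come first, ordered by name (amy, then bob), followed by zoe with GID 4000.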

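join Will put lines from two files together when they share a common value in the join field (by default the first field on each line). Both files should already be sorted on that field. It won't print lines that don't have a matching value in the other file.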

Command syntax:

join file1 file2
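As a rough illustration, assume two small files (the names and contents here are invented) that are both sorted on their first field:

```shell
tmpdir=$(mktemp -d)
# Both hypothetical files are already sorted on field 1, which join expects.
printf '%s\n' '1 alice' '2 bob' '3 carol' > "$tmpdir/names.txt"
printf '%s\n' '1 sales' '3 support' '4 finance' > "$tmpdir/depts.txt"

# Only keys present in both files (1 and 3) produce an output line.
joined=$(LC_ALL=C join "$tmpdir/names.txt" "$tmpdir/depts.txt")
printf '%s\n' "$joined"
rm -r "$tmpdir"
```

Keys 2 and 4 appear in only one file each, so they are left out of the result.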

cut Prints selected parts of each line (of a text file), or, in other words, removes the other sections of each line. You may wish to select things according to tabs or commas, by character position, or anything else you can think of...

Options for cut:

-d --- allows you to specify another delimiter (the default is a tab), for example ':' is often used with /etc/passwd. It is used together with the -f option:
cut -d ':' -f 1 /etc/passwd
This would get you only the usernames in /etc/passwd
-f --- this option works with the text by columns, separated according to the delimiter. For example if your file had lines like "result,somethingelse,somethingelse" and you only wanted result you would use:
cut -d ',' -f 1 file1.txt
"," (commas) --- used to separate numbers, these allow you to cut several particular columns. For example:
cut -d ':' -f 1,7 /etc/passwd
This would only show the username and the shell that each person is set up with in /etc/passwd.
"-" (hyphen) --- used to show a range from x to y, for example 1-4 (would be from column or character 1 to 4).
cut -c 1-50 file1.txt
This would cut (display) characters (columns) 1 to 50 of each line (and anything else on that line is ignored)
-x and x- --- where x is a number, use -x to cut from the start of the line to "x" and use x- to cut from "x" to the end of the line.
cut -c -5,8,20- file2.txt
This would display ("cut") characters (columns) 1 to 5, 8 and from 20 to the end.
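The options above can be tried on a single line of text rather than a real file. This is only a sketch; the passwd-style line is invented:

```shell
line='alice:x:1000:1000:Alice Example:/home/alice:/bin/bash'   # hypothetical passwd-style line

name=$(printf '%s\n' "$line" | cut -d ':' -f 1)         # first field only
name_shell=$(printf '%s\n' "$line" | cut -d ':' -f 1,7) # fields 1 and 7
first_five=$(printf '%s\n' "$line" | cut -c 1-5)        # characters 1 to 5

# Characters 1 to 5, character 8, and character 20 to the end of the line.
picked=$(printf '%s\n' 'abcdefghijklmnopqrstuvwxyz' | cut -c -5,8,20-)

printf '%s\n' "$name" "$name_shell" "$first_five" "$picked"
```

Note that cut keeps the delimiter between the fields it prints, so the second result is alice:/bin/bash.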

ispell/aspell

Used to spell check a file interactively; each one prompts you to replace a word or continue. aspell is said to be better at suggesting replacement words, but it's probably best to find out for yourself.

aspell example:

aspell -c FILE.txt

This will run aspell on a particular file called "FILE.txt"; aspell will run interactively and prompt for user input.

ispell example:

ispell FILE.txt

This will run ispell on a particular file called "FILE.txt"; ispell will run interactively and prompt for user input.

chcase Is used to change the uppercase letters in a file name to lowercase (or vice versa).

You could also use tr to do the same thing...

cat fileName.txt | tr 'A-Z' 'a-z' > newFileName.txt

The above would convert uppercase to lowercase using the file "fileName.txt" as input and outputting the results to "newFileName.txt". The character ranges are quoted so that the shell does not try to expand them as file name patterns.

cat fileName.txt | tr 'a-z' 'A-Z' > newFileName.txt

The above would convert lowercase to uppercase using the file "fileName.txt" as input and outputting the results to "newFileName.txt".

chcase (a perl script) can be found at the chcase homepage.

fmt (format) a simple text formatter. Use fmt with the -u option to output text with "uniform spacing", where the space between words is reduced to one space character and the space between sentences is reduced to two space characters.

Example:

fmt -u myessay.txt

This will make sure the amount of space between sentences is two spaces and the amount of space between words is one space.
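A quick way to see the effect is to feed fmt a badly spaced line through a pipe. This sketch assumes the GNU version of fmt, and the sample text is made up:

```shell
# Irregular spacing between words, and three spaces after the first sentence.
uniform=$(printf '%s\n' 'This   is    spaced  oddly.   Next    sentence.' | fmt -u)
printf '%s\n' "$uniform"
```

The result has single spaces between words and two spaces after "oddly.".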

paste Puts lines from two files together, either the matching lines of each file side by side (normally separated by a tab-stop but you can have any symbol(s) you like...) or all the lines of each file joined in turn (the first file then the second file), giving one output line per file.

To obtain a list side by side, with the first line from the first file on the left, then a tab-stop, then the first line from the second file, you would type:

paste file1.txt file2.txt

To have the list displayed in serial (first line from the first file, [Tab], second line from the first file, then the third and fourth until the end of the first file, followed by the second file on a new line), type:

paste --serial file1.txt file2.txt
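The two layouts can be compared with a pair of tiny files. This is a sketch with invented contents; -s is the short form of --serial:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' apple banana > "$tmpdir/file1.txt"   # hypothetical sample files
printf '%s\n' red yellow > "$tmpdir/file2.txt"

# Side by side: line N of file1, a tab, then line N of file2.
side_by_side=$(paste "$tmpdir/file1.txt" "$tmpdir/file2.txt")

# Serial: all of file1 on one line, then all of file2 on the next.
serial=$(paste -s "$tmpdir/file1.txt" "$tmpdir/file2.txt")

printf '%s\n' "$side_by_side" "$serial"
rm -r "$tmpdir"
```

The first form pairs apple with red and banana with yellow; the second puts apple and banana on one line and red and yellow on the next.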

expand Will convert tabs to spaces and output the result. Use the option -t num to specify the size of a "tabstop", the number of characters between each tab position.

Command syntax:

expand file_name.txt

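unexpand Will convert spaces to tabs and output the result. By default only the leading spaces on each line are converted; use the -a option to convert other runs of spaces as well.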

Command syntax:

unexpand file_name.txt
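Both commands also read from a pipe, which makes for a short experiment. The sketch below assumes the default tabstop of 8 for unexpand and sets a tabstop of 4 for expand:

```shell
# With a tabstop of 4, the tab after "a" becomes three spaces (up to column 4).
expanded=$(printf 'a\tb\n' | expand -t 4)

# Eight leading spaces are replaced by a single tab (default tabstop of 8).
leading=$(printf '        indented\n' | unexpand)

printf '%s\n' "$expanded" "$leading"
```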

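uniq Eliminates adjacent duplicate lines from a file, which sometimes greatly simplifies the display. Because only neighbouring lines are compared, the input is usually sorted first (for example with sort).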

uniq options:

-c --- count the number of occurrences of each entry
-u --- list only unique entries (those that are not repeated)
-d --- list only duplicate entries (one copy of each)

For example:

uniq -cd phone_list.txt

This would display any duplicate entries only and a count of the number of times that entry has appeared.
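Here is a sketch of the options in use on an invented phone list. The list is sorted first so that identical entries sit next to each other, and sed is used only to strip the padding that uniq puts in front of each count:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' bob alice bob carol alice bob > "$tmpdir/phone_list.txt"   # hypothetical data

# Duplicated entries with their counts.
dupes=$(LC_ALL=C sort "$tmpdir/phone_list.txt" | uniq -cd | sed 's/^ *//')

# Entries that appear exactly once.
singles=$(LC_ALL=C sort "$tmpdir/phone_list.txt" | uniq -u)

printf '%s\n' "$dupes" "$singles"
rm -r "$tmpdir"
```

alice appears twice and bob three times, while carol is the only entry that is not repeated.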

tr (translate). A filter useful for replacing all instances of particular characters in a text file, or for "squeezing" the white space.

Example:

cat some_file | tr 3 5 > new_file

This will run the cat program on some_file; the output of this command will be sent to the tr command, and tr will replace all the instances of the character 3 with 5, like a search and replace. You can also do other things such as:

cat some_file | tr 'A-Z' 'a-z' > new_file

This will run cat on some_file and convert any capital letters to lowercase letters (you could use this to change the case of file names too...).
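The same ideas can be tried directly on a line of text. In this sketch the sample strings are invented, and the last command shows the -s option for "squeezing" repeated spaces:

```shell
# Replace every character 3 with 5.
replaced=$(printf '%s\n' 'room 33, floor 3' | tr 3 5)

# Convert capital letters to lowercase; the ranges are quoted to protect them from the shell.
lower=$(printf '%s\n' 'Hello World' | tr 'A-Z' 'a-z')

# Squeeze each run of spaces down to a single space.
squeezed=$(printf '%s\n' 'too    many   spaces' | tr -s ' ')

printf '%s\n' "$replaced" "$lower" "$squeezed"
```

Remember that tr works on single characters, so it cannot replace one whole word with another.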

	Alternatives
	You can also do a search and replace with a one line Perl command, read about it at the end of this section.

nl The number lines tool; its default action is to write its input (either the files named as arguments, or the standard input) to the standard output.

Line numbers are added to every non-empty line and the text is indented.

This command can take some more advanced numbering options; simply read the info page on it.

These advanced options mainly relate to customisation of the numbering, including different forms of separation for sections/pages/footers etc.

Also try cat -n (number all lines) or cat -b (number all non-blank lines). For more info on cat check under this section: Section 11.2

There are two ways you can use nl:

nl some_text_file.txt

The above command would add numbers to each non-empty line of some_text_file.txt. You could also use nl to number the output of something, as shown in the example below:

grep some_string some_file | nl
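As a sketch of both uses, the lines below number some invented text from a pipe. The -w (number width) and -s (separator) options in the second command are there only to make the output more compact:

```shell
# Default layout: the number is right-aligned in 6 columns, followed by a tab.
numbered=$(printf '%s\n' first second third | nl)

# Number only the lines that grep lets through.
matches=$(printf '%s\n' 'error: disk' 'ok' 'error: net' | grep error | nl -w 1 -s ' ')

printf '%s\n' "$numbered" "$matches"
```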

Perl search and replace

To search and replace text in a file, use the following one-line Perl command[1]:

$ perl -pi -e "s/oldstring/newstring/g;" filespec [RET]

In this example, oldstring is the string to search for, newstring is the string to replace it with, and filespec is the name of the file or files to work on. You can use this for more than one file.

Example: To replace the string `helpless' with the string `helpful' in all files in the current directory, type:

$ perl -pi -e "s/helpless/helpful/g;" * [RET]

If you only need to replace single characters rather than whole strings, also try using tr (see further above in this section).

If these tools are too primitive

If these text tools are too simple for your purposes then you are probably looking at doing some programming or scripting.

If you would like more information on bash scripting then please see the advanced bash scripting guide, authored by Mendel Cooper.

sed and awk are traditional UNIX system tools for working with text; this guide does not provide an explanation of them. sed works on a line-by-line basis, performing substitutions, and awk can perform a similar task or assist by working on a file and printing out certain information (it's a programming language).

You will normally find them installed on your GNU/Linux system and will find many tutorials all over the internet; feel free to look them up if you ever have to perform many similar operations on a text file.
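Just to give a taste of what they look like (this is a minimal sketch with invented input, not a tutorial), here is a sed substitution and an awk command that prints one column:

```shell
# sed: replace the first "cat" on each line with "dog".
substituted=$(printf '%s\n' 'one cat' 'two cats' | sed 's/cat/dog/')

# awk: split each line on ':' and print the first field.
names=$(printf '%s\n' 'alice:x:1000' 'bob:x:1001' | awk -F ':' '{ print $1 }')

printf '%s\n' "$substituted" "$names"
```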

Notes