Schwartzian sort example in "Text Processing in Python"

I was browsing through "Text Processing in Python" and tried its example about Schwartzian sort.

I used following structure for sample data which also contains empty lines. I sorted this data by fifth column: 383230 -49 -78 1 100034 '06 text' 9562 'text' 720 'text' 867 335067 -152 -18 3 100030 'text' 2400 'text' 2342 'text' 696 136592 21 230 3 100035 '03. text' 10368 'text' 1838 'text' 977

Code used for Schwartzian sorting:

for n in range(len(lines)):       # Create the transform
    lst = string.split(lines[n])
    if len(lst) >= 4:             # Tuple w/ sort info first
        lines[n] = (lst[4], lines[n])
    else:                         # Short lines to end
        lines[n] = (['\377'], lines[n])

lines.sort()    # Native sort

for n in range(len(lines)):       # Restore original lines
    lines[n] = lines[n][1]

open('tmp.schwartzian','w').writelines(lines)

I don't get how the author intended that short or empty lines should go to end of file by using this code. Lines are sorted after the if-else structure, thus raising empty lines to top of file. Short lines of course work as supposed with the custom sort (fourth_word function) as implemented in the example.

This is now bugging me, so any ideas? If I'm correct about this then how would you ensure that short lines actually stay at end of file?

EDIT: I noticed the square brackets around '\377'. This messed up sort() so I removed those brackets and output started working.

else:                         # Short lines to end
    lines[n] = (['\377'], lines[n])
print type(lines[n][0])
>>> (type 'list')

I accepted nosklo's answer for good clarification about the meaning of '\377' and for his improved algorithm. Many thanks for the other answers also!

If curious, I used 2 MB sample file which took 0.95 secs with the custom sort and 0.09 with the Schwartzian sort while creating identical output files. It works!

Answer

You're right in identifying that the code you were using for Schwartzian sorting had a couple of issues, especially with handling short or empty lines in the way you expected. Let's break down what went wrong and how to fix it.

Problem Explanation

The main issue with your original implementation is in the line where you are creating the tuple for "short lines" in the else block:

lines[n] = (['\377'], lines[n])

Here, you are wrapping the string '\377' (which is the ASCII value for the "delete" character) in a list ['\377']. This makes the first element of the tuple a list, which causes problems during sorting. When you sort the list of tuples, Python compares lists, which doesn't behave the same way as comparing simple string or numeric values. This explains why your sorting didn't work as expected.

Fix: Remove the List Wrapper

To fix this, simply remove the square brackets around '\377', so that the tuple will contain just a string (the delete character), and Python will be able to compare it correctly:

else:                         # Short lines to end
    lines[n] = ('\377', lines[n])

Now, '\377' will be compared directly during the sort, and since '\377' has a very high ASCII value, any line with it will end up at the end of the sorted list.

Explanation of the Schwartzian Transform

The Schwartzian Transform is a technique used to speed up sorting by applying a transformation (like extracting the fifth column) before sorting, and then restoring the original list after the sort. It avoids the need to repeatedly compute the sort key for each comparison, which is the performance bottleneck for sorting in some cases.

Here's a quick step-by-step breakdown of the logic:

Transform Phase: You create a tuple where the first element is the value you want to sort by (e.g., the fifth column), and the second element is the original line. This allows you to sort based on the first element but still keep the original data.
```
lines[n] = (lst[4], lines[n])
```
Sorting Phase: Once you've transformed each line into a tuple, you perform the sort. Since tuples are compared lexicographically, Python will sort based on the first element (the fifth column) of each tuple.
```
lines.sort()
```
Restoration Phase: After sorting, you strip out the sorted values and keep the original lines.
```
lines[n] = lines[n][1]
```

This method ensures that sorting is efficient, especially for large datasets.

Complete Fixed Code

Here’s the complete version of your script with the fix applied:

for n in range(len(lines)):       # Create the transform
    lst = string.split(lines[n])
    if len(lst) >= 4:             # Tuple w/ sort info first
        lines[n] = (lst[4], lines[n])
    else:                         # Short lines to end
        lines[n] = ('\377', lines[n])  # Fix: remove the square brackets

lines.sort()    # Native sort

for n in range(len(lines)):       # Restore original lines
    lines[n] = lines[n][1]

open('tmp.schwartzian', 'w').writelines(lines)

Additional Notes

ASCII value of '\377': The value '\377' corresponds to the ASCII code for the "delete" character (decimal 255), which is the highest ASCII value. This ensures that any short lines, which are assigned this value, will be placed at the end of the sorted list.
Performance: As you've noted, Schwartzian sorting is faster than a regular sort when dealing with large datasets because it avoids recalculating the sort key multiple times. However, if you're working with extremely large files or need even more speed, you might want to look into using libraries like pandas, which can offer optimized sorting methods.

Let me know if you have more questions or if anything is unclear!