Reverse a string in place

InfoDabble > Tech Notes > Reverse a string in place
Jump to: navigation, search

By Eric Hartwell - February 6, 2008

A common interview question is, "How would you reverse a string in place?". While it's a simple problem, it lets the interviewer see a number of things:

  • understanding of characters
  • efficient loop and pointer management, language conventions
  • tests for edge conditions such as empty string, null pointer, even lenth, odd length
  • use of system library
  • test driven development approach

It's a good problem because there's more to it than just a simple algorithm. Desktop applications don't use 8 bit characters any more, and web applications never did. Then there are performance and portability issues...

[edit] Basic algorithms

[edit] Pointers pointing

A typical solution in ANSI C would look something like this:[1]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
/* reverse a string in place, return str */
static char* reverse(char* str)
{
    char* left  = str;
    char* right = left + strlen(str) - 1;
    char  tmp;
    while (left < right) {
        tmp      = *left;
        *left++  = *right;
        *right-- = tmp;
    }
    return str;
}

That's a decent solution. The hardcore hacker might go on to eliminate the use of a temporary variable when it's not strictly necessary:[2]

// The extra char buffer is unnecessary (Dominic Cooney - Nov 30, 2002):
    *left  ^= *right;
    *right  ^= *left;
    *left++  ^= *right--;
 
// xor's are for sissies (b, Dec 30, 2002):
    *left  = *left + *right;
    *right  = *left - *right;
    *left  = *left++ - *right--;

[edit] Use the library, Luke

Of course, in the real world, if you're using <string.h> it makes much more sense to actually *use* it:

// The POSIX C version:
puts(strrev("Hello world!")); // outputs "!dlrow olleH"
 
// The ISO conformant version:
puts(_strrev("Hello world!")); // outputs "!dlrow olleH"

The same in PHP:

<?php
echo strrev("Hello world!"); // outputs "!dlrow olleH"
?>

Instead of 8 lines of code, we're using one. Furthermore, that one line of code has already been extensively tested by the compiler manufacturer and by millions of users. But let's assume we have a good reason to write and test our own code (e.g. it's a job interview) ...

[edit] Test cases and test driven design/development

Now let's write the same function using a test-driven approach.[3]

  1. Write one test that expresses what the code should do and through which structure it should do that,
  2. Run for failure ensures that the recently added test fails,
  3. Code minimal solution that makes the written test pass (while not breaking existing tests),
  4. Run for success ensures that test and code fit one another,
  5. Refactor tests and code so that redundant code, constants, initialization and structures should be factored into more general and/or more simpler solution,
  6. Run to success ensures that refactoring does not break code and test base.

First write the basic failure test for the empty function:

static void test_reverse()
{
    // For this to work you must compile with writable strings.
    assert(!strcmp(reverse("123"), "321"));
}
 
static char* reverse(char* str)
{
    return("");
}

Now hardcode a return value so the test will pass:

static char* reverse(char* str)
{
    return("321");
}

Now extend the test case again:

static void test_reverse()
{
    assert(!strcmp(reverse("123"), "321"));
    assert(!strcmp(reverse("4321"), "1234"));
}

Update the code, so that works, and so on. As you get closer to building a real function, the test routine gets more realistic too.

static void test_reverse()
{
    assert(!strcmp(reverse("123"), "321"));
    assert(!strcmp(reverse("4321"), "1234"));
 
    assert(!strcmp(reverse(""), ""));
    assert(reverse(NULL) == NULL);
}

Note that the last two tests deal with parts of the problem that weren't originally specified but are still essential in the real world.

Next, we should add a test to make sure the result is actually in the same location (address) as the input. Also, in the case of a simple function like this, you could also add a test to generate a number of random strings of random length, and compare the result using some other function (like _strrev).

[edit] The real world

We've already covered the basic edge cases:

  • empty string
  • even, odd lengths
  • null pointer

The solution so far seems complete, but it can still fail in the real world in cases like these:

  • illegal pointer (example: pass a numeric value in a strongly typed language)
  • read-only (pointer to const, immutable, ROM, resource, system, remote, outside sandbox)
  • ill-formed string: no terminator, illegal length (e.g. negative)
  • too long (8 bit/16 bit addressing)
  • characters have different sizes (e.g. UTF-8) so result must be built in a temporary buffer then copied back

For example, the source string may be read-only, causing erratic results

  • located in protected memory (const "123"; immutable), causing a compile-time or run-time error
  • located in ROM, causing the changes to be ignored
  • located in swappable code memory, which is reloaded as the original value some time after the routine has run

In all these cases, the original requirement for "in place" results has to be reviewed.

Typically the specification can be changed to permit the result to be returned in a new location - but this adds a whole level of complexity in languages like C where allocated memory blocks must be manually freed.

[edit] Code, Unicode, and the UTF triplets

A string is a finite series of characters. But what's a character?

In legacy applications, a character was represented as 6, 7, or 8 bits of one byte: data type char, char *. In the real world, you can't fit all possible letters into a mere 8 bits. Modern applications tend to use some form of Unicode characters.

At this point, stop and (re)read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).[4]

Each pattern of bits is connected to a particular character using an encoding standard:

  • custom encoding and packing
  • 6-16 bit: ASCII, ANSI, EBCDIC
  • 32 bit Unicode: UTF-8, UTF-16, UTF-32. All three UTF encoding forms need at most 4 bytes (or 32-bits) of data for each character.
    • Under the Unicode standard (and on Unix/Linux) a character is 4 bytes encoded with UTF-32, so wchar_t works properly.
    • Under the Windows API, wchar_t is 2 bytes wide so it represents a UTF-16 character (or part of it).
  • UTF-8 characters can be 1, 2, 3, 4, 5, or 6 bytes

Allen Hadden wrote a nice discussion of the Unicode issues on fogcreek in February 2005:[5]

"We've started asking candidates to write a function that reverses strings in C (one of our developers started after reading it here). Nearly everyone gets it close to right, but some people goof up the end condition. Others end up including the final null in the swap. Nothing surprising there."
"The thing that I find interesting is that nobody considers that the characters might be multi-byte... So now after the candidate gets the answer right, we ask them if there are there any inputs that this function won't work with. Most people don't have any idea, which I find amazing. These are usually people with 5-10 years C/C++ development experience. The ones who do get it usually don't really know how to fix the problem. The best answer we received was that we should use Unicode (which would work, but might force you to redesign your app)."

Even using Unicode isn't quite correct under Windows, since under the Windows API, wchar_t is 2 bytes wide instead of 4. The Windows API breaks the ANSI/ISO C standard by not making wchar_t the character type that supports all the system representable characters in a single wchar_t unit; instead, wchar_t in Windows represents a UTF-16 character (or part of it). So you may still need multiple wchar_ts to represent some of the more obscure letters.

  • Is the encoding: byte, word, packed, custom, ASCII, ANSI, 16 bit, UTF-8, UTF-16, UTF-32, Unicode? What if we don't know?

[edit] Refining and optimizing the solution

But wait, there's more. In .NET applications the concept of an in-memory string gets fuzzy, since *everything* is an object, and the runtime actively protects things like immutable strings.

Then there are performance and portability issues.

  • What are we optimizing for? performance? code size (e.g. client-side code in web page)?
  • Size of string: typical user input (5, 10, 100, 1000 characters) or millions and billions?
  • What platforms must it run on?

Justin Rogers looked at the performance of a variety of string reversing algorithms for managed strings in .NET[6]

"Let me tell you folks, the results here really surprised me, and I don't get surprised very often. Obviously the most performant string reverse would be an in-place reverse over top of the previous string. Unfortunately you can't do this in .NET with the whole concept of immutability, so you have to use your ingenuity to work around this, come up with the right mix of buffer allocations."

This was the fastest algorithm for strings less than ~100-150 characters:

private static string ReverseUsingCharArrayInline(string input) {
    char[] reversed = new char[input.Length];
    for(int i = 0, j = reversed.Length - 1; i <= j; i++, j--) {
        reversed[i] = input[j];
        reversed[j] = input[i];
    }
    return new String(reversed);
}

This method actually performs faster over large strings by about 5% over the fastest algorithm:

private static string ReverseUsingCharArrayCopyAndReverse(string input) {
    char[] reversed = input.ToCharArray();
    Array.Reverse(reversed);
    return new String(reversed);
}

The first version has 5 lines of code, 3 local variables, one loop to test, and it's faster with short strings. The latter version has 3 lines of code, one local variable, two library calls, no conditional checking, looping, or other control constructs to test, and it's faster with long strings.

Which is better?

My answer: Never write code unless you have to. Reduce, reuse, recycle über alles!

[edit] References

  1. Reverse a String of Words (ANSI C version) - eyepopslikeamosquito, Dec 15, 2006)
  2. fogcreek reverse a string - word by word - revisited
  3. Patchwork - Test Driven-Development and Code Coverage
  4. Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  5. Allen Hadden, Reverse strings/words and i18n, February 2005
  6. Justin Rogers: Performance: Fastest string reversing algorithms... (final results), June 2004