Old junk code: Word finder

Written by ngws, 17 January 2013 - 16:00

If you ever get tired of looking at your own junk code, take a look at this.

In August 2008, when I was still learning to program in C, I wrote a program
called "ordfinder" (Danish for "word finder") which, given a word and a
dictionary, prints the dictionary words that can be formed from the letters of
the given word in any order. Incredibly, it ended up compiling, and it works
perfectly for any word whose length does not exceed 8 characters, although it is
a bit slow.

But why not more than 8 characters? My view of memory might have been a bit
naive back then, because the first step in my algorithm is to generate and
store all permutations of all subsequences of the given word. That is, if the
string is "me", my program stores the array { "m", "e", "me", "em" } in
memory before going on to read the dictionary and look for matches.

If the string is "you", the program stores { "y", "o", "yo", "oy", "u", "yu", "uy", "ou", "uo", "you", "yuo", "oyu", "ouy", "uyo", "uoy" }.

If the string is "computer", the program stores the 109600 permutations of the
subsequences of "computer".

If the string is "difficult", the length of 9 characters means that the program
attempts to store 986409 strings of lengths 1 to 9. That probably takes up not
more than 10 MB, so it shouldn't be a problem. However, my program seems to
store the list of words on the stack instead of on the heap, so words of length
9 or above cause a stack overflow.
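
The counts above can be reproduced without generating a single string: a word of length len has len!/(len-n)! arrangements of each length n, and the total is the sum of those for n from 1 to len. Here is a minimal sketch in C. The function name is made up for this example, it treats every letter position as distinct, and it assumes len is at most 20 so that the result fits in 64 bits.

```c
#include <stdint.h>

/* Total number of ordered arrangements of 1..len letters taken from a word
 * of length len: the sum over n of len! / (len - n)!.
 * Assumption: len <= 20, otherwise the 64-bit arithmetic overflows. */
uint64_t count_subsequence_permutations(unsigned len) {
    uint64_t total = 0;
    uint64_t term = 1; /* holds len! / (len - n)! for the current n */
    for (unsigned n = 1; n <= len; n++) {
        term *= (len - n + 1);
        total += term;
    }
    return total;
}
```

Calling it with 2, 3, 8 and 9 gives 4, 15, 109600 and 986409, matching the examples above.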

In any case, a word length of 10 would require about 100 MB, a word length of 11
about 1.2 GB, a word length of 12 about 15.6 GB, and a word length of 17 (like
"inconspicuousness") about 16.5 petabytes (16500000 GB). That's 6.5 petabytes
more than what the Internet Archive uses to store millions of
websites, books, video and audio.

So perhaps neither my algorithm nor my implementation was that good.

The code

Note that this code doesn't actually compile, because so much of it is simply
wrong. However, it did compile back in 2008, which means that either I added the
wrong code after I had compiled it, or I used an overly friendly compiler (I
don't remember which compiler it was, but it ran on Windows). I have run the old
executable with Wine, and that works.

It's not necessary to know C to laugh at this code, but it helps.

We'll start with some basic #includes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <math.h>

So far, so good. Then the global variables with descriptive names. And let's
declare four strings of length 0 to be statically allocated, because we'll just
extend them later on...?

char os[0],s[0],r[0],t[0];
int l,c,rc,k,sk,i,ii,iii,ri;

The next step is to define our own version of C's standard strstr function
(almost). I was used to PHP, so I wanted an index back, like the one PHP's
strpos gives (with -1 standing in for PHP's false).

int strpos (const char *haystack, const char *needle) {
  int i;

  if (strlen (haystack) < strlen (needle))
    return -1;

  for (i = 0; i <= (strlen (haystack) - strlen(needle)); i++) {
    if (!strncmp (&haystack[i], needle, strlen(needle)))
      return i;
  }

  return -1;
}
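
For comparison, the same index-or-minus-one behaviour can be had by wrapping strstr and turning its pointer result into an offset. This is only a sketch, not what the 2008 program does, and the name strpos_sketch is invented here to avoid clashing with the function above.

```c
#include <string.h>

/* Returns the index of the first occurrence of needle in haystack,
 * or -1 if needle does not occur. */
long strpos_sketch(const char *haystack, const char *needle) {
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

Like the hand-written loop, this returns 0 for an empty needle, because strstr reports a match at the start of the haystack in that case.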

Then it's time for the main function. We don't want to separate it into
auxiliary functions, because that's just ugly!

Indentation? Too much wastes too much space.

int main(int argc, char *argv[])
{
 if (argc>1) {
 strcpy(os,argv[1]);
 }
 else {
 printf("Indtast ord: ");
 gets(os);
 }
 printf("T\x91nker...\n");
 strcpy(s,os);
 for(i=0;s[i];i++) {
 s[i]=tolower(s[i]);
 }

Wait, what? We use strcpy to copy the string argv[1], which contains the
word we want to permute, into the statically allocated os with length 0? Or we
read a line from standard input and save it in os? And almost the same for s?
That's... not good.
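
The underlying problem is that os and s have room for zero characters, so both strcpy and gets write past the end of them. One way around that is to allocate exactly as much as the word needs. The sketch below assumes the word arrives as an ordinary NUL-terminated string, and lowercase_copy is a name invented for this example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated, lower-cased copy of src, or NULL if the
 * allocation fails. The caller owns the result and must free() it. */
char *lowercase_copy(const char *src) {
    size_t len = strlen(src);
    char *copy = malloc(len + 1); /* +1 for the terminating NUL */
    if (copy == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        copy[i] = (char)tolower((unsigned char)src[i]);
    copy[len] = '\0';
    return copy;
}
```

For the interactive case, fgets with an explicit buffer size is the usual replacement for gets, which was removed from the language in C11.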

At least these two lines aren't that bad.

 l=strlen(s);
 c=pow(l,l);

But then begins the actual permutation generation logic. I have tried to
re-understand it, with no success.

 rc=1;
 i=0;
 while (i<l-1) {
 rc=rc*(l-i);
 i++;
 }

While we're at it, why not declare two to-be-statically-allocated arrays with
dynamically-generated ints as lengths?

 int ca[l];
 char ra[rc][l+1];

And then some more assignments and while loops...

 ri=0;
 i=0;
 while (i<c) {
 k=1;
 ii=0;
 while (ii<l && k==1) {

This formula does something. I'm not sure what.

 ca[ii]=floor(i/pow(l,l-ii-1))-floor(i/pow(l,l-ii))*l;
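
Read with integer arithmetic in mind, the formula appears to pick out one digit of i written in base l: the digit at position ii, counted from the most significant of l digits. Under that reading, the loop around it walks through all l^l index tuples and seems to keep only those in which no digit repeats. Here is a sketch of the same digit extraction without floating point, with base_digit as a hypothetical helper name.

```c
/* Digit of i at position pos (0 = most significant) when i is written as a
 * width-digit number in the given base.
 * Assumptions: base >= 1, pos < width, and base^(width - pos - 1) fits in
 * an unsigned long. */
unsigned base_digit(unsigned long i, unsigned base, unsigned width, unsigned pos) {
    unsigned long divisor = 1;
    for (unsigned k = 0; k + pos + 1 < width; k++)
        divisor *= base; /* ends up as base^(width - pos - 1) */
    return (unsigned)((i / divisor) % base);
}
```

For example, 5 written as three digits in base 3 is 012, so positions 0, 1 and 2 give 0, 1 and 2.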

More while loops, now also with if statements.

 iii=0;
 while (iii<ii) {
 if (ca[ii]==ca[iii]) {k=0;}
 iii++;
 }
 ii++;
 }
 if (k==1) {
 strcpy(ra[ri],"");
 ii=0;
 while (ii<l) {
 strncpy(t,s+ca[ii],1);

Let's concatenate t onto ra[ri], a string which hardly exists due to the
char ra[rc][l+1]; magic above.

 strcat(ra[ri],t);
 ii++;
 }

And why not concatenate an end-of-string mark onto a string which, if it
doesn't have an end-of-string mark, will make strcat fail miserably?

 strcat(ra[ri],"\0");
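
As far as strcat is concerned, "\0" is an empty string, so that call appends nothing. A sketch of a more direct way to build a string from a list of letter indices is to write each character into place and terminate the result explicitly. The function name and parameters below are invented for this example.

```c
#include <stddef.h>

/* Writes the letters of word selected by idx[0..n-1] into out and
 * NUL-terminates it. Assumption: out has room for n + 1 characters and
 * every idx[k] is a valid index into word. */
void build_from_indices(const char *word, const int *idx, size_t n, char *out) {
    for (size_t k = 0; k < n; k++)
        out[k] = word[idx[k]];
    out[n] = '\0';
}
```

With the word "you" and the indices 2, 0, 1, this writes "uyo" into out.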

And then more junk.

 sk=1;
 ii=0;
 while (ii<ri && sk==1) {
 if (strcmp(ra[ri],ra[ii])==0) {sk=0;}
 ii++;
 }
 if (sk==1) {
 //printf("%s\n",ra[ri]);
 ri++;
 }
 }
 i++;
 }
 //printf("\nOrd: %s\nOrdl\x91ngde: %d\nOrdkombinationer: %d\n",os,l,ri);

Phew... At this point, I'm certain that ra is supposed to be an array of all
word permutations. So let's open our dictionary "ord.txt" and look for matches.

 FILE *f;
 char wrd[128];
 if (f=fopen("ord.txt","r")) {
 FILE *fw;

Everything is written both to output.txt and standard out. Anything else would
be stupid.

 fw=fopen("output.txt","w");
 printf("Ord dannet af \"%s\":\n\n",os);
 fprintf(fw,"Ord dannet af \"%s\":\n\n",os);
 int wc=0;
 while(!feof(f)) {
 if(fgets(wrd,126,f)) {

The words each end with a newline, so let's replace the newline with an
end-of-string mark.

 wrd[strlen(wrd)-1]=0;
 //printf("%s\n",wrd);
 k=0;
 ii=0;
 while (ii<ri && k==0) {
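
The newline-stripping line above removes the last character whether or not it is a newline, so a final dictionary line without a trailing newline would lose its last letter. A common idiom uses strcspn to cut at the first line-ending character instead. It is shown here as a sketch, with strip_newline as a hypothetical helper.

```c
#include <string.h>

/* Truncates line in place at the first '\r' or '\n', if there is one.
 * A line without a line ending is left unchanged. */
void strip_newline(char *line) {
    line[strcspn(line, "\r\n")] = '\0';
}
```

Since fgets already returns NULL at end of file, a loop of the form while (fgets(wrd, sizeof wrd, f)) would also make the separate feof test unnecessary.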

The magical core of the matching logic, using our own strpos:

 if (strpos(ra[ii],wrd)>-1) {k=1;}

If k == 1, something good happens. But it doesn't happen at once for some
reason.

 ii++;
 }
 if (k==1) {
 printf("%s\n",wrd);
 fprintf(fw,"%s\n",wrd);
 wc++;
 }
 }
 }
 printf("\nI alt %d ord\n",wc);
 fprintf(fw,"\nI alt %d ord",wc);
 fclose(fw);
 fclose(f);
 system("output.txt");
 }
 return 0;
}

And that's my pretty C code.

The SML equivalent

To make my inefficient algorithm a bit clearer, I have made a few SML functions
to do the same as above:

open List

(* Removes the first occurrence of an element from a list. *)
fun remove _ [] = []
  | remove x (y :: ys) = if x = y
                         then ys
                         else y :: remove x ys

(* Tails of a list. Stolen from Haskell's Data.List. *)
fun tails [] = [[]]
  | tails (xxs as (_ :: xs)) = xxs :: tails xs

(* Non-empty subsequences of a list. Stolen from Haskell's Data.List. *)
fun nonEmptySubsequences [] = []
  | nonEmptySubsequences (x :: xs) =
    let
        fun f (ys, r) = ys :: (x :: ys) :: r
    in 
        [x] :: foldr f [] (nonEmptySubsequences xs)
    end

(* All permutations of a list. *)
fun permutations [] = [[]]
  | permutations xs =
    let
        fun subPermutations x = map (fn ys => x :: ys) (permutations (remove x xs))
    in
        concat (map subPermutations xs)
    end


(* Permutations of subsequences of a list. *)
fun subsequencePermutations xs = concat (map permutations (nonEmptySubsequences xs))

(* The same, but for a string. *)
fun stringSubsequencePermutations s = map implode (subsequencePermutations (explode s))

(* Finds words in `wordList` which match any permutation of any subsequence
 * of `word`. *)
fun findMatchingWords word wordList =
    let
        val wordPermutations = stringSubsequencePermutations word
    in
        filter (fn testWord =>
                   exists (fn word => word = testWord)
                          wordPermutations) wordList
    end

As well as some SML functions to calculate the number of permutations and bytes:

(* Calculates the factorial. *)
fun factorial 0 = 1
  | factorial n = n * factorial (n - 1)

(* Calculates the binomial coefficient. *)
fun binomc n k = factorial n div (factorial k * factorial (n - k))

(* Gives [m, m + 1, ..., n]. *)
fun upTo m n = if m < n
               then m :: upTo (m + 1) n
               else [m]

(* Gives the total number of word subsequence permutations for a given word
 * length. *)
fun nPermutations len = foldl op+ 0 (map (fn n => factorial n * binomc len n)
                                         (upTo 1 len))

(* Gives the size in bytes for storing all word subsequence permutations for a
 * given word length in a space-saving way: there are `len` arrays, each taking
 * up space for the pointer to the array and the permutations of subsequences of
 * length n where `1 <= n <= len` and n is unique.
 *)
fun nSize len = 8 * len + foldl op+ 0 (
                map (fn n => (n + 1) * factorial n * binomc len n)
                    (upTo 1 len))

The alternative

Preprocess the dictionary into a clever data structure and don't use up all the
memory.
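
One possible direction is to skip the permutations altogether and compare letter counts: a dictionary word can be formed from the given word exactly when it uses no letter more often than the given word contains it. The sketch below does only that per-word check and does not preprocess the dictionary, so it is not necessarily the data structure hinted at above. The name can_form is invented for this example.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true if candidate can be spelled with the characters of letters,
 * using each character at most as often as it occurs in letters. */
bool can_form(const char *letters, const char *candidate) {
    int available[UCHAR_MAX + 1] = {0};
    for (const unsigned char *p = (const unsigned char *)letters; *p; p++)
        available[*p]++;
    for (const unsigned char *p = (const unsigned char *)candidate; *p; p++)
        if (--available[*p] < 0)
            return false; /* candidate needs more of this character */
    return true;
}
```

With this check, "cute" can be formed from "computer" but "tree" cannot, because it needs two e's. The memory use is a fixed table of counts, regardless of word length.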

Article picture based on https://commons.wikimedia.org/wiki/File:2001-91-1_Computer,_Laptop,_Pentagon_(5891422370).jpg
