GENERATING RECORD NUMBERS

            If CDOS needs a record number to find a file, how is that
        record number supplied?  There are a number of ways.  The
        easiest is to let the user supply the record number of the
        record he/she wants, like in the preceding programs.  But in
        reality, how does the user know what the record number is when
        he/she doesn't know where in the file the record is?
        Clairvoyance?  CDOS is smart, but it's not quite the
        Clairvoyant Disk Operating System!

            How often is a number part of the data in the record?
        Often!  With the direct address method just take that number
        and use it or abuse it as necessary to come up with a record
        number.  For example, we might have a parts inventory file.
        Each part is marked with a department number and item code:
        ddd-iii; where ddd is the department number and iii is the
        item code.  The inventory file contains this department
        number/item code, the part name, its cost, its retail price,
        the vendor, and the number remaining in inventory.  A
        warehouse technician wants to find out how many of a certain
        part are still hidden in the warehouse.  He/she punches in the
        department/item code: 659-345.  Our inventory program consists
        of several random files: one file for each department.  The
        program takes the first three digits and opens the appropriate
        file: 0:DEP659.DO, and takes the last three digits as the
        record number in that file: 345.  CDOS then finds record
        number 345 and the program displays the data in that record.
        Note that the records in this file are in numerical order, so
        in a sense it's a sequential file that can be accessed at
        random.  Purists take note that the term "direct addressing"
        is used rather loosely here.  Although direct addressing as
        used here has limited practical applications, it does produce
        only one unique record number for each key entered.

            If we need to search for a record based on the input of
        alphabetic data, we can generate a record number from that
        input.  This is called hashing, or key transformation, because
        the search key entered by the user is transformed into a
        record number.  Hashing is particularly suitable for files
        that are not maintained in alphabetical or numerical order, or
        for files where records are added and deleted often (which
        makes repeated sorting impractical).

            A third technique uses two files: an index file and a data
        file.  The index file has only two fields: the search key and
        the record number where that record is located in the main
        data file.  You may use either a random file or a sequential
        file for an index, but it must be sorted for fast lookup of
        the record number.  A sorted file is called an ordered file.
        Read a sequential index file into RAM first for fast lookup.



                                    - 31 -











        (Index files too large for available RAM can be searched on
        disk.  Refer to the section on searching).

            Other methods of random access file organization exist
        which will work with CDOS Disk Basic.  Please consult advanced
        programming and data management textbooks.


        DELETING RECORDS

            In the sequential file, a deleted record is physically
        removed from the file.  It's gone forever.  It's possible to
        do that, too, with random files, by putting blanks or
        something in the fields and destroying the data therein.  But
        smart programmers use a better way.  Instead of destroying the
        data, just "mark" the record as deleted.  Then, unless the
        record has since been overwritten by a new record, it can
        still be undeleted.  The record remains, but the search
        operation recognizes the mark and ignores it.  We could put a
        one character field in the record that contains a certain
        character to signify a valid record and another character to
        signify a deleted record.  Or we could alter one character of
        the key field.  Here's one method:

            1000 ' Do a logical OR with 128 on the first character
                   of the key field:
            1010 X$ = CHR$(ASC(KY$) OR 128)
            1020 ' Put that modified character back in the key:
            1030 MID$(KY$,1,1) = X$

            Run that little example with your own input and see what
        happens to the first character.  For the "bit twiddlers," we
        set the high bit of that character.  To recover the marked
        record, find it and then flip the first character back to
        normal:

            2000 ' Do a logical AND with 127 on the first character
                   of the key field:
            2010 X$ = CHR$(ASC(KY$) AND 127)
            2020 ' Put it back:
            2030 MID$(KY$,1,1) = X$


        RANDOM ACCESS BY KEY TRANSFORMATION

            Alphabetic characters have an intrinsic numeric value:
        their ASCII code.  For example, the ASCII code for "A" is 65
        and the ASCII code for "a" is 97 (decimal).  Why not take the
        ASCII values of several of the characters in the key entered
        by the user, do some arithmetic on them, and generate a record
        number?  If uppercase versus lowercase isn't crucial when



                                    - 32 -











        matching the key to the desired record, good technique calls
        for converting the input to all of one or the other before
        hashing, so Smith = smith = SMITH.

            When planning a hash, consider only techniques that
        generate reasonable record numbers.  We don't want record
        numbers too small to provide for all our records, nor do we
        want record numbers larger than the total number of records
        expected.  For example, in a file expected to contain about
        1000 records, don't use a hash that generates a record number
        of 45,673.  We also want the record numbers spread out over
        the size of the file, so that not all the records get
        clustered in one area of an otherwise empty file.

            Also consider how to handle hashes that produce identical
        record numbers from different keys.  The matching record
        number is a synonym, and the event is a collision.  It's
        possible for synonyms to occur with most conceivable
        techniques, but if it happens frequently, find a better hash.
        Two of the easiest ways to handle synonyms are to increment
        the record number until an empty record location is found, and
        put it there; or to put the record in an overflow area above
        the top of the main file, where synonyms are stored
        sequentially.

            Here's an example of one of many hashing techniques used
        to add records: decide how many records are likely to be in
        the file.  Take several of the ASCII values of the key entered
        by the user and come up with a large number.  Then divide that
        by a prime number close to the file size and take the
        remainder.  This produces a number from 1 to the prime.  Use
        this as the record number.  Thusly for a file size of 500:

            1. Open the file
            2. Define the fields

            And the coding for the hash:

            100 ' 3. User enters data:
            101 PRINT "Enter record data:";
            102 '    Input statements here.
            103 '    Use one of the fields as a key.
            110 ' 4. Separate the first three letters of the key:
            111 K1$ = LEFT$(KY$,1): K2$ = MID$(KY$,2,1):
                K3$ = MID$(KY$,3,1)
            120 ' 5. Convert each into its ASCII value:
            121 K1 = ASC(K1$): K2 = ASC(K2$): K3 = ASC(K3$)
            130 ' 6. Convert the ASCII values into strings:
            131 K1$ = STR$(K1): K2$ = STR$(K2): K3 = STR$(K3)
            140 ' 7. Trim the leading space off the strings:
            141 K1$ = RIGHT$(K1$,2): K2$ = RIGHT$(K2$,2):
                K3$ = RIGHT$(K3$,2)  ' Uppercase input



                                    - 33 -











            150 ' 8. Concatenate the three strings:
            151 KT$ = K1$ + K2$ + K3$
            160 ' 9. Convert the string into a number:
            161 KT = VAL(KT$)
            170 ' 10. Divide by 499 to find the quotient:
            171 QU = KT / 499
            180 ' 11. Multiply the integer portion of the quotient
                      by the divisor and subtract from the dividend
                      to find the remainder:
            181 RM = KT - INT(QU) * 499
            190 ' 12. The remainder is the record number:
            191 RN = RM

                And the rest of the program:

                13. Is this record number past the end of the file?
                    (IF RN > LOF THEN...).
                    Yes: PUT #1, RN: Go to step 3.
                    No:  Proceed with step 14.
                14. Is there a valid record here?
                    Case 1)  RN < 500
                             Yes: RN = 500: Go to step 13.
                             No:  PUT #1, RN: Go to step 3.
                    Case 2)  RN >= 500
                             Yes: RN = RN + 1: Go to step 13.
                             No:  PUT #1, RN: Go to step 3.

            Synonyms go in an overflow area starting at record number
        500.  There they are added sequentially.  This assumes
        infrequent collisions.  The frequency of collisions depends on
        the number ("density") of records in the file.  When most of
        the record locations are filled with valid records ("dense
        population") collisions are more likely to occur when new
        records are added than if there are many empty locations
        ("sparse population").  It's harder to find a parking place in
        Manhattan than in Pecos, Texas.  Frequent collisions in our
        example program can be handled by adding 500 to the record
        number and starting sequentially from there.  This expands the
        size of the overflow, because we start looking for an empty
        slot at record number + 500 (which could be as high as 999)
        instead of starting at 500.

            Reading a record follows the same procedure.  In fact,
        this program can be efficiently coded to read and write
        records using the same routines.  Query the user as to whether
        he/she wants to find a record or add a record, and set a flag
        accordingly.  Then just do a GET instead of a PUT.  If the
        record requested isn't found by the end of the overflow area,
        then return a "Not found" message.





                                    - 34 -











            For faster access (and more complicated code), add a field
        to the records in the main file storage area that points to
        synonyms of that record located in the overflow area.  For
        example, let's say a record to be added has a record number of
        345, but there's already a valid record in record location
        345.  Searching through the overflow area produces an empty
        slot at 523.  Put the new record at 523.  Then put the number
        523 in the pointer field of record 345.  This links record 523
        to record 345.  When searching for record 523, the hash first
        generates a record number of 345.  Retrieving record 345
        reveals a valid record that does not match the search key
        entered.  Look in the pointer field and find the number 523.
        This points to the record located at 523.  Retrieve record 523
        and check for a match.  Multiple synonyms can be handled be
        storing them sequentially after the first synonym in the
        overflow.

            To change a record, just search for and find it.  Then put
        the new data back where the old was found.  If there's a
        pointer field, be sure to preserve it.

            Be careful when deleting records.  Deleting a record in
        the main file that points to a record in the overflow also
        loses the record pointed to.  Even if pointers aren't used,
        deleting a record in the main file results in losing its
        synonym in the overflow.  Provide a way to find a record in
        the overflow even if its synonym in the main file (the entry
        point) is deleted.  If using pointers, delete a record with a
        marker and preserve the pointer.  Then a search operation that
        finds a deleted record can still check the pointer field to
        see if a synonym is stored in the overflow.  If not using
        pointers, a search operation that finds a deleted record in
        the main file can start looking sequentially in the overflow
        for a match.  When deleting an overflow record, make provision
        for multiple synonyms.  If the search operation goes to the
        overflow to look for a match and finds a deleted record, it's
        not safe to assume there isn't a match.  The deleted record
        may have been the synonym pointed to by the record in the main
        file, with other synonyms stored sequentially behind it.  When
        an overflow record is deleted and there are other synonyms,
        the deleted record can be marked to signal the search
        operation to keep looking.  Or an easier (but slower)
        technique continues to look for a match until an empty record
        location (or end of the overflow) is reached.

            There are numerous twists on the theme of hashing record
        numbers, handling synonyms, adding, changing, and deleting
        records in this type of random access file.  Check out an
        advanced data management text for more ideas.





                                    - 35 -