Any faster?

Ben · March 10th, 2004, 08:10 AM

Hi JÃ¼rgen

Thanks very much for your reply, the method you proposed took 19.778 seconds to process 8 records, again like the other two methods, it processed the first record twice

Also the method of timing I'm using is looking at how long each instance of the function takes to run, it doesn't track the time between function calls - the proposed method visually seemed to take longer to initiate and seemed longer between each function call.

Ben

joefawcett · March 10th, 2004, 09:19 AM

Quote:

quote:Originally posted by Ben
Strange - I was posting the results to the debug window to ensure they were correct and both methods seem to process the first record twice?????

Did you add anything to the function I wrote, only has record once on my system? How exactly are you calling it?

--

Joe

joefawcett · March 10th, 2004, 09:21 AM

The other thing to do is to create the recordset outside of the function and re-use it by opening and closing as necessary. This will remove the overhead of creating the object many times.

I was trying to work out a cross tab query to produce your desired results but have not been able to get it right yet...

--

Joe

Ben · March 10th, 2004, 09:42 AM

I made a couple of small changes, I changed the SQL to exclude the Top 8 command as there will never me more than 8 records returned, and I changed the getstring function to place a comma instead of a carriage return character. I also added in code to time the function. The exact code as it appears in the DB is below:

Public Function CombineFieldsADO(UniqueCompositeKey)

Dim recTopics As ADOR.Recordset, sSQL As String, sdata As String, i As Integer, sngStart As Single, sngEnd As Single

sngStart = Timer

CombineFieldsADO = UniqueCompositeKey
sSQL = "SELECT Topics.Topic FROM Topics WHERE (((Topics.RecID) = '" & UniqueCompositeKey & "'));"

Set recTopics = New ADOR.Recordset
recTopics.Open sSQL, CurrentProject().Connection

sdata = UniqueCompositeKey & "," & recTopics.GetString(adClipString, , ",", ",", "")

recTopics.Close
CombineFieldsADO = sdata

sngEnd = Timer

Debug.Print CombineFieldsADO
Debug.Print "It took " & Format$(sngEnd - sngStart, "0.000") & " Seconds"

End Function

I'm calling the function through a query on a recordset of all the unique keys (circa 25,000).
Visually it looks like it processes the first record once, but when you look at the stats posted to the debug window it actually processes the first record twice.

Ben

jurgenw · March 10th, 2004, 10:59 AM

If the time your test is displaying accounts for most of the time, then the problem isn't the driving query. It has always been my principle that running a line of code takes less time than retrieving a nibble of data. In fact, it is nearly always faster than retrieving a single bit of data over a LAN.

I would break down your testing to see what's taking the time. For example, at the top of the module containing your function, I'd declare a windows multimedia timer with a millisecond resolution and debug print the time before and immediately after opening your recordset and at other points in time to see how much contributes. The timeGetTime declare is simple and quite accurate. If you need more resolution you can look into the queryperformancecounter (which has nothing to do with queries by the way).

The multimedia timer:

Public Declare Function timeGetTime Lib "Winmm" () As Long

and you use it in your function

Function()
dim t as long

t = timegettime
some code
debug.print timegettime - t & " part x of code"
t = timegettime
some more code
debug.print timegettime - t & " part n of code" & vbcrlf & vbcrlf & "end of call"

Among other things, your timer and this suggestion will identify how much of the time is taken by the 'calling query'.

Now to more basics. Can you move the data to a local disk? Is there an index (unique key, most likely is). How about dimensioning the recordset one time in the module declarations and set it in a procedure that calls the query and then destroys it after the query runs. If the database is not in the same file as your code, you can set the database object with 'opendatabase' and open the actual database rather than a link as you would get with CurrentDb. This enables you to try seek methods against a persistent recordset in memory and use move, find or seek methods to test values against the key.

If I had to do this and found these slow results, I would dimension a getrows array with an order by and run the caller query with the same order by. That would permit you to iterate the array in a single direction with a static variable index so you don't need to start from the beginning each time. A getrows array has less overhead than the recordset from which it is based and caches all the data locally in RAM so you lose the overhead of opening the recordset with each call.

There should be no reason that you can't cut the time in the function down to a few milliseconds per call once you have the data cached in RAM, either as a recordset or an array.

Ciao
JÃ¼rgen Welz
Edmonton AB Canada
[email protected]

Ben · March 10th, 2004, 11:35 AM

Hi JÃ¼rgen,

The data I'm querying is held in the same MDB as the code and while I have been doing the tests it has been located on my C:, so network issues are out of the picture.
The original data in the table does have both a primary key and indexes. The meat of the function is the DAO/ADO query which looks at 31000 records (supplied by a query) and returns up to eight records that match the unique key (supplied by a query).

The first query mentioned concatenates the first four fields in the table (after the primary key), these new "Composite Keys" can be duplicated up to eight times each with a topic.

The second query mentioned does a Select Distinct on the "Composite Keys"

Ben

jurgenw · March 10th, 2004, 12:30 PM

OK, here's come air code to show you what I mean. Assuming the calling function can be sorted by your composite key, I would write a calling procedure to run your driving query and in it, I would first call the InitializeData routine, run your driving query and then maybe add a CloseData routine to erase the array. Here's what I'm suggesting:

Option Compare Database
Option Explicit

Dim mlngC As Long
Dim mArr As Variant

Public Sub InitializeData
Dim r As DAO.Recordset

Set r = CurrentDb.OpenRecordset("Select RecID, Topic From Topic Order By RecID")
r.MoveLast
r.MoveFirst
mArr = r.GetRows(r.RecordCount)
r.Close
Set r = Nothing
mlngC = 0
End Sub

Public Function CombineFields(strKey As String) As String
Dim strTemp As String
Dim lngI As Long

strTemp = strKey
If vArr(0, mlngC < strKey) Then
    Do While vArr(0, mlngC) < strKey)
        mlngC = mlngC + 1
    Loop
End If
If vArr(0, mlngC) = strKey Then
    strTemp = strTemp & "," & vArr(1, mlngC)
    lngI = 1
    Do While vArr(0, mlngC + lngI) = strKey
        strTemp = strTemp & "," & vArr(1, mlngC + lngI)
    Loop
    For lngI = lngI To 8
        strTemp = strTemp & ","
    Next
Else
    strTemp = strTemp & ",,,,,,,,"
End If
CombineFields = strTemp
End Function

This approach will require processing to build the recordset (which I assume must be the lion's share of the time) only one time. Subsequent calls should take milliseconds to process.

Ciao
JÃ¼rgen Welz
Edmonton AB Canada
[email protected]

jurgenw · March 10th, 2004, 12:34 PM

The problem with posting air code is you miss things. I forgot to increment the lngI in the loop:

    Do While vArr(0, mlngC + lngI) = strKey
        strTemp = strTemp & "," & vArr(1, mlngC + lngI)
        lngI = lngI + 1
    Loop

Hope there are no other errors.....

Ciao
JÃ¼rgen Welz
Edmonton AB Canada
[email protected]

Ben · March 11th, 2004, 07:53 AM

Sweeeeet, thanks JÃ¼rgen

A lesson learned, the whole query ran in seconds!

I really appreciate your help on this

Thanks Ben

jurgenw · March 12th, 2004, 07:04 PM

Seconds as opposed to 8 Records/19 seconds * 31,000 Records * 1 minute/60 Seconds * 1 Hour/60 Minutes = a bit over 3.5 hours?

I just had another look at the air code I posted and note that the first If - End If loop can be removed:

If vArr(0, mlngC < strKey) Then
.
.
.
End If

The nested Do While loop should remain and run correctly even without the If test.

The other thing I should mention is that the primary reason that this routine is faster is because it can work against data cached one time in memory. The array processing is extra fast because it is a single direction movement continuing from where it left off thanks to the ability to sort the calling query and the recordset. If instead of an array you used a recordset and used seek (local table) for the first instance (again assuming sorted) and move next until the key no longer matched, you may find a virtually identical performance increase.

One other notion I've heard at p2p that I believe is worth dispelling:

It is faster to process a Long than an Integer as a loop counter on a Win32 computer (any version of Windows after Win 95 currently available to consumers). The notion that it is more efficient to use a smaller dataytpe in code is wrong except in the case of large arrays. This would not be the case where you have one or a few nested loop counters. Even if you have a dozen such variables, the use of each Integer variable rather than a Long uses up the same amount of RAM in a 32 bit environment (an address to the variable is a long and you can't address something less than a long) and when an integer is passed to the CPU, a conversion has to take place to interpret the data as an Integer in the 32 bit space. Sure the conversion happens in a few cycles at billion cycles per second with a gigahertz CPU but the Long datatype does not require this conversion. For this reason I always use a Long for a loop counter.

When you are pulling data off a LAN, it is coming at LAN speed which is about 10 million bytes per second with stop and parity bit (there is actually more overhead and less throughput as a result) assuming you have sole use of 100% of the bandwidth of a typical modern LAN. With 10 users all hitting a file server at the same time, you have disk time and latency and only 1/10th of the bandwith anyway so maybe a million bytes per second. The cpu has a four thousand processor cycles for each long to arrive over the LAN. The result is, that although you are ahead of the game to use Longs in your code, when you are pulling data over a LAN, you save a ton of time by using a smaller datatype. Switching from a Long foreign key to an Integer key (at least as a foreign key, you don't want to give up autonumbers) will save you a great deal of traffic on the LAN. Not only will the user running a query be happier with the speed, all other users of the LAN benefit as well.

Next point. Create a database and add a table with an autonumber field and a text field. Run a loop to add 30 thousand records, adding something like a random 10 character string to the text field. Compact the database and check the size. Now add an index to the text field, compact and check the size. The size of the index is pretty much the size of the total data in the field plus 4 bytes per record, a long which is the sorted pointer or numeric index value.

When you retrieve a table in a file server environment like Access, even if the file is local, the data must travel from the disk to memory in order to be processed. A LAN inserts a huge delay in the transfer time and a WAN makes Access unweildy. What is being retrieved by Access when you open a recordset with a where clause on an indexed field is the index. After it processes the index, it knows which parts of the disk file to request and pulls your data. It is incorrect to say that it pulls the entire table except in the case where it cannot take advantage of an index. As a join key, an autonumber is without peer because, even though it is a long, it is in itself an index, 4 bytes and you're done. Not the data plus another four bytes. If you index an integer data field, you add the size of the integer plus a long to your database for each record in the database. The result is that as a primary key, using an autonumber requires 4 bytes for a unique identifier. Using an integer and indexing starts with 2 bytes for the integer plus 2 bytes for the sorted integer field plus 4 bytes for the actual index. Now you have 8 bytes for your primary key and you've cost yourself an extra 4 bytes in traffic per record over just using a long autonumber plus, when a join is processed in the CPU, you've got some conversion overhead.

You can use an integer type as a foreign key since in the foreign table, the size of the index on a long is the long plus a long (8 bytes) whereas had you chosen an integer it would be 6 bytes per record, the integer plus the long.

If you give this a little extra thought, you will see that a typical junction table of something like a table joining CompanyID to ContactID may be 8 bytes per record. You may further join a CompanyContact to a ProjectID in another junction table. If you add an autonumber field (CompanyContactID) to the CompanyContact junction table, your index is the autonumber, 4 bytes per record. Joining this field to other tables uses far fewer bytes than the index on the two columns (4 bytes for the first field, 4 for the second and 4 for the index value). Adding an autonumber seems to make the table larger, it now has 3 fields rather than two, but you've only added 4 bytes per record that you can now use for an efficient join. Joining on the dual key requires 12 bytes per record of traffic, even though the table only has two fields in it.

Bottom line. ALWAYS add an autonumber key to every table and only ever join on that field.

I have to add a small disclaimer. I've done the size of index testing and assumed that the index is kept in a hidden table stored on contiguous disk clusters. I'm sure after a compact this is generally true. Not knowing how the data pages that store the actual data on disk may mean that some unused data will have to travel to memory before it is discarded so the advantage you gain with the autonumber may not be quite so large when you pull full table recordsets. If you commonly bind to single record recordsets in situations where you don't continually navigate forward and back through a group of records, the advantage can be quite enormous.

Ciao
JÃ¼rgen Welz
Edmonton AB Canada
[email protected]