Groupings and Speed

asearle · November 10th, 2006, 04:03 AM

Dear All,

I have an XSL stylesheet which reads data (from an XML section titled 'rep_base') and displays the same data, once grouped on Team (tm) and once grouped on Creditor (cred). The issue is that the simpler view (team) is much slower than the more complicated view. Let me explain ...

The underlying data is identical but is grouped over about 15 teams or, alternatively, about 200 creditors.

The display by team is very simple, showing only one value (GR_CNT), whereas the display by creditor is more complicated (showing 7 values).

Each display also retrieves the text for the item retrieved, e.g.
   <xsl:variable name="pteamcodestore" select="tm"/>
   <xsl:value-of select="key('krefteam', $pteamcodestore)[1]/TEAM_NAME" />
   ... to get the TEAM_NAME

Some of the code is parameterised (e.g. [*[name()=$plevel]=$pfiltfld] to apply a filter such as 'cred=ABC') but this is identical both for the team and the creditor groupings.
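
To make the parameterised filter concrete, here is a minimal sketch (not the original stylesheet). It assumes the rep_base records sit under a hypothetical root element called report, that plevel holds the name of the field to filter on, and that pfiltfld holds the value to match. The default parameter values are illustrative only.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed defaults: filter on the cred field for the value 'ABC' -->
  <xsl:param name="plevel" select="'cred'"/>
  <xsl:param name="pfiltfld" select="'ABC'"/>

  <!-- 'report' is a hypothetical root element name -->
  <xsl:template match="/report">
    <!-- *[name()=$plevel] picks the child element whose name equals $plevel;
         the comparison is true when such a child has the string value $pfiltfld -->
    <xsl:value-of select="count(rep_base[*[name()=$plevel]=$pfiltfld])"/>
    <xsl:text> matching records&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

With these defaults the predicate behaves like rep_base[cred='ABC']. Passing different parameter values switches the filter field without changing the stylesheet.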

When I run the sheet (with views switched by javascript), the team view takes about 11 secs (up to 30 secs for larger volumes of data) while the creditor view returns data in about 1 or 2 secs. This speed remains constant even if I switch between the views multiple times.

So there are two issues for me here: First, I would like to increase the speed of my selection/display and, second, I would be most interested to understand how different groupings impact on speed.

Or maybe there is an error in my code and/or keys which I haven't noticed.

Any tips would be a great help.

Regards and many thanks,
Alan Searle

--------------------------------------------------
Code Excerpts ...

[key declarations ...]
<xsl:key name="kteam" match="rep_base" use="tm" />
<xsl:key name="kcred" match="rep_base" use="cred" />

<xsl:key name="krefteam" match="ref_team" use="TEAM_CODE" />
<xsl:key name="krefcred" match="ref_cred" use="CRED_CODE" />

[template call: team ...]
<xsl:apply-templates
     select="rep_base[generate-id()=generate-id(key('kteam', tm)
         [(*[name()=$plevel]=$pfiltfld)])]" mode="dteam">
     <xsl:sort select="tm"/>
</xsl:apply-templates>

[template call: creditor ...]
<xsl:apply-templates
     select="rep_base[generate-id()=generate-id(key('kcred', cred)
        [(*[name()=$plevel]=$pfiltfld)])]" mode="dcreditor">
    <xsl:sort select="plt"/>
    <xsl:sort select="cred"/>
</xsl:apply-templates>

[template: team ...]
<xsl:template match="rep_base" mode="dteam">
<tr>
   <td><xsl:value-of select="plt" /></td>
   <td><xsl:value-of select="tm" /></td>
   <xsl:variable name="pteamcodestore" select="tm"/>
   <td><xsl:value-of
         select="key('krefteam', $pteamcodestore)[1]/TEAM_NAME" /></td>
   <td ALIGN="RIGHT"><xsl:value-of
         select="sum(key('kteam', tm)[*[name()=$plevel]=$pfiltfld]/GR_CNT)" /></td>
</tr>
</xsl:template>

[template: creditor ...]
<xsl:template match="rep_base" mode="dcreditor">
<tr>
  <td><xsl:value-of select="plt" /> </td>
  <td><xsl:value-of select="cred" /> </td>
   <xsl:variable name="pcredcodestore" select="cred"/>
   <td><xsl:value-of select="key('krefcred', $pcredcodestore)[1]/CRDTR_NAME" /></td>
   <td ALIGN="RIGHT"><xsl:value-of
         select="sum(key('kcred', cred)[*[name()=$plevel]=$pfiltfld]/GR_CNT)" /></td>
   <td ALIGN="RIGHT"><xsl:value-of
         select="sum(key('kcred', cred)[*[name()=$plevel]=$pfiltfld]/GRIO_CNT)" /></td>
   <td ALIGN="RIGHT"><xsl:value-of
         select="format-number(1-(sum(key('kcred', cred)[*[name()=$plevel]=$pfiltfld]/zGRIO_CNT)
         div sum(key('kcred', cred)[*[name()=$plevel]=$pfiltfld]/GR_CNT)), '#0.00%')" />
   </td><td ALIGN="RIGHT"><xsl:value-of
         select="sum(key('kcred', cred)[*[name()=$plevel]=$pfiltfld]/GRIO_GEN)" /></td>
   <td ALIGN="RIGHT"><xsl:value-of
         select="sum(key('kcred', cred)[*[name()=$plevel]=$pfiltfld]/GRIO_MAN)" /></td>
   <td ALIGN="RIGHT"><xsl:value-of
         select="sum(key('kcred', cred)[*[name()=$plevel]=$pfiltfld]/GRIO_SKIP)" /></td>
   <td ALIGN="RIGHT"><xsl:value-of
         select="sum(key('kcred', cred)[*[name()=$plevel]=$pfiltfld]/GRIO_NONSKIP)" /></td>
</tr>
</xsl:template>

mhkay · November 10th, 2006, 01:34 PM

Performance issues are necessarily product-dependent. I don't think it's possible to give any useful help without knowing what product you are using; and for my part, I can't give any useful help unless that product happens to be Saxon.

Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference

asearle · November 13th, 2006, 11:58 AM

Michael,

How do I check which XSL interpreter is being used by our browser (IE 6.0, sp2)?

As it is, I have been investigating this quite a lot (it could be a show-stopper for us) and have found the following pattern:

Using keys and the Muenchian method for grouping and summing (as shown in my earlier code samples) I can get the result that I require but have different response times.

Anyway, I have stripped out all unnecessary code (including the summing) so that I was left with the key statements and have found the following (while querying the same XML data):

- The fewer groups there are (that is, the more records fall into each group), the slower the response time.

For example:
1. there are several hundred 'creditors' and when I group on them, I get almost instant response.
2. there are about 30 'teams' and a grouping on this takes about 10 secs.
3. there are 6 'business units' and a grouping on this takes about 20 secs.

These tests were done using exactly the same code, just changing the key definitions.

So now I am very curious to know whether this is a known issue. And if so, are there any workarounds or suggestions?

Or maybe it is an issue with my XML (see below)? Maybe I should include the key fields as attributes (e.g. <item cred='abc' team='team1'> ) rather than separate data items? Or doesn't that make a difference?

As it is, it all seems very strange because the code is querying exactly the same data.

Any ideas / tips would be a great help.

Many thanks,
Alan Searle

PS: Here is an entry from the XML showing a record with the various fields that are used for grouping ...

<rep_base>
<PERIOD>1</PERIOD>
<businessunit>V</businessunit>
<plant>E</plant>
<team>VZT</team>
<creditor>23002242</creditor>
<GRIO_CNT>1</GRIO_CNT>
<GRIO_GEN>1</GRIO_GEN>
<GRIO_SKIP>1</GRIO_SKIP>
<GRIO_INSPEC_OK>1</GRIO_INSPEC_OK>
<GR_CNT>1</GR_CNT>
<GR_STUCK>30</GR_STUCK>
<GR_CNT_CHECK>1</GR_CNT_CHECK>
</rep_base>

mhkay · November 13th, 2006, 12:05 PM

The processor in IE is MSXML (not sure which version).

I'm afraid I can't give you any information about MSXML performance characteristics. Ask Microsoft.

Michael Kay

asearle · November 14th, 2006, 05:30 AM

Hi Michael,

I have been doing a lot of googling on the speed issue (it is quite central to our reporting) and have found the following site giving statistics on the speed of various grouping options:

http://www.tkachenko.com/blog/archives/000401.html

Here it looks like the Muenchian method can suffer from big drops in speed when you try to scale it up to handle larger volumes of data (see the chart).

The most efficient option seemed to be the 'set:distinct' syntax, but that didn't work with my browser (IE 6.0 sp2), so I tried the second fastest, the 'count' syntax, which is supposedly still significantly faster than the classic Muenchian method.

Below is the recommended syntax:

<xsl:for-each select="
orders[count(.| key('countryKey', @ShipCountry)[1]) = 1]">

I tried this on my data and, yes, it is significantly faster than the normal 'generate-id' syntax.

However, when I tried to incorporate filters (plant='M'):

<xsl:for-each select="rep_base[count(.| key('testkey', team)[plant='M']) = 1]">

... and summing (summing on GR_CNT) ...

<xsl:value-of select="sum(count(.| key('testkey', team)[1])=1/GR_CNT)"/>

... it wouldn't work.

Maybe the 'count' syntax is not conducive to filtering and summing (i.e. it's already applying an arithmetic function to the data)? Or maybe my syntax is wrong?

I know that I am using a different processor to yours (MSXML) but understanding alternative grouping techniques might be interesting for many forum visitors.

As soon as I have results from my performance testing, I will post them here for everyone to see :-)

Many thanks,
Alan Searle

mhkay · November 14th, 2006, 05:46 AM

I remember Jeni Tennison did some performance comparisons a few years ago of the two different ways of doing identity testing in the Muenchian method (the generate-id() approach and the count() approach). She found that with some processors, generate-id() was faster, with others, count() was faster. This is exactly why it's impossible to give performance advice without knowing a lot about the specific processor.

I can't see anything obviously wrong with this expression:

<xsl:for-each select="rep_base[count(.| key('testkey', team)[plant='M']) = 1]">

but perhaps it doesn't match your data.

This expression, however, is complete nonsense:

<xsl:value-of select="sum(count(.| key('testkey', team)[1])=1/GR_CNT)"/>

I can't even work out what you intended it to mean.

Michael Kay

asearle · November 14th, 2006, 06:25 AM

Yipppeeeee!

Michael, I tweaked the syntax (based on your feedback) and found that with the following:

<xsl:for-each select="rep_base[count(.| key('testkey', team)[1])=1 and plant='M']">

... as the grouping expression (with filter) and the following:

<xsl:value-of select="sum(key('testkey', team)/GR_CNT)"/>

... as the summing clause ...

I could get my speed down from about 12 secs to 2 secs.
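
For anyone wanting to try this combination, here is a minimal sketch of how the two snippets might sit in a complete stylesheet. The thread does not show the declaration of 'testkey', so the xsl:key below is an assumption, as is the root element name report.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Assumed key declaration: index rep_base records by their team value -->
  <xsl:key name="testkey" match="rep_base" use="team"/>

  <!-- 'report' is a hypothetical root element name -->
  <xsl:template match="/report">
    <table>
      <!-- Selects a record only if it is the first record of its team
           AND that same record has plant='M' -->
      <xsl:for-each
          select="rep_base[count(. | key('testkey', team)[1]) = 1 and plant='M']">
        <xsl:sort select="team"/>
        <tr>
          <td><xsl:value-of select="team"/></td>
          <!-- Sums GR_CNT over every record of the team, whatever its plant -->
          <td align="right">
            <xsl:value-of select="sum(key('testkey', team)/GR_CNT)"/>
          </td>
        </tr>
      </xsl:for-each>
    </table>
  </xsl:template>
</xsl:stylesheet>
```

Note that in this form the plant filter is applied only to the first record of each team, and the sum is not filtered at all, so the totals cover the whole team.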

We are very pleased about this because the speed issues were really threatening the project.

I also hope that these snippets of syntax will help others who can experiment with the two grouping/summing techniques to see which is best for them.

And thanks for coming back with replies even though this (i.e. MSXML) isn't really your field.

Cheers,
Alan.

asearle · November 15th, 2006, 09:13 AM

Topic: Difference in behaviour between generate-id and count ?

As already mentioned:

For some of my summing/grouping operations I have been using the 'count' command instead of the 'generate-id' method.

In many cases this works fine but one situation is giving me problems.

Below are the two versions (of the same functionality) that I have been experimenting with:

1. <xsl:for-each select="rep_base[generate-id()=generate-id(key('kteam', team)[(cred=$pfiltfld)])]">
2. <xsl:for-each select="rep_base[count(.| key('kteam', team)[1])=1 and (cred=$pfiltfld)]">
[then ...]
  <tr>
  <td ALIGN="RIGHT">
  <xsl:value-of select="sum(key('kteam', team)[cred=$pfiltfld]/GR_CNT)" />

With version '1.', I get the correct summing and grouping (but with slow speed) and with version '2.' (count) I get the column headers but the operation returns no data.

I then removed the second filter on version '2.' (i.e. "and (cred=$pfiltfld)" ) and found that data was returned very quickly, but of course I got full data rather than data specific to my 'cred' filter. The strange thing is that the 'cred' filter must work because (as you can see) it is used in several locations.

(NB: During these tests I made no other change to the code)

So this is starting to make me think that maybe the 'generate-id' and the 'count' methods handle filter logic in different ways, but I am not sure why and how. Indeed, similar (count) logic has worked fine in other locations.

Maybe a second pair of eyes looking at my code will see where I have gone wrong? Or maybe there is a difference in processing logic that I am not aware of?

Many thanks for your help.

Regards,
Alan Searle.

mhkay · November 15th, 2006, 09:24 AM

1. <xsl:for-each select="rep_base[generate-id()=generate-id(key('kteam', team)[(cred=$pfiltfld)])]">

2. <xsl:for-each select="rep_base[count(.| key('kteam', team)[1])=1 and (cred=$pfiltfld)]">

It seems to me that you are using these constructs without fully understanding what they mean.

Firstly, note that in the first construct you are relying on the fact that generate-id(), if given a set containing multiple nodes, returns the id of the first node in the set. So

generate-id() = generate-id(key('k', team))

tests whether the current node is the first node with the relevant key value. When you add a predicate

generate-id() = generate-id(key('k', team)[cred=$p])

you are testing whether it is the first node of the subset of nodes with that key value that satisfy the predicate.

The expression count(.| key('kteam', team)[1])=1 returns true if the current node is the first node selected by key('kteam', team). If you want a predicate equivalent to the above, the way to get it is

count(.| key('kteam', team)[cred=$p][1])=1

What you have actually written is

count(.| key('kteam', team)[1])=1 and (cred=$pfiltfld)

which returns true only if (a) this node is the first among all the nodes with this team value, and (b) this node also satisfies the predicate.

If the first node doesn't satisfy the predicate, then the two expressions will give you quite different answers.
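
A small sketch can make the difference visible. The sample data and the root element name report are invented for illustration. The first record of team A has cred X, so it does not satisfy a filter on cred Y.

```xml
<report>
  <rep_base><team>A</team><cred>X</cred><GR_CNT>1</GR_CNT></rep_base>
  <rep_base><team>A</team><cred>Y</cred><GR_CNT>2</GR_CNT></rep_base>
  <rep_base><team>B</team><cred>Y</cred><GR_CNT>4</GR_CNT></rep_base>
</report>
```

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:key name="kteam" match="rep_base" use="team"/>

  <!-- Illustrative filter value -->
  <xsl:param name="pfiltfld" select="'Y'"/>

  <xsl:template match="/report">
    <!-- First record per team among those that satisfy the filter -->
    <xsl:text>generate-id form: </xsl:text>
    <xsl:for-each
        select="rep_base[generate-id() =
                         generate-id(key('kteam', team)[cred=$pfiltfld])]">
      <xsl:value-of select="concat(team, '/', cred, ' ')"/>
    </xsl:for-each>

    <!-- First record per team overall, kept only if it also satisfies the filter -->
    <xsl:text>&#10;count-and form: </xsl:text>
    <xsl:for-each
        select="rep_base[count(. | key('kteam', team)[1]) = 1
                         and cred=$pfiltfld]">
      <xsl:value-of select="concat(team, '/', cred, ' ')"/>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

On this input the first form should list A/Y and B/Y, while the second lists only B/Y: team A disappears because its first record has cred X.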

Michael Kay

asearle · December 11th, 2006, 05:17 AM

Hallo Michael,

Thanks for the explanation and, yes, I agree that I didn't/don't completely understand the logic. As it is I have been doing lots of experimenting with generate-id() and count() clauses and am still having some problems. Maybe you can take a quick peep at my syntax and see if you can spot the issue?

In several locations I use the generate-id() clause thus ...

<xsl:for-each select="rep_base[generate-id()=generate-id(key('kteam', tm)[(cred=$pfiltfld)])]">

... using the key 'kteam':

<xsl:key name="kteam" match="rep_base" use="tm" />

This retrieves data grouped by teams (tm), filtered by creditors (cred) and works fine but is slow.

Syntax using the count() clause seems to be much faster and I am using it in some locations. Here, however, I am having problems:

Michael, you have already mentioned that ...

<xsl:for-each select="rep_base[count(.| key('kteam', tm)[1])=1 and (cred=$pfiltfld)]">

... is wrong and would bring a different result to the generate-id version. Therefore I have been trying with the recommended syntax thus ...

<xsl:for-each select="rep_base[count(.| key('kteam', tm)[cred=$pfiltfld][1])=1]">

... in order to group by team and filter by creditor.

But this seems to bring a 'cartesian' result, returning countless duplicates of the data rather than grouping.

Is this an issue with my count() clause? Or maybe with the key definition? Maybe I need to include all potential group fields to avoid this cartesian effect?
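
One possible cause, offered as a sketch rather than a confirmed diagnosis: when a team has no record matching the filter, key('kteam', tm)[cred=$pfiltfld][1] is an empty node-set, so count(. | empty) is 1 and the test is true for every record of that team. The generate-id() form does not behave this way, because generate-id() of an empty node-set is the empty string, which never equals the id of a real node. A guarded variant first requires the current record itself to satisfy the filter. The root element name report and the default filter value are assumptions made for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:key name="kteam" match="rep_base" use="tm"/>

  <!-- Illustrative filter value taken from the sample record -->
  <xsl:param name="pfiltfld" select="'99132630'"/>

  <!-- 'report' is a hypothetical root element name -->
  <xsl:template match="/report">
    <!-- Step 1: [cred=$pfiltfld] discards records that fail the filter, so
                 teams without any match can no longer slip through.
         Step 2: the count() test keeps only the first matching record
                 of each team. -->
    <xsl:for-each
        select="rep_base[cred=$pfiltfld]
                        [count(. | key('kteam', tm)[cred=$pfiltfld][1]) = 1]">
      <xsl:sort select="tm"/>
      <xsl:value-of select="tm"/>
      <xsl:text>: </xsl:text>
      <!-- Sum only the records of this team that satisfy the filter -->
      <xsl:value-of select="sum(key('kteam', tm)[cred=$pfiltfld]/GR_CNT)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Whether this form keeps the speed advantage of count() depends on the processor, since the filtered key lookup is still evaluated for each candidate record.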

Here is a snippet from the XML ...

<rep_base>
<P_CODE>1</P_CODE>
<GE>B</GE>
<plt>F</plt>
<tm>B-Team F</tm>
<cred>99132630</cred>
<GR_CNT>2</GR_CNT>
<GR_STUCK>504</GR_STUCK>
</rep_base>

- P_CODE is the period (always 1)
- GE is the business unit (varies)
- plt is the plant (also varies)
- tm is the team, the field which should be grouped on
- cred is the creditor, the field to be filtered on
- The other fields are data fields

I really do want to get my head around how the count() clause works (especially in a grouping context) as I believe that this could be central to much of the reporting functionality which I am building.

If you don't have time to explain/check this, then maybe you could point me towards a HOWTO ?

Many thanks,
Alan Searle.