I can't really infer your transformation rules from a single example. For instance, it's not clear whether there is always exactly one <p> element after the <caption>, or whether other inputs might have content inside the <figure> that shouldn't end up inside the <caption>. However, from the information given, Sam's code is as good as one can do.
Michael Kay
http://www.saxonica.com/
Author, XSLT Programmer's Reference and XPath 2.0 Programmer's Reference