Recently when plugging two components of a high-throughput web service together, I ran into a snag. One component (a data repository) exposes an Iterator for pulling XML-formatted records out of it one by one. The other (for serving SOAP response documents) needed, ideally, something that could be wrapped in a StreamSource — i.e. an InputStream or Reader. But although these are both pull-based ways of providing (in this case) character data, they’re not compatible.
One easy option is to iterate over the whole Iterator and buffer the results in a String, and then use a StringReader. But that’s not terribly efficient, when you might well be dealing with XML documents in the 10-20MB range. So I wrote an IteratorReader class, which is a Reader that can be wrapped around any Iterator. Each time it’s read from, it pulls enough elements from the Iterator to enable the request to be fulfilled, and buffers any remainder. This keeps its memory usage down, although this of course depends on (a) the number of characters requested at once from its read method, and (b) the size of the elements coming off the Iterator. (Each element is simply converted into a String via its toString method before being stored in a character buffer.)
Surprisingly, given the vast amount of Java source out there, I couldn’t find an existing solution for this — not even in the usually comprehensive Apache Commons. The code is below, and you are free to do what you like with it, but a credit would be nice if you use it, and if you come up with any improvements I’d be interested to hear about them. In particular, I’m sure it could be optimized more, as it spends a lot of time garbage collecting in its current form. It’s pretty thoroughly tested, with an ArrayList of two million random strings as the source of the Iterator, and seems to work fine both with single-character reads and a BufferedReader wrapped round it. Actually, testing taught me some very interesting lessons, but that’s another post.
Implementation note: As well as providing the Iterator to read from, you can optionally provide an object that implements the Closeable interface. This is because in the scenario I developed this for, the Iterator in question represented a stream of objects that was being generated on-the-fly from a live database connection, and implemented Closeable as well as Iterator so the connection could be closed when necessary. I needed a way of doing this automatically from the Reader’s point of view, so when the Iterator runs out of data (hasNext returns false) the close method of the attached Closeable, if present, is called.
Download the file: IteratorReader.java.v0_1
All comments very gratefully received.
/** * IteratorReader v. 0.1 * Andrew B. Clegg */ import java.io.Closeable; import java.io.IOException; import java.io.Reader; import java.util.Iterator; public class IteratorReader extends Reader { // The iterator from which we'll read private final Iterator<? extends Object> _iterator; // Optionally, an object to close when we're done private Closeable _closeable; // Buffer to hold character pulled from iterator before they're read private char[] _leftoverCharsFromLastRead = new char[ 0 ]; // Flag to indicate when iterator is out of elements private boolean _iteratorExhausted = false; /** * Creates a new IteratorReader. * @param iterator the Iterator to read from */ public IteratorReader( Iterator<? extends Object> iterator ) { _iterator = iterator; _closeable = null; } /** * Creates a new IteratorReader whose Iterator is backed by a Closeable * object that must be cleanly closed when no longer needed. * @param iterator the Iterator to read from * @param closeable the Closeable object backing the Iterator */ public IteratorReader( Iterator<? extends Object> iterator, Closeable closeable ) { _iterator = iterator; _closeable = closeable; } /** * Closes the Closeable object on which this reader's Iterator depends. * If there is no such Closeable, or it has already been closed, this * method does nothing. This method is automatically called when Iterator's * hasNext method returns false, but can be called earlier. * @throws IOException if the Closeable encounters a problem when closing */ @Override public void close() throws IOException { if( _closeable != null ) { _closeable.close(); _closeable = null; } } /** * Reads characters into a portion of an array. See Reader. * @param outBuf array to copy the characters into * @param outBufOffset offset at which to start storing characters * @param charsRequested maximum number of characters to read * @return the number of characters read, or -1 if the end of the iterator has been reached * @throws IOException if the Closeable encounters a problem when closing */ @Override public synchronized int read( char[] outBuf, int outBufOffset, int charsRequested ) throws IOException { // System.out.format( "read called: outBufOffset=%d, charsRequested=%d\n", outBufOffset, charsRequested ); // System.out.format( "current state: _leftoverCharsFromLastRead has %d characters, _iteratorExhausted=%b\n", // _leftoverCharsFromLastRead.length, _iteratorExhausted ); // Have we already read enough characters from the iterator to feed this request? if( charsRequested <= _leftoverCharsFromLastRead.length ) { // Yes, we already have enough characters, copy them into output buffer System.arraycopy( _leftoverCharsFromLastRead, 0, outBuf, outBufOffset, charsRequested ); // Are there any left over? int remainder = _leftoverCharsFromLastRead.length - charsRequested; assert( remainder >= 0 ); if( remainder > 0 ) { // Copy remaining characters to new buffer (i.e. shrink buffer) char[] tempBuf = new char[ remainder ]; System.arraycopy( _leftoverCharsFromLastRead, charsRequested, tempBuf, 0, remainder ); _leftoverCharsFromLastRead = tempBuf; } else { // None left over, so reset buffer to zero-length _leftoverCharsFromLastRead = new char[ 0 ]; } // Return the number of characters read // (in this case, all the characters requested) return charsRequested; } else { // We have been asked for more characters than we currently have, so we // can return what we have (if there are no more in the iterator) or // try to acquire more from the iterator // If iterator is exhausted and read has been called again, clean up and // return straight away, after copying as many characters as we have left if( _iteratorExhausted ) { int charsAvailable = _leftoverCharsFromLastRead.length; if( charsAvailable == 0 ) { // Nothing in the iterator or the buffer, we're done return -1; } else { // Copy what we have into output buffer System.arraycopy( _leftoverCharsFromLastRead, 0, outBuf, outBufOffset, charsAvailable ); // Clean up our own buffer and return number of characters copied _leftoverCharsFromLastRead = new char[ 0 ]; return charsAvailable; } } else { // There's still data in the iterator, so we can attempt to satisfy the whole request // by doing another read -- open a stringbuilder of the desired length StringBuilder sb = new StringBuilder( charsRequested ); // Insert however many characters we do have and reset our buffer to zero-length if( _leftoverCharsFromLastRead.length > 0 ) { sb.append( _leftoverCharsFromLastRead ); _leftoverCharsFromLastRead = new char[ 0 ]; } int charsStillRequired = charsRequested - _leftoverCharsFromLastRead.length; // Iteratively add new strings until no more characters are required while( charsStillRequired > 0 && !_iteratorExhausted ) { // Read another string from the underlying iterator String string = nextString(); // Add it to stringbuffer sb.append( string ); // Adjust number still required charsStillRequired = charsStillRequired - string.length(); } // Did we read to the end of the iterator? if( _iteratorExhausted ) { // We have read all the strings from the iterator, but can only return // as many characters as we managed to read, or as many as were requested, // whichever is lower int charsObtained = sb.length(); char[] tempBuf = sb.toString().toCharArray(); // charsToReturn is the number of chars requested, or obtained, whichever is lower int charsToReturn = Math.min( charsRequested, charsObtained ); // Copy this many characters into output buffer System.arraycopy( tempBuf, 0, outBuf, outBufOffset, charsToReturn ); // Do we have any left over in our buffer? if( charsObtained > charsRequested ) { // Yes -- more obtained than requested -- save them for next request int charsToSave = charsObtained - charsRequested; assert( charsToSave + charsToReturn == tempBuf.length ); _leftoverCharsFromLastRead = new char[ charsToSave ]; System.arraycopy( tempBuf, charsToReturn, _leftoverCharsFromLastRead, 0, charsToSave ); } if( charsObtained == 0 ) { // No characters left in buffer or iterator; return -1 immediately _leftoverCharsFromLastRead = new char[ 0 ]; return -1; } else { // There are some remaining in buffer for next time, so just return // the number we acquired this time return charsToReturn; } } else { // sb now contains text to return, and there are more strings to iterate through. // We can save a bit of effort by putting the entire contents of sb into // our 'leftover' characters buffer, and calling this method again to copy it over _leftoverCharsFromLastRead = sb.toString().toCharArray(); return read( outBuf, outBufOffset, charsRequested ); } } } } private String nextString() throws IOException { // This should never get called after _iteratorExhausted has been set assert( !_iteratorExhausted ); if( _iterator.hasNext() ) { return _iterator.next().toString(); } else { _iteratorExhausted = true; close(); return ""; } } }
{ 17 } Comments
Have you tried XML pull parsers? They work really well, and provide a nice API.
You mean as in StAX for example — XMLStreamReader and XMLEventReader? Yes, I use them quite a lot, however in this instance I specifically needed something that could wrap an Iterator. This is for the FuncNet project I was talking about in Uppsala last year…
Scenario: a data access object is iterating over a large result set from a database, and for each row, converting it into a simple XML representation. Meanwhile I’ve built the header for a SOAP response using a JAX-WS Provider. If I can give the Provider an InputStream or Reader view of the data, then I can stream it straight out of the database and into the SOAP response without caching it anywhere (important if you want to be able to return 10-20MB in each response).
Unfortunately JAX-WS doesn’t let you build a Provider<XMLStreamReader> or Provider<XMLEventReader>, and neither of those classes actually implements Reader so they can’t be used as the seed for a StreamSource… Unless I’ve missed something.
(It’s annoyed me on several occasions that an XMLStreamReader isn’t a Reader…)
Great post Andrew – thanks for writing this. I agree with you – it looks like a gap in the JAX-WS spec. There are many cases in which a service might want to write a large response to the client via an output stream. The JAX-WS WebServiceProvider interface doesn’t support this very nicely.
I find it slightly amusing that online examples gloss over the construction of the return value:
http://java.sun.com/mailers/techtips/enterprise/2006/TechTips_July06.html
https://jax-ws.dev.java.net/2.1.5/docs/provider.html
Your solution is great if the data is coming from an Iterator, but unfortunately that doesn’t fit my situation.
I found another solution by creating a PipedReader / PipedWriter pair (using the PipedReader for the input stream and writing to the PipedWriter in a background thread) but I didn’t like using additional threads so I discarded it.
Finally I settled on a low-tech but effective solution: Implementing my SOAP service as a simple servlet (using XPath to read the input parameters, StAX to write the response, and checking for a “?wsdl” parameter to allow the client to request the WSDL). Came out surprisingly neat.
For large compressible data sets I highly recommend the servlet filter com.planetj.servlet.filter.compression.CompressingFilter. It can be used in conjunction with simple servlets, with CXF, or with any other web service stack to transparently negotiate gzip with HTTP clients that support it.
Hi!
Quite like your solution and may adapt to my particular case. I’ve found it quite tricky to find any examples on streaming web services for large amount of data…
I may have missed something, but in your example it seems that you are reading all the results from the DB before starting to stream it, which (at least in my case), can already be quite a lot of memory.
I intend to get the “nextString” to get the next result from my DAO instead.
Thanks for sharing the example!
Monica: The whole class is basically agnostic as to where the underlying iterator gets its data from. So nextString() just reads off another object from the iterator.
Whether the iterator represents a list of objects that have already been pulled from the database, or a sequence of objects that are created on demand, is of no concern to the IteratorReader :-)
e.g. if you use a Hibernate Query, object you can construct the IteratorReader around the iterator returned by the Query’s iterate() method. Then when the IteratorReader needs more data, it’ll call the iterator’s next() method, which in turn causes Hibernate to produce the next object in the query result set.
Uhmm, I see. I like that :)
Better separation of concerns…
May need to re-adapt my code again, so I take the DB knowledge out of the “IteratorReader” class.
At the moment more concerned with getting it to actually work. I must be doing something wrong with the configuration, as CXF fails to call my Provider and gives me an error on the call to invoke, grrrrrr
Hoping to get some help from the mailing list… I’m probably missing something but there’s not really much info about streaming web services *sigh*
Yeah I saw your message — sadly I’ve never hit that exception before, and the CXF config file looks pretty similar to mine.
Best thing I can think of is to try to boil it down to a really simple example that causes the same error, without any of your business logic behind it, and then post it (pref. on the CXF JIRA) so people can have a go at it themselves.
Hi Andrew,
Have you come across with your code to an error like this?
com.ctc.wstx.exc.WstxEOFException: Unexpected EOF in CDATA section at [row,col {unknown-source}]: [1,17055]
I think it has got till the end of the stream, but it has not detected the end of the cdata section even it being there.
You think it can be problematic if the cdata goes across several reads?
Thanks,
Monica
Hmm, not seen that, but then none of my services use CDATA sections. I’ve seen
Unexpected EOF in prolog
when you try to parse XML from an empty stream (e.g. when a buggy service doesn’t return anything).
What’s your setup like? I’m guessing you have an Iterator that’s yielding XML, then an IteratorReader wrapping that, then wrapping the IteratorReader in a StreamSource to return from a webservice. Is that right?
You can check the IteratorReader is working correctly by reading from it into a string, and then comparing the contents of the string with the original XML (ignoring whitespace differences etc.). That’s basically how my unit tests work. If they turn out the same, the bug’s not in my code :-)
:), I’m not saying your code is buggy, most likely something I’ve done :)
Just wondered if you had seen it.
I’ve removed the CDATA section and I get it now with the normal XML:
Caused by: com.ctc.wstx.exc.WstxParsingException: Unexpected close tag ; expected .
at [row,col {unknown-source}]: [1,536722333]
So it looks like somehow sometimes it gets confused, or it misses some characters … Thought it does not happen all the time, the same request will always consistently fail.
Yes, my setup is very similar to what you describe. I’ll try to test it better.
Thanks,
Monica
“Iām not saying your code is buggy” …
I’m not saying it *isn’t*, though :-)
If you do find an input XML document that looks different after going through the iterator, please let me know! A good way to test it would be to put all of the strings that make up one of the problematic XML docs into an array or list, wrap that object’s Iterator in the IteratorReader, then read from that into a StringBuffer until exhausted. Then compare the contents of the StringBuffer with the original strings.
Or, if it doesn’t contain confidential info, send me the document and I can test it here…
Thanks!
There’s a lot of things you can do to make this more efficient. One, you don’t need to keep allocating arrays in tempbuf! You can just work from the string created from the iterated object.
You can also remove the complex part of the read() method because the read() method specification in java.io.Reader does not require the input buffer to be filled. You can even return 0-length outputs. If you want output buffers maximally filled for some reason (though clients should be careful not to expect that of Reader instances), you can wrap in a BufferedReader, which’ll do the buffering you’re doing a bit more efficiently.
Your read() method doesn’t need to declare a thrown IOException.
What happens with iterators that return null from next() [it's legal, because collections can have null members]? Should we get null pointer exceptions, or iterate to the next object?
To make the read() method more robust to buggy clients, you should make sure to verify that the args are legal and fail early if they’re not. Implementations like BufferedReader throw an undocumented IndexOutOfBoundsException, doing some redundant checking along the way:
if ((off cbuf.length)
|| (len cbuf.length)
|| ((off + len) < 0))
What were they thinking? The following's enough:
(off < 0 || len cbuf.length)
I’ll e-mail you a simpler version with the arg tests and null handling; I don’t think it’ll fit in the comment :-)
Thanks for the Friday afternoon programming puzzle; this is better than TopCoder!
Andrew,
I am interested in the way you implemented to deal with the large data set. I am currently in a project to create a CXF-based webservice to retrieve all genome features based on a taxonomyID. For some reason,the output can not be paged or indexed and can not random accessed which means we need to retrieve all at one time. The return (as an ArrayList) is supposed to bind to XML and send back to user. However it is not possible due to the size and get outofmemory exception all the time. I looked through your idea and found it is so promising to apply it in my project. I am new to webservice technology. Could you please send me some sample code like provider you implemented then I can use the sample code to test in my site.
Thanks in advance.
Tian
Hi Tian,
Have a look at BioMinerPredictorProvider.java …
This invokes a DAO object which queries a database for XML directly, returning each matching row from the DB as a String containing the XML pre-formatted for returning in the web service. This is in the form of a List, although it could be any sort of Iterator, but bear in mind you don’t want the DAO to keep the database connection open longer than necessary.
Then it wraps this List in an IteratorReader, providing a header and footer with the opening and closing response element tags, wraps the IteratorReader in a StreamSource, and returns that to CXF.
You can pretty much follow this pattern, and just drop in your own DAO which queries your DB and returns strings containing records as XML elements.
There’s lots of things slightly non-perfect about the code, both of this class and the IteratorReader (see Bob’s comments above), but it’s well-tested and works well in production, so I haven’t been motivated to go back and fiddle any more.
Hi, Andrew,
Thank you so much for the help. I will look into the sample code and try to adapt my DAO into it and take a test. I will let you know the result.
Thanks again.
Dear Andrew,
Thanks again for all your help. I still have some problem. I tried to search web to get more information how to use Provider but I can not find a suitable case. I create a regular Webservice and the interface as below:
@WebService
@XmlSeeAlso({Object[].class, java.lang.Object.class})
public interface PatricDatabaseService {
public String getTaxonomyFromNCBITaxonId(@WebParam(name=”ncbiTaxonId”) String ncbiTaxonId);}
public String getTaxonomyFromPatricTaxonId(@WebParam(name=”patricTaxonId”) String patricTaxonId);
And this webservice is deployed on jboss portal and I have a webservice client to invoke this service to pass for example 8333 and return a string. (For test StreamSource purpose), I create a StreamSourceProvider service based on the provider code you send to me. I don’t know how to invoke the Provider service to pass information in and get response out. Could you please give me some information or sample code on this?
Thanks in advance.
Tian
Generally the way Providers work is that one class (implementing Provider) implements all the operations exposed by one web service, all via a single invoke() method. So if the service exposes multiple operations, the invoke() method has to look inside the XML payload to determine what operation to carry out.
In the case of document/literal services, this means looking at the name of the outermost XML element in the payload, as this corresponds to the name of the operation. So your invoke method would have to look at this element and see if it was called GetTaxonomyFromNCBITaxonId or GetTaxonomyFromPatricTaxonId and take whichever action was appropriate.
Some further documentation:
http://cwiki.apache.org/CXF20DOC/provider-services.html
http://java.sun.com/mailers/techtips/enterprise/2006/TechTips_July06.html
For the implementation class, you use @WebServiceProvider instead of @WebService. That’s all covered in the CXF docs. I don’t know how to deploy them to JBoss Portal though, sorry.
{ 1 } Trackback
[...] Highlighter Plus: Supplies the lovely code formatting like you see on this page, and supports a whole bunch of [...]
Post a Comment