One notorious problem with having several Mule instances consume files from the same FTP server is that two or more instances can download the same file concurrently. This leads to duplicate messages: Mule messages representing the same file on different Mule nodes.
The clustering solution offered by Mule EE 3.2 addresses this issue by having Mule instances coordinate among each other on who will download files from the FTP server.
Clustering introduces an in-memory data grid to your Mule nodes. An in-memory data grid adds significant complexity, which is fine if you intend to use it for load balancing VM queues, sharing router state, and so on. But for me, all that complexity is unnecessary if you intend to use clustering just to prevent Mule instances from grabbing the same file on an FTP server. It’s like hitting a nail with a sledgehammer.
What’s the alternative to using a cluster? The alternatives my colleagues and I came up with weren’t exactly pretty, and all of them involved tinkering inside Mule. I’ve been travelling lately, which gave me time to think more clearly about the problem. It turns out that a potentially simple solution (at the expense of a small increase in latency) lies outside of Mule:
In the setup above, the FTP proxy sitting between the Mule instances and the FTP server is responsible for ensuring that a Mule node is only ever served a ‘free’ file. Such a solution should be relatively easy to develop and add little complexity compared to the others, right? To be sure, I set about implementing the FTP proxy. Luckily, most of the work was already done for me. James Healy took the trouble of developing an event-driven FTP framework in Ruby: em-ftpd. All I had to do was add callbacks corresponding to FTP actions (e.g., ls, get, size). After a couple of hours I had a naive implementation up and running:
MuleFtpProxyDriver represents a Mule FTP connection. This means that it’s instantiated every time Mule attempts an FTP login. It declares two class variables:
- @@hidden_files is a list containing files Mule instances are downloading. As you will see, a Mule instance isn’t permitted to start a download on a file contained within the list. This removes the chance of having duplicate messages.
- @@mutex is the lock used by the MuleFtpProxyDriver instances to safely access @@hidden_files.
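To make the role of the two class variables concrete, here is a minimal sketch in plain Ruby. It is not the proxy's actual source: the class name `SketchDriver` and the `claim` helper are made up for illustration, and only the `@@hidden_files` and `@@mutex` names come from the description above. It shows why the check and the insert have to happen under the same lock.

```ruby
# Hypothetical sketch, not the real MuleFtpProxyDriver.
class SketchDriver
  # Class variables are shared by every instance, i.e. by every FTP connection.
  @@hidden_files = []
  @@mutex = Mutex.new

  # Atomically claim a path. Only the first caller gets true.
  # (claim is an illustrative helper, not an em-ftpd callback.)
  def claim(path)
    @@mutex.synchronize do
      return false if @@hidden_files.include?(path)
      @@hidden_files << path
      true
    end
  end

  # Snapshot of the shared list, taken under the lock.
  def self.hidden_files
    @@mutex.synchronize { @@hidden_files.dup }
  end
end

# Two "connections" race for the same file from separate threads.
results = 2.times.map { Thread.new { SketchDriver.new.claim("/in/a.txt") } }.map(&:value)
```

Whichever thread takes the lock first wins the file; the other sees it in the list and backs off.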
The dir_contents method returns the list of files in a directory. The proxy lists only files which are ‘free’. Therefore, Mule doesn’t attempt to download a file that is already being downloaded, since it’s hidden from view.
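The filtering could look roughly like the sketch below. I'm assuming a callback that hands its result to a block, and I've replaced the real FTP server with an in-memory array (`backend_files`) so the snippet runs on its own. `ListingSketch` is a hypothetical name.

```ruby
# Hypothetical sketch of a listing callback that hides in-flight files.
class ListingSketch
  @@hidden_files = ["/in/b.csv"] # pretend another node is downloading b.csv
  @@mutex = Mutex.new

  # backend_files stands in for the listing fetched from the real FTP server.
  def initialize(backend_files)
    @backend_files = backend_files
  end

  # Yields the names of the files in `path` that nobody is downloading.
  def dir_contents(path)
    hidden = @@mutex.synchronize { @@hidden_files.dup }
    free = @backend_files.select do |file|
      File.dirname(file) == path && !hidden.include?(file)
    end
    yield free.map { |file| File.basename(file) }
  end
end

listing = nil
driver = ListingSketch.new(["/in/a.csv", "/in/b.csv", "/out/c.csv"])
driver.dir_contents("/in") { |names| listing = names }
```

Here `listing` ends up containing only `a.csv`: `b.csv` is hidden and `c.csv` lives in another directory.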
In the get_file method, the requested file is compared against the list of hidden files. If there is a match, the proxy tells Mule that the file is unavailable. Otherwise, it transfers the file from the FTP server to Mule.
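A sketch of that check, under the same assumptions as before: the result goes to a block, `false` signals that the file is unavailable, and a plain Hash (`backend`) plays the part of the FTP server. The names `DownloadSketch` and `@claimed` are mine, not the proxy's.

```ruby
# Hypothetical sketch of a download callback that refuses hidden files.
class DownloadSketch
  @@hidden_files = []
  @@mutex = Mutex.new

  def initialize(backend)
    @backend = backend # Hash of path => contents, stands in for the FTP server
    @claimed = []      # files this connection has started downloading
  end

  # Yields the file contents, or false if the file is taken or missing.
  def get_file(path)
    claimed = @@mutex.synchronize do
      if @@hidden_files.include?(path) || !@backend.key?(path)
        false
      else
        @@hidden_files << path # hide the file before the transfer starts
        true
      end
    end

    if claimed
      @claimed << path
      yield @backend[path]
    else
      yield false
    end
  end
end

backend = { "/in/a.csv" => "id,total\n1,10\n" }
first = second = nil
DownloadSketch.new(backend).get_file("/in/a.csv") { |data| first = data }
DownloadSketch.new(backend).get_file("/in/a.csv") { |data| second = data }
```

The first connection receives the contents; the second one asks for the same path and gets `false`.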
The delete_file method, in addition to deleting the file from disk, removes the file entry from @@hidden_files. This allows a new file, with the same name as the deleted file, to be downloaded.
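As a rough illustration, again with a Hash standing in for the server's disk and hypothetical names (`DeleteSketch`, `hidden?`), the delete callback might free the name like this. Releasing the entry only when the delete succeeded is my own choice for the sketch, not something taken from the proxy.

```ruby
# Hypothetical sketch of a delete callback that frees the file name again.
class DeleteSketch
  @@hidden_files = ["/in/a.csv"] # a.csv has just been downloaded by a node
  @@mutex = Mutex.new

  def initialize(backend)
    @backend = backend # Hash of path => contents, stands in for the disk
  end

  # Deletes the file and, on success, un-hides its name. Yields true/false.
  def delete_file(path)
    removed = !@backend.delete(path).nil?
    @@mutex.synchronize { @@hidden_files.delete(path) } if removed
    yield removed
  end

  def self.hidden?(path)
    @@mutex.synchronize { @@hidden_files.include?(path) }
  end
end

backend = { "/in/a.csv" => "old contents" }
deleted = nil
DeleteSketch.new(backend).delete_file("/in/a.csv") { |ok| deleted = ok }

# A new file with the same name arrives later and is visible again.
backend["/in/a.csv"] = "new contents"
```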
The unbind method is invoked whenever the FTP connection is broken or closed. It makes visible again those files which didn’t download completely.
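One way to picture the cleanup is for each connection to remember which files it has hidden and to release them when it goes away. The sketch below assumes exactly that. `UnbindSketch`, `start_download` and `@in_progress` are illustrative names, and treating "claimed but not yet released" as "didn't download completely" is a simplification on my part.

```ruby
# Hypothetical sketch of releasing unfinished files when a connection closes.
class UnbindSketch
  @@hidden_files = []
  @@mutex = Mutex.new

  def initialize
    @in_progress = [] # files hidden on behalf of this connection
  end

  # Illustrative helper: hide a file while this connection downloads it.
  def start_download(path)
    @@mutex.synchronize { @@hidden_files << path }
    @in_progress << path
  end

  # Called when the connection is broken or closed.
  def unbind
    @@mutex.synchronize do
      @in_progress.each { |path| @@hidden_files.delete(path) }
    end
    @in_progress.clear
  end

  def self.hidden_files
    @@mutex.synchronize { @@hidden_files.dup }
  end
end

conn_a = UnbindSketch.new
conn_b = UnbindSketch.new
conn_a.start_download("/in/a.csv")
conn_b.start_download("/in/b.csv")
conn_a.unbind # connection A drops mid-transfer
```

After connection A drops, only connection B's file is still hidden, so another node can pick up `a.csv` on its next poll.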
As usual, instructions on configuring and running the FTP proxy are found in the repo. Try it out and let me know your thoughts about using an FTP proxy as an alternative to clustering.