Unzipping a File in S3 to S3
In the work I'm doing, I've got a service that returns a group of files as a standard Zip file, and I can easily store that in S3 using amazonica and a little bit of code:
(defn save!
  "Given a byte array or string `data` and a bucket `b`, saves the bytes as
  an encrypted file with file name `fname`. Optional keys `content-type` and
  `overwrite` may also be passed, where the `:content-type` key indicates the
  content type (defaulted to S3's inference), and the `:overwrite` key
  indicates whether to replace the file if it already exists (defaulted to
  false)."
  [data b fname & {:keys [content-type overwrite input-stream-size]}]
  (if (and (or (string? data) (byte-array? data) (instance? InputStream data)) b)
    (let [nm (or fname (uuid))
          echo (if-not fname {:name nm})
          [inp cnt] (cond
                      (string? data) [(ByteArrayInputStream. (.getBytes data)) (count data)]
                      (byte-array? data) [(ByteArrayInputStream. data) (count data)]
                      (instance? InputStream data) [data input-stream-size]
                      :else [nil nil])]
      (if (or overwrite (false? (file-exists? b nm)))
        (try-amzn 3
          (merge echo
                 (put-object (get-cred)
                             :bucket-name b
                             :key nm
                             :input-stream inp
                             :metadata (merge
                                         (if content-type {:content-type content-type})
                                         (if cnt {:content-length cnt})
                                         {:server-side-encryption "AES256"}))))
        (merge echo {:error "FileNotSaved"})))))
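For context, save! leans on a few things that aren't shown here: the Java classes it type-checks against, and a handful of helpers - uuid, byte-array?, file-exists?, try-amzn, and get-cred. Those names come from the code above, but their definitions don't, so here is a rough sketch of what they might look like; the s3.core namespace name is made up, and the helper bodies are my best guesses rather than the originals:

(ns s3.core
  "Hypothetical namespace holding the snippets in this post."
  (:require [amazonica.aws.s3 :refer :all])
  (:import [java.io ByteArrayInputStream InputStream]
           [java.util UUID]
           [java.util.zip ZipInputStream]))

;; amazonica accepts a simple credentials map - in real code this would
;; come from configuration rather than hard-coded environment lookups
(defn get-cred []
  {:access-key (System/getenv "AWS_ACCESS_KEY_ID")
   :secret-key (System/getenv "AWS_SECRET_ACCESS_KEY")
   :endpoint   "us-east-1"})

;; random key name for when no file name is supplied
(defn uuid []
  (str (UUID/randomUUID)))

;; predicate for byte arrays - newer Clojure has bytes?, but something
;; like this was probably in scope when this code was written
(def ^:private byte-array-class (class (byte-array 0)))
(defn byte-array? [x]
  (instance? byte-array-class x))

;; true if the key already exists in the bucket - a guess at the helper,
;; using a metadata lookup and treating any exception as "not there"
(defn file-exists? [b fname]
  (try
    (some? (get-object-metadata (get-cred) :bucket-name b :key fname))
    (catch Exception _ false)))

;; retry the body up to n times before letting the exception escape -
;; a sketch of what the original macro likely does
(defmacro try-amzn [n & body]
  `(loop [i# 1]
     (let [res# (try
                  [::ok (do ~@body)]
                  (catch Exception e#
                    (if (< i# ~n) [::retry nil] (throw e#))))]
       (if (= ::ok (first res#))
         (second res#)
         (recur (inc i#))))))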
But what if we need the contents of that zip file, and not just the archive itself? If we want to do it in one pass, we have to load the entire contents of the file into memory and then piece it apart. That's certainly possible, but what if the files are very large? Why not unzip the stream, and write each entry back to S3 as a stream?
Then we wouldn't need a large memory footprint to process even very large zip files. That would be nice.
(defn unzip!
  "Function to unzip the provided file in the provided bucket into the same
  S3 bucket where the directory is the name of the file - without the
  extension, and all the files from the zip file are deposited into the
  directory under their names in the zip archive. The downside of this
  function is that it has to read the zip file from S3 'n' times - once for
  each of the files in the zip archive. That means that it's not at all fast.
  This returns a sequence of all the files that have been unzipped to S3:

    [\"13B7E73B053C497D82F8FCC28FC8127F/13b7e73b053c497d82f8fcc28fc8127f.XML\"
     \"13B7E73B053C497D82F8FCC28FC8127F/Form0001.PDF\"
     \"13B7E73B053C497D82F8FCC28FC8127F/Index.ctl\"]"
  [bkt fname]
  (if (file-exists? bkt fname)
    (if-let [base (some identity (rest (re-matches #"(?i)(.*)\.zip$|(.*)\.xfr$" fname)))]
      (let [afn (atom [])
            push (fn [ze zis]
                   (if (not (.isDirectory ze))
                     (let [to-file (str base "/" (.getName ze))
                           res (save! zis bkt to-file
                                      :overwrite true
                                      :input-stream-size (.getSize ze))]
                       (if-not (:error res)
                         (swap! afn conj to-file)))))
            s (get-stream bkt fname)]
        (with-open [z (ZipInputStream. s)]
          (doseq [e (entries z)
                  :let [en (.getName e)
                        ms (get-stream bkt fname)]]
            (with-open [mis (ZipInputStream. ms)]
              (let [me (entries mis)]
                (push (first (drop-while #(not= en (.getName %)) me)) mis))))
          @afn)))))
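Two more helpers do a lot of the quiet work in unzip!: entries, which walks a ZipInputStream as a lazy sequence of ZipEntry values, and get-stream, which opens an InputStream on an S3 object. Again, the names are from the code above but the bodies are my reconstruction, not the originals:

;; lazy sequence of ZipEntry values from a ZipInputStream - each call to
;; .getNextEntry leaves the stream positioned at that entry's bytes, which
;; is what lets push hand the stream itself to save!
(defn entries [^ZipInputStream zis]
  (lazy-seq
    (when-let [e (.getNextEntry zis)]
      (cons e (entries zis)))))

;; open an InputStream on an S3 object - a sketch built on amazonica's
;; get-object; depending on the amazonica version the stream may be keyed
;; as :input-stream rather than :object-content
(defn get-stream [bkt fname]
  (:object-content (get-object (get-cred) :bucket-name bkt :key fname)))

With those in place, a call like (unzip! "my-bucket" "13B7E73B053C497D82F8FCC28FC8127F.zip") should deposit each entry under the 13B7E73B053C497D82F8FCC28FC8127F/ prefix and return the sequence of keys shown in the docstring.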
The result is something that has to re-read the stream from S3 once for each file in the zip archive, but it can then write each of those entries back to S3 as a stream. It's not terribly network-efficient, but the existing library closes the stream when it's done reading a file from it, and if it just hadn't done that, I could have done it all in one pass.
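If the uploader left the stream alone, the doseq could hold one ZipInputStream open and push each entry as it walked the archive: a single read of the zip. One way to fake that behavior (purely a thought experiment, untested against amazonica) is to hand save! a wrapper whose close is a no-op:

;; a FilterInputStream that ignores close, so a consumer that insists on
;; closing what it reads can't shut the underlying ZipInputStream - not
;; part of the original code, just an illustration of the one-pass idea
(defn non-closing [^InputStream is]
  (proxy [java.io.FilterInputStream] [is]
    (close [])))

Then push could be given (non-closing z) instead of a fresh stream per entry. In practice the n-pass version above is still the safer bet, since entry sizes read from a stream aren't always known up front and S3 wants a content length, so treat this strictly as a sketch.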
Still... this is nice. It works great, and does just what I need.