I used Flexi-streams to read 1,000 lines from a ZSTD archive and compute the average line length. My program looks like this:
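(defun average-1000-etipitaka-flexi-streams ()
  ;; Open the compressed file as raw bytes.
  (with-open-file (f #P"etipitaka.txt.zst" :element-type '(unsigned-byte 8))
    ;; ZS yields the decompressed bytes.
    (zstd:with-decompressing-stream (zs f)
      ;; Decode the bytes as UTF-8 characters with Flexi-streams.
      (let ((s (flexi-streams:make-flexi-stream zs :external-format :utf-8)))
        (loop for line = (read-line s nil nil)
              until (null line)
              count 1 into l
              sum (length line) into c
              ;; Stop once 1,000 lines have been counted.
              when (>= l 1000) do (return (float (/ c l)))
              finally (return (float (/ c l))))))))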
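The averaging logic does not depend on how the character stream is built, so it can be tried on its own with a string stream. The following is a minimal sketch, not the code used for the measurements: `average-line-length` is a hypothetical helper name, and it accepts any character input stream plus an optional line limit.

```lisp
;; Hypothetical helper: average length, in characters, of at most LIMIT lines.
(defun average-line-length (stream &optional (limit 1000))
  (loop for i below limit                      ; stop after LIMIT lines
        for line = (read-line stream nil nil)  ; NIL at end of file
        while line
        count t into lines
        sum (length line) into chars
        ;; Guard against an empty stream to avoid dividing by zero.
        finally (return (if (zerop lines)
                            0.0
                            (float (/ chars lines))))))
```

For example, `(with-input-from-string (s (format nil "ab~%cdef~%")) (average-line-length s))` averages a 2-character line and a 4-character line. Note that `length` counts characters, not bytes.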
The text in the file etipitaka.txt.zst looks like this:
สิงสถิต เขตเมืองเวรัญชา พร้อมด้วยภิกษุสงฆ์หมู่ใหญ่ประมาณ ๕๐๐ รูป เวรัญชพราหมณ์
ได้สดับข่าวถนัดแน่ว่า ท่านผู้เจริญ พระสมณะโคดมศากยบุตร ทรงผนวชจากศากยตระกูล
ประทับอยู่ ณ บริเวณต้นไม้สะเดาที่นเฬรุยักษ์สิงสถิต เขตเมืองเวรัญชา พร้อมด้วยภิกษุสงฆ์
The average line length is 68.8 bytes.
I ran average-1000-etipitaka-flexi-streams on SBCL 2.2.5-1.1-suse on my laptop, which has a Celeron N4500. It took 1.591 seconds.
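One way to collect a timing like this in portable Common Lisp is to read the real-time clock around the call; at the REPL, the standard `time` macro prints similar information. The sketch below is an assumption about how such a measurement could be taken, not a record of how the figure above was obtained, and `elapsed-seconds` is a hypothetical helper name.

```lisp
;; Hypothetical helper: call THUNK and return the wall-clock seconds it took,
;; followed by THUNK's primary value.
(defun elapsed-seconds (thunk)
  (let* ((start (get-internal-real-time))
         (result (funcall thunk))
         (end (get-internal-real-time)))
    (values (float (/ (- end start) internal-time-units-per-second))
            result)))
```

It could be called as `(elapsed-seconds #'average-1000-etipitaka-flexi-streams)`. A single run includes one-off costs such as opening the file, so repeated runs give a steadier number.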
Then I changed the file to my-data.ndjson.zst, whose average line length is 515.5 bytes. Running average-1000-ndjson-flexi-streams took 4.411 seconds.
I also ran the same tests with my customized utf8-input-stream. Running average-1000-etipitaka-utf8-input-stream and average-1000-ndjson-utf8-input-stream took 0.019 seconds and 0.043 seconds, respectively. In these tests, utf8-input-stream was about 83X faster than Flexi-streams for short lines and about 102X faster for long lines.
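The speedup factors are the ratios of the measured times. As a small worked check (the function name `speedup` is only for illustration), dividing each Flexi-streams time by the corresponding utf8-input-stream time gives:

```lisp
;; Ratio of the slower (baseline) time to the faster (improved) time.
(defun speedup (baseline-seconds improved-seconds)
  (/ baseline-seconds improved-seconds))

(speedup 1.591 0.019) ; short lines: about 83.7
(speedup 4.411 0.043) ; long lines: about 102.6
```

Rounded down, these are the 83X and 102X figures quoted above.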