A frequently asked question is: how do you read a file line by line in Common Lisp?
The canonical answer, as given in Practical Common Lisp (chapter 14, Files and File I/O), is essentially the same as the one in the Common Lisp Cookbook (Reading a File one Line at a Time):
    (let ((in (open "/some/file/name.txt" :if-does-not-exist nil)))
      (when in
        (loop for line = (read-line in nil)
              while line
              do (format t "~a~%" line))
        (close in)))
And basically it does the job.
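As a minimal sketch, the same loop can be written with the standard WITH-OPEN-FILE macro, which closes the stream even when an error unwinds the stack. The function name COUNT-LINES is hypothetical and only serves the illustration; it counts lines instead of printing them.

```lisp
;; Hypothetical helper (not from either book): the canonical READ-LINE loop,
;; wrapped in WITH-OPEN-FILE so the stream is closed on any exit.
(defun count-lines (path)
  "Return the number of lines in the file at PATH, or NIL if it does not exist."
  (with-open-file (in path :if-does-not-exist nil)
    (when in
      (loop for line = (read-line in nil) ; NIL at end of file
            while line
            count line))))
```

Like the canonical version, this still signals a decoding error as soon as the file contains bytes that are invalid in the stream's external format.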
But what happens if you deal with a log that has captured random bytes from a crashing application? Let's simulate this scenario by reading from /dev/urandom. SBCL gives us the following result:
    debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread #:
      :UTF-8 stream decoding error on #:
        the octet sequence #(199 231) cannot be decoded.

    Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

    restarts (invokable by number or by possibly-abbreviated name):
      0: [ATTEMPT-RESYNC   ] Attempt to resync the stream at a character boundary and continue.
      1: [FORCE-END-OF-FILE] Force an end of file.
      2: [INPUT-REPLACEMENT] Use string as replacement input, attempt to resync at a character boundary and continue.
      3: [ABORT            ] Exit debugger, returning to top level.

    (SB-IMPL::STREAM-DECODING-ERROR-AND-HANDLE # 2)
The same will be reported on other Lisp implementations. However, dealing with this problem is not really portable, and requires platform-specific switches and boilerplate code.
For example, on SBCL it is possible to specify a replacement character in the external-format specification:
    (with-open-file (in "/dev/urandom"
                        :if-does-not-exist nil
                        :external-format '(:utf-8 :replacement "?"))
      ;; read lines
      )
Other Lisps require a different and incompatible external format specification.
But there are other ways to read a file line by line. cl-faster-input looks into some of them, namely:
- A standard read-line.
- read-line-into-sequence, suggested by Pascal Bourguignon in a cll discussion. Unlike the standard read-line, this function reads lines into a pre-allocated buffer, reducing the workload on the garbage collector.
- read-ascii-line, which is part of the COM.INFORMATIMAGO.COMMON-LISP.CESARUM library.
- ub-read-line-string from the ASCII-STRINGS package, which is part of the CL-STRING-MATCH library.
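To make the pre-allocated buffer idea from the list above concrete, here is a minimal sketch. It is not the read-line-into-sequence function from the cll discussion: the name READ-LINE-INTO-BUFFER, its argument order, and the choice to truncate overlong lines are all assumptions made for the illustration.

```lisp
;; Hypothetical sketch of the buffer-reuse idea; not the function
;; benchmarked in cl-faster-input.
(defun read-line-into-buffer (buffer stream)
  "Fill BUFFER, a string with a fill pointer, with the next line of STREAM.
Returns BUFFER, or NIL at end of file. Characters that do not fit into
BUFFER are discarded (an assumption of this sketch)."
  (setf (fill-pointer buffer) 0)
  (loop with capacity = (array-dimension buffer 0)
        for char = (read-char stream nil nil)
        do (cond ((null char)
                  ;; End of file: return a final unterminated line, if any.
                  (return (if (plusp (fill-pointer buffer)) buffer nil)))
                 ((char= char #\Newline)
                  (return buffer))
                 ((< (fill-pointer buffer) capacity)
                  (vector-push char buffer)))))
```

The caller allocates the buffer once, for example with (make-array 4096 :element-type 'character :fill-pointer 0), and passes it to every call, so no fresh string is allocated per line. The returned string is overwritten by the next call, so it has to be copied if it needs to be kept.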
Please check src/benchmark-read-line.lisp in the sources repository.
Benchmarks show that ub-read-line-string outperforms the standard read-line approach, does not require platform-specific switches, and allows trivial character substitution on the fly (such as changing the case of the text or replacing control characters).
Sample usage (from the sources):
    (with-open-file (is +fname+ :direction :input :element-type 'ascii:ub-char)
      (loop with reader = (ascii:make-ub-line-reader :stream is)
            for line = (ascii:ub-read-line-string reader)
            while line
            count line))
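To illustrate why an octet-based reader sidesteps decoding errors, here is a minimal sketch in plain Common Lisp. It is not the ASCII-STRINGS implementation: the names READ-OCTET-LINE and COUNT-OCTET-LINES and the rule of replacing every octet above 127 are assumptions, and the sketch relies on char codes below 128 agreeing with ASCII, which holds on the implementations mentioned in this post but is not guaranteed by the standard.

```lisp
;; Hypothetical helpers, not part of CL-STRING-MATCH.
(defun read-octet-line (stream &optional (replacement #\?))
  "Read one line from STREAM, a binary stream of (UNSIGNED-BYTE 8).
Octets above 127 are mapped to REPLACEMENT. Returns NIL at end of file."
  (let ((line (make-array 80 :element-type 'character
                             :adjustable t :fill-pointer 0)))
    (loop for octet = (read-byte stream nil nil)
          do (cond ((null octet)
                    ;; End of file: return a final unterminated line, if any.
                    (return (if (plusp (length line)) line nil)))
                   ((= octet 10)        ; linefeed terminates the line
                    (return line))
                   ((< octet 128)       ; assumed to be plain ASCII
                    (vector-push-extend (code-char octet) line))
                   (t                   ; substitute on the fly
                    (vector-push-extend replacement line))))))

;; Usage: count the lines of a file that may contain arbitrary bytes.
(defun count-octet-lines (path)
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for line = (read-octet-line in)
          while line
          count line)))
```

Because the stream is opened with an octet element type, no external format is involved and no decoding error can be signalled. The substitution branch is also where case conversion or control-character replacement could be hooked in. Unlike the buffered line reader in the library, this sketch reads one octet at a time and allocates a fresh string per line, so it says nothing about performance.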
On the developer's desktop, the benchmark takes 1.71 seconds to complete with the standard read-line, and 1.076 seconds with ub-read-line-string. Memory consumption is on the same level as the standard read-line, though significantly higher than the
On Clozure CL 1.9 the read-ascii-line benchmark fails, and ub-read-line-string falls into an infinite loop.
On Embeddable CL 16.0 all functions work, but ub-read-line-string takes almost 10 times longer to complete than any of the alternatives.
Conclusion: It may be worth looking at different approaches to reading files line by line if you plan to process large volumes of text data that may contain malformed characters. Check the sources of cl-faster-input for ideas, and tweak and run the benchmarks to suit your tasks.
P.S. This post was written in September 2015 but never published. As it appeared to be pretty complete, I decided to post it now, in January 2018. Stay tuned…