Reading a file line-by-line revisited

A frequently asked question is: how do you read a file line by line in Common Lisp?

The canonical answer, as given in Practical Common Lisp (chapter 14, "Files and File I/O"), is essentially the same as the one in the Common Lisp Cookbook ("Reading a File one Line at a Time"):

(let ((in (open "/some/file/name.txt" :if-does-not-exist nil)))
  (when in
    (loop for line = (read-line in nil)
        while line do (format t "~a~%" line))
    (close in)))

And basically it does the job.
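
One caveat: the snippet above leaves the stream open if an error is signaled before close is reached. As a minimal sketch, the same loop can be written with the standard with-open-file macro, which closes the stream on any exit. The function name print-file-lines and its line-count return value are illustrative assumptions.

```lisp
;; Hypothetical helper: the name and the line-count return value are
;; illustrative only.
(defun print-file-lines (path &optional (out *standard-output*))
  "Print every line of the file at PATH to OUT. Return the number of
lines, or NIL if the file does not exist."
  (with-open-file (in path :if-does-not-exist nil)
    ;; WITH-OPEN-FILE binds IN to NIL when the file is missing and
    ;; closes the stream on both normal and abnormal exit.
    (when in
      (loop for line = (read-line in nil)
            while line
            do (format out "~a~%" line)
            count line))))
```

This variant still signals a decoding error on malformed input; it only guarantees that the stream is closed when that happens.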

But what happens if you deal with a log that has captured random bytes from a crashing application? Let's simulate this scenario by reading from /dev/urandom. SBCL gives us the following result:

debugger invoked on a SB-INT:STREAM-DECODING-ERROR in thread
#:  :UTF-8 stream decoding error on
#:   the octet sequence #(199 231) cannot be decoded.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ATTEMPT-RESYNC   ] Attempt to resync the stream at a character boundary
                         and continue.
  1: [FORCE-END-OF-FILE] Force an end of file.
  2: [INPUT-REPLACEMENT] Use string as replacement input, attempt to resync at
                         a character boundary and continue.
  3: [ABORT            ] Exit debugger, returning to top level.

(SB-IMPL::STREAM-DECODING-ERROR-AND-HANDLE # 2)

Other Lisp implementations report similar errors. However, there is no portable way to deal with this problem: it requires platform-specific switches and boilerplate code.

For example, on SBCL it is possible to specify a replacement character in the external-format specification:

(with-open-file (in "/dev/urandom"
                      :if-does-not-exist nil
                      :external-format '(:utf-8 :replacement "?"))
  ;; read lines
)

Other Lisps require a different and incompatible external format specification.

But there are other ways to read a file line by line. cl-faster-input looks into some of them, namely:

  • A standard read-line.
  • read-line-into-sequence, suggested by Pascal Bourguignon in a cll discussion. Unlike the standard read-line, this function reads lines into a pre-allocated buffer, reducing the workload on the garbage collector.
  • read-ascii-line, which is part of the COM.INFORMATIMAGO.COMMON-LISP.CESARUM library.
  • ub-read-line-string from the ASCII-STRINGS package, which is part of the CL-STRING-MATCH library.
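
To illustrate the pre-allocated buffer idea from the list above, here is a minimal sketch. It is not the read-line-into-sequence function from the cll discussion: the names read-line-into-buffer and line-lengths are hypothetical, and the buffer is assumed to be an adjustable string with a fill pointer.

```lisp
;; Hypothetical sketch, not the function from the cll discussion.
(defun read-line-into-buffer (stream buffer)
  "Read one line from STREAM into BUFFER, an adjustable string with a
fill pointer. Return BUFFER, or NIL at end of file when nothing was read."
  (setf (fill-pointer buffer) 0)
  (loop for char = (read-char stream nil nil)
        do (cond ((null char)
                  ;; End of file: report a final unterminated line, if any.
                  (return (if (plusp (fill-pointer buffer)) buffer nil)))
                 ((char= char #\Newline)
                  (return buffer))
                 (t
                  (vector-push-extend char buffer)))))

;; Usage: one buffer is reused for every line, so no fresh string is
;; allocated per line.
(defun line-lengths (stream)
  "Return a list with the length of each line read from STREAM."
  (loop with buffer = (make-array 80 :element-type 'character
                                     :adjustable t :fill-pointer 0)
        while (read-line-into-buffer stream buffer)
        collect (length buffer)))
```

The trade-off is that the caller must finish with the buffer contents, or copy them, before reading the next line.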

See src/benchmark-read-line.lisp in the source repository for the benchmark code.

The benchmarks show that ub-read-line-string outperforms the standard read-line approach, does not require platform-specific switches, and allows trivial character substitution on the fly (such as changing the case of the text or replacing control characters).

Sample usage (from the sources):

(with-open-file (is +fname+ :direction :input :element-type 'ascii:ub-char)
    (loop with reader = (ascii:make-ub-line-reader :stream is)
       for line = (ascii:ub-read-line-string reader)
       while line
       count line))
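
To show why a byte-oriented reader avoids decoding errors, here is a minimal sketch of the general idea. It is not the ASCII-STRINGS implementation and is not tuned for speed. The name map-byte-lines and its replacement parameter are hypothetical, and the code assumes that code-char maps the codes 0 to 127 to the ASCII characters, which holds on current implementations but is not guaranteed by the standard.

```lisp
;; Hypothetical sketch: not the ASCII-STRINGS implementation.
(defun map-byte-lines (function path &key (replacement #\?))
  "Call FUNCTION on each line of the file at PATH, read as raw octets.
Octets outside the 7-bit ASCII range become REPLACEMENT."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((line (make-array 128 :element-type 'character
                                :adjustable t :fill-pointer 0)))
      (loop for octet = (read-byte in nil nil)
            do (cond ((null octet)
                      ;; End of file: flush a final unterminated line.
                      (when (plusp (fill-pointer line))
                        (funcall function (copy-seq line)))
                      (return))
                     ((= octet 10)      ; a line feed ends the line
                      (funcall function (copy-seq line))
                      (setf (fill-pointer line) 0))
                     ((< octet 128)     ; assumes ASCII-compatible codes
                      (vector-push-extend (code-char octet) line))
                     (t                 ; substitute non-ASCII octets
                      (vector-push-extend replacement line)))))))
```

Because the stream is opened with an octet element type, no external format is involved and no decoding error can be signaled. Other on-the-fly substitutions, such as down-casing with char-downcase, would go into the same cond branches.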

On the developer's desktop the benchmark takes 1.71 seconds with the standard read-line and 1.076 seconds with ub-read-line-string. Memory consumption is on par with the standard read-line, though significantly higher than with read-line-into-sequence.

On Clozure CL 1.9 the read-ascii-line benchmark fails, and ub-read-line-string falls into an infinite loop.

On Embeddable CL 16.0 all functions work, but ub-read-line-string takes almost 10 times longer to complete than any of the alternatives.

Conclusion: if you plan to process large volumes of text that may contain malformed characters, it is worth considering alternatives to the standard way of reading files line by line. Check the sources of cl-faster-input for ideas, and tweak and run the benchmarks to suit your tasks.

P.S. This post was written in September 2015 but never published. As it appeared to be pretty complete, I decided to post it now, in January 2018. Stay tuned…
