Mainly Tech projects on Python and Electronic Design Automation.

Monday, October 21, 2013

Unifying Pythons string and list sequences for fsplit

I was looking at a particular post A more functional split function by Simon Grondin and wondered what it would look like in Python.

Simon describes his function thus:
"Fsplit takes a single ‘predicate’ function as argument. It loops over the array and adds one element at a time to a ‘buffer’ array. After adding an element, it calls the predicate on the buffer. If the predicate returns true, the buffer without the last element is added to the ‘ret’ array. At the end, that ‘ret’ array is returned, filtered from any 0-length elements."
Now a Python version would take both the predicate and array as arguments and, since it is to be in a functional style I decided to make it a generator function.

fsplit_list

Simon's first fsplit function works on arrays and had as its first test to split an array of integers on the occurrence of  integer 3. The second test was to split to " get all consecutive intervals with a sum of 4 or less"

I coded that up as the following Python:

>>> def fsplit_list(predicate, array):
    buf = []
    for c in array:
        buf.append(c)
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = [] if predicate([c]) else [c]
    else:
        if buf: yield buf


>>> # Split on 3
>>> list(fsplit_list(lambda x: 3 in x, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [4, 5, 1, 2], [4, 5]]
>>> # Get all consecutive intervals with a sum of 4 or less
>>> list(fsplit_list(lambda x: sum(x) > 4, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [3], [4], [1, 2], [3], [3], [4]]
>>>

fsplit_string

SImon then needed another function in his typed language that took a predicate and a string as arguments and worked in a similar way for strings. The test in this case was to split a string into substreings with no more than two vowels in them. My Python code was the similar function fsplit_string:

>>> import re
>>> def fsplit_string(predicate, string):
    buf = ''
    for c in string:
        buf += c
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = '' if predicate(c) else c
    else:
        if buf: yield buf


>>> # String
>>> list(fsplit_string(lambda string: len(re.findall(r'[aeiou]', string)) > 2, "lorem ipsum dolor sit amet"))
['lorem ', 'ipsum d', 'olor s', 'it am', 'et']
>>>

Unification

I wanted to have one function that worked on both lists and strings but without executing separate code dependant on the type of the sequence being split. Specifically it should work equally on lists and strings without testing for sequence type.

Comparing the two functions above I saw differences in the initialization of buf; the fact that iterating through a string leads to c being a string but for a list c is not a list - it is whatever object is in the list. This last point affects how buf is extended - for strings you can use += but lists have to be appended to.

My solution with the same tests is as follows:

>>> def fsplit(predicate, sequence):
    buf = type(sequence)()
    for c in (sequence[i:i+1] for i in range(len(sequence))):
        buf += c
        if predicate(buf):
            if buf[:-1]: yield buf[:-1]
            buf = type(sequence)() if predicate(c) else c
    else:
        if buf: yield buf


>>> # Split on 3
>>> list(fsplit(lambda x: 3 in x, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [4, 5, 1, 2], [4, 5]]
>>> # Get all consecutive intervals with a sum of 4 or less
>>> list(fsplit(lambda x: sum(x) > 4, [1,2,3,4,5,1,2,3,3,4,5]))
[[1, 2], [3], [4], [1, 2], [3], [3], [4]]
>>> # String
>>> list(fsplit(lambda string: len(re.findall(r'[aeiou]', string)) > 2, "lorem ipsum dolor sit amet"))
['lorem ', 'ipsum d', 'olor s', 'it am', 'et']
>>>

That strange initialization of buf works because calling the type of a sequence gives the empty sequence of that type.
I realised that if I could create successive one member slices from a list as name c then they could be concatenated using += just like for strings hence the for statement that generates successive items from a list as one element lists - it works in a similar way for strings too - giving successive elements of a string as one element strings (but that is what the "normal" for statement did in fsplit_string).

I guess some of the fiddling about is because strings are immutable whereas lists are mutable in Python, but I wonder if there is a better/more pythonic way of writing function fsplit to the same constraints?

END.


2 comments:

  1. As you say, not really, since they aren't doing the same thing (__iadd__ on str and on list are different beasts) - though this is close enough.

    Only, `for c in (spam for i in eggs): whatever` is clearer written as `for i in eggs: c = spam; whatever`. Flat is better than nested.

    About that type(sequence)()... if practicality didn't beat purity at that moment, __new__ would be a proper classmethod, and you would be able to call simple sequence.__new__(). Unfortunately, it happened only in the alternative universe, where __new__ was invented after classmethod descriptor, so you still have to provide a type.

    Of course, you can force it to be classmethod, but then you can't simply call it, and can't monkeypatch it since str and list are builtin, so the cure is much worse then the disease.
    `classmethod(sequence.__new__).__get__(sequence)()` :-S

    ReplyDelete
    Replies
    1. I don't think I use the pattern you mention"for c in (spam for i in eggs):" in my blog post???

      Alternative universe? I find it hard to understand your second and third paragraphs.

      Delete

Followers

Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

whos.amung.us

Blog Archive