
Parsing a CSV file in PL/SQL

The ability to parse a CSV file in PL/SQL seems like a simple requirement, and one would think that you could either a) easily implement it yourself, or b) find some examples of it on the web.  Well, if you have tried option A, you probably realized it gets really tricky when you have commas and double quotes in your actual data as well as having them as your delimiters and optionally-enclosed-by characters.  Plus, all that substr'ing and instr'ing can really hurt your head after a while.  If you tried option B, then you probably discovered that there are some solutions out there, but they all seem to be either incomplete or just overly complex.

So I decided to write my own simple, yet complete, CSV parser in PL/SQL.  It handles all data, whether optionally enclosed by some character or not, as well as both DOS (CR+LF) and UNIX (LF only) end-of-line file formats.  And all this in less than 100 lines of code (with comments) and with only three distinct calls to substr() and NO calls to instr().

I wanted to share this in hopes that others find it useful.

create or replace procedure parse_csv(
  p_clob clob,
  p_delim varchar2 default ',',
  p_optionally_enclosed varchar2 default '"' ) is
  --
  CARRIAGE_RETURN constant char(1) := chr(13);
  LINE_FEED constant char(1) := chr(10);
  --
  l_char char(1);
  l_lookahead char(1);
  l_pos number := 0;
  l_token varchar2(32767) := null;
  l_token_complete boolean := false;
  l_line_complete boolean := false;
  l_new_token boolean := true;
  l_enclosed boolean := false;
  --
  l_lineno number := 1;
  l_columnno number := 1;

begin

  loop
    -- increment position index
    l_pos := l_pos + 1;

    -- get next character from clob
    l_char := dbms_lob.substr( p_clob, 1, l_pos);

    -- exit when no more characters to process
    exit when l_char is null or l_pos > dbms_lob.getLength( p_clob );

    -- if the first character of a new token is the optionally enclosed character,
    -- note that, skip it, and get the next character
    if l_new_token and l_char = p_optionally_enclosed then
      l_enclosed := true;
      l_pos := l_pos + 1;
      l_char := dbms_lob.substr( p_clob, 1, l_pos);
    end if;
    l_new_token := false;

    -- get look ahead character
    l_lookahead := dbms_lob.substr( p_clob, 1, l_pos+1 );

    -- inspect character (and lookahead) to determine what to do
    if l_char = p_optionally_enclosed and l_enclosed then

      if l_lookahead = p_optionally_enclosed then
        l_pos := l_pos + 1;
        l_token := l_token || l_lookahead;
      elsif l_lookahead = p_delim then
        l_pos := l_pos + 1;
        l_token_complete := true;
      else
        l_enclosed := false;
      end if;

    elsif l_char in ( CARRIAGE_RETURN, LINE_FEED ) and NOT l_enclosed then
      l_token_complete := true;
      l_line_complete := true;

      if l_lookahead in ( CARRIAGE_RETURN, LINE_FEED ) then
        l_pos := l_pos + 1;
      end if;

    elsif l_char = p_delim and not l_enclosed then
      l_token_complete := true;

    elsif l_pos = dbms_lob.getLength( p_clob ) then
      l_token := l_token || l_char;
      l_token_complete := true;
      l_line_complete := true;

    else
      l_token := l_token || l_char;
    end if;

    -- process a new token
    if l_token_complete then
      dbms_output.put_line( 'R' || l_lineno || 'C' || l_columnno || ': ' ||
                            nvl(l_token,'**null**') );
      l_columnno := l_columnno + 1;
      l_token := null;
      l_enclosed := false;
      l_new_token := true;
      l_token_complete := false;
    end if;

    -- process end-of-line here
    if l_line_complete then
      dbms_output.put_line( '-----' );
      l_lineno := l_lineno + 1;
      l_columnno := 1;
      l_line_complete := false;
    end if;
  end loop;
end parse_csv;
/

And here is a little test procedure to show it working. I have made the end-of-line different for each line to demonstrate that this will work with all EOL terminators. In real life (I hope) your CSV file will have just one.

declare
  l_clob clob :=
    -- DOS EOL
    'A,B,C,D,E,F,G,H,I' || chr(13) || chr(10) ||
    -- Apple up to OS9 EOL
    '1,"2,3","1""2","""4,",",5"' || chr(13) ||
    -- Acorn BBC and RISC OS EOL
    '6,"this is a ""test",""",8","9"",","10,"""' || chr(10) || chr(13) ||
    -- Unix and OS X EOL
    'normal,"commas,,,in the field","""enclosed""","random "" double "" quotes","commas,,, "" and double """" quotes"' || chr(10) ||
    -- Line with EOF only
    '",F""",,,,abcde';
begin
  parse_csv( l_clob );
end;
/

And when I run it I get…

R1C1: A
R1C2: B
R1C3: C
R1C4: D
R1C5: E
R1C6: F
R1C7: G
R1C8: H
R1C9: I
-----
R2C1: 1
R2C2: 2,3
R2C3: 1"2
R2C4: "4,
R2C5: ,5
-----
R3C1: 6
R3C2: this is a "test
R3C3: ",8
R3C4: 9",
R3C5: 10,"
-----
R4C1: normal
R4C2: commas,,,in the field
R4C3: "enclosed"
R4C4: random " double " quotes
R4C5: commas,,, " and double "" quotes
-----
R5C1: ,F"
R5C2: **null**
R5C3: **null**
R5C4: **null**
R5C5: abcde
-----

I think I have covered all the bases and possibilities for parsing a CSV file. You can easily modify the code to store the tokens as rows in a table or push them into an Apex collection for further processing later. I just used dbms_output.put_line() to show it working.
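
For example, here is a minimal sketch of the table-based variant, assuming a hypothetical staging table called csv_staging (it is not part of the original code); only the "process a new token" block of parse_csv needs to change:

-- hypothetical staging table, adjust names and types to suit:
-- create table csv_staging ( lineno number, columnno number, token varchar2(4000) );

-- inside parse_csv, swap the dbms_output.put_line() call in the
-- "process a new token" block for an insert:
if l_token_complete then
  insert into csv_staging ( lineno, columnno, token )
  values ( l_lineno, l_columnno, l_token );

  l_columnno := l_columnno + 1;
  l_token := null;
  l_enclosed := false;
  l_new_token := true;
  l_token_complete := false;
end if;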

Give it a try and let me know if you find a case that this code does not handle.

Enjoy.

  1. Dan
    April 3, 2012 at 6:29 pm

    Hi Chris,

    Did you see that APEX 4.1 now has this built in? It’s called Data Upload and it’s awesome
    !
    http://www.oracle.com/technetwork/developer-tools/apex/application-express/apex-41-new-features-459652.html#data_upload

    Regards,
    Dan

  2. Christopher Beck
    April 3, 2012 at 6:35 pm

    Hey Dan,

    Yup, I saw it and have used it in the past. But what I am currently working on could not take advantage of it, so I needed to implement my own parser for non-apex related projects. I know, *gasp*, non-apex? Is there such a thing? ;-)

    chris.

  3. April 3, 2012 at 7:33 pm

    Why is p_clob a varchar2 and not a clob, especially since you use dbms_lob calls on it later? You'll be forcing the weight of implicit conversions for LOB manipulations but still be restricted to 32K of input:

    p_clob varchar2,

    • Christopher Beck
      April 3, 2012 at 7:35 pm

      Oops. Nice catch. Just a typo on my part. It should be a clob. I'll fix it. Thanks for pointing that out.

  4. April 3, 2012 at 7:53 pm

    Glad to help. One thing I did for my CLOB parser was to pull out 32K varchar2 chunks and parse those using varchar2 functions rather than dbms_lob functions.

    I had to code extra steps to look for delimiters within my chunk so I wouldn't accidentally split a field, but even with that extra work I got a 20-25x speed improvement.

    So yes, it would add complexity and force additional instr/substr calls, but you might want to consider it if you have to do lots of parsing on large targets.
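
    A rough sketch of that chunking idea (all names here are illustrative only, and it assumes no single record is longer than one chunk and that enclosed fields do not contain line breaks):

      declare
        l_clob   clob;                 -- the CSV to parse (assumed already populated)
        l_amt    pls_integer;          -- how much to read on this pass
        l_chunk  varchar2(32767);
        l_carry  varchar2(32767);      -- trailing partial record carried forward
        l_pos    pls_integer := 1;
        l_break  pls_integer;
      begin
        while l_pos <= dbms_lob.getlength( l_clob ) loop
          -- read enough to fill the buffer, leaving room for the carried-over tail
          l_amt   := 32000 - nvl( length( l_carry ), 0 );
          l_chunk := l_carry || dbms_lob.substr( l_clob, l_amt, l_pos );
          l_pos   := l_pos + l_amt;

          -- only hand complete lines to the varchar2-based parsing; carry the rest forward
          l_break := instr( l_chunk, chr(10), -1 );
          if l_break > 0 then
            l_carry := substr( l_chunk, l_break + 1 );
            l_chunk := substr( l_chunk, 1, l_break );
            -- parse l_chunk here with plain varchar2 substr/instr logic
          else
            l_carry := l_chunk;        -- no complete line yet, keep accumulating
          end if;
        end loop;
        -- anything left in l_carry is the final, unterminated line
      end;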

  5. April 4, 2012 at 3:00 am

    Another way to skin this cat is to use External Tables, e.g.:

    CREATE TABLE mytable
    ( FIELD01 VARCHAR2(1000 BYTE)
    , FIELD02 VARCHAR2(1000 BYTE)
    , FIELD03 VARCHAR2(1000 BYTE)
    , FIELD04 VARCHAR2(1000 BYTE)
    , FIELD05 VARCHAR2(1000 BYTE)
    )
    ORGANIZATION EXTERNAL
    (TYPE ORACLE_LOADER
    DEFAULT DIRECTORY MYDIR
    ACCESS PARAMETERS
    (RECORDS DELIMITED BY X'0D0A'
    LOGFILE MYLOGDIR:'myfile.csv.log'
    BADFILE MYLOGDIR:'myfile.csv.bad'
    FIELDS LDRTRIM
    MISSING FIELD VALUES ARE NULL
    REJECT ROWS WITH ALL NULL FIELDS
    )
    LOCATION (MYDIR:'myfile.csv')
    )
    REJECT LIMIT UNLIMITED;

    Mind you, this method requires you to know what the line endings will be ahead of time.

    • Christopher Beck
      April 4, 2012 at 11:59 am

      Jeffrey,

      I love external tables and have used them plenty, but they are limited in that they are static. They need to be created prior to loading. You need to have the CREATE DIRECTORY privilege to access the file. You need a way to get the file from the user to the filesystem of the database server. UTL_FILE could write it out once it is in the database, sure, but in many cases that is not allowed by the customer. I don't think you could use external tables on apex.oracle.com. So I offered this up as a generic solution that works in any case. If you have all the access to the server you need, and you know more about the CSV being parsed, then external tables may be the way to go. But in the generic case, they are not really feasible.

  6. Rich
    April 4, 2012 at 4:09 am

    What would the advantage of this be over simply using an external table to process the csv file?

    • Christopher Beck
      April 4, 2012 at 12:18 pm

      Rich,

      See my response to Jeffrey above. External tables are fine if you have the access and privileges to create them. In many cases, customers will not allow either.

  7. April 4, 2012 at 7:47 am

    Hi Chris, check out my blog post on this topic from a couple of years ago:

    http://ora-00001.blogspot.com/2010/04/select-from-spreadsheet-or-how-to-parse.html

    The latest version of the code can be found in the Alexandria Utility Library for PL/SQL:

    http://code.google.com/p/plsql-utils/

    - Morten

    • Christopher Beck
      April 4, 2012 at 12:14 pm

      Morten,

      I don't believe your code handles optionally enclosed values, but only comma-separated values. I needed a parser that accounted for CSV data that was optionally enclosed with some character, delimited by another, and where the data within each cell could contain either or both of those characters as well. So although I did find your code when looking for solutions on the web, the parser was just too limited for my needs.

      • April 4, 2012 at 12:39 pm

        Chris,

        As I pointed out, the latest version of the csv_util_pkg package can be found in the Alexandria library, and this does indeed support optionally enclosed values.

        I just tested it with your example data:

        select *
        from table(csv_util_pkg.clob_to_csv('normal,"commas,,,in the field","""enclosed""","random "" double "" quotes","commas,,, "" and double """" quotes"'))

        And this splits the data into 5 columns:

        c001 = normal
        c002 = commas,,,in the field
        c003 = "enclosed"
        c004 = random " double " quotes
        c005 = commas,,, " and double "" quotes

        (I suppose I should remove the older code from the blog post and just direct people to download the latest library code.)

        - Morten

      • Christopher Beck
        April 4, 2012 at 1:33 pm

        Morten, yeah, that would have been better. The blog examples seem to show it only handling the simple cases. I wish I had realized. I probably would not have written my version then. Oh well. Thanks for pointing that out.

  8. July 11, 2012 at 1:03 pm

    This was very helpful! Thanks Chris!

  9. July 11, 2012 at 2:41 pm

    My file upload procedure creates a BLOB data type. Does this procedure require a CLOB? If so, how would I convert it?

    • Christopher Beck
      July 11, 2012 at 3:47 pm

      You could use a function like this to convert it:

        function blob_to_clob( p_lob in blob ) return clob is
           l_clob_result  clob := 'X';
           l_dest_offset  integer := 1;
           l_src_offset   integer := 1;
           l_lang_context integer := dbms_lob.default_lang_ctx;
           l_warning      integer;
        begin
           -- convert the BLOB to a CLOB using the database character set
           if p_lob is not null and dbms_lob.getlength(p_lob) > 0 then
              dbms_lob.converttoclob(dest_lob     => l_clob_result,
                                     src_blob     => p_lob,
                                     amount       => dbms_lob.lobmaxsize,
                                     dest_offset  => l_dest_offset,
                                     src_offset   => l_src_offset,
                                     blob_csid    => dbms_lob.default_csid,
                                     lang_context => l_lang_context,
                                     warning      => l_warning);
              if l_warning != 0 then
                 dbms_output.put_line('Function blob_to_clob warning:' || l_warning);
                 return null;
              end if;
              return l_clob_result;
           else
              return null;
           end if;
        exception
           when others then
              dbms_output.put_line('Function blob_to_clob error:' || SQLCODE);
              return null;
        end blob_to_clob;
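
      And then, as a rough sketch of wiring the two together (assuming blob_to_clob is compiled as a standalone function, and using a purely hypothetical my_uploaded_files table as a stand-in for wherever your upload procedure stores the BLOB):

        declare
           l_blob blob;
           l_clob clob;
        begin
           -- fetch the uploaded file; replace this with your own source of the BLOB
           select blob_content
             into l_blob
             from my_uploaded_files
            where filename = 'myfile.csv';

           l_clob := blob_to_clob( l_blob );
           if l_clob is not null then
              parse_csv( l_clob );
           end if;
        end;
        /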
      
  10. November 17, 2012 at 8:11 am

    Hi Chris,

    First, thanks for sharing this piece of code. It has come in useful for me. I have one remark, though.
    According to the definition of CSV, it is possible to enclose CR and LF characters within a field. When I offer such content to the parser, it creates a new line at the point of the CR and LF.
    Does your parser support this CSV feature?

    Dick

    • Christopher Beck
      February 21, 2013 at 12:14 am

      Just fixed it. I modified line 56.

  11. Danny Mathis
    February 14, 2013 at 1:26 am

    Hey Chris,

    Awesome code! Though it doesn’t look like it correctly parses multi-line values. For instance:

    l_clob :=
    'a,b,c,"d' || chr(13) || chr(10) || 'more", e';

    The values should be: (a), (b), (c), (dCRLFmore), (e)
    But are instead: (a), (b), (c), (d), (more"), (e)

    I believe that's legal, according to this pseudo-spec: http://tools.ietf.org/html/rfc4180#page-2

    Let me know if you think I’m mistaken.

    Thanks again!
    - Danny

    • Christopher Beck
      February 21, 2013 at 12:10 am

      Good catch. Seems I missed that possibility. But it’s an easy fix. Just change line 56 to


      elsif l_char in ( CARRIAGE_RETURN, LINE_FEED ) and NOT l_enclosed then

      That will then handle it.


      1 declare
      2 l_clob clob :=
      3 'a,b,c,"d' || chr(13) || chr(10) || 'more", e';
      4 begin
      5 parse_csv( l_clob );
      6* end;
      SQL> /
      R1C1: a
      R1C2: b
      R1C3: c
      R1C4: d
      more
      R1C5: e
      -----

      PL/SQL procedure successfully completed.

  12. March 18, 2014 at 2:36 pm

    It is not working when the last token on a line is enclosed in double quotes: the parser will not move on to the next row, and the token will contain the EOL; see this example:

    set serveroutput on
    declare
    p CSV_PARSER;
    l_clob clob :=
    '"Email","Phone",Manager Email'||chr(10)||
    '"lucas.adam@oracle.com","+420221438013","fabio.x.gorbi@oracle.com"'||chr(10)||
    '"dila.x.akyuz@oracle.com","+420220437022","anna.x.petrosyan@oracle.com"'||chr(10)||
    '"sonia.albert@oracle.com","+40213678683","adina.mandaghios@oracle.com"'||chr(10)||
    '"alice.alecu@oracle.com","+40213677118","mirela.popovici@oracle.com"'||chr(10)||
    '"sofia.alexe@oracle.com","+40213678562","khaled.radwan@oracle.com"'||chr(10)||
    '"danciu.anca.elena@oracle.com"," +40 21 367 8461","alexandra.popa@oracle.com"'||chr(10)
    ;
    begin

    parse_csv( l_clob );
    end;

    First line is ok, but then everything is on one line.

    This helped to fix it:

    -- add an additional branch after line 53 to handle enclosed tokens at end of line
    elsif l_lookahead IN ( CARRIAGE_RETURN, LINE_FEED ) THEN
    l_enclosed := false;
    END IF;

    • Christopher Beck
      March 18, 2014 at 3:33 pm

      Great catch. Not sure how this was not discovered before. But the check after line 53 should just be an else:

      else
      l_enclosed := false;

      It doesn't matter what l_lookahead is at that point. If l_char = p_optionally_enclosed and l_lookahead is neither p_optionally_enclosed nor p_delim, then you are no longer enclosed, so l_enclosed := false.

  13. March 28, 2014 at 5:34 pm

    Thanks

  14. shailesh
    April 1, 2014 at 4:01 pm

    Thanks for the amazing code. Just one question: will this process a CSV file with a blank line between the records? Is that a valid scenario?
    Thanks.

    • Christopher Beck
      April 1, 2014 at 5:59 pm

      Yes.

      SQL>declare
      2 l_clob clob :=
      3 'A,B,C' || chr(13) || chr(10) ||
      4 chr(10) || chr(13) ||
      5 'D,E,F' || chr(10) || chr(13);
      6 begin
      7 parse_csv( l_clob );
      8 end;
      9 /

      R1C1: A
      R1C2: B
      R1C3: C
      -----
      R2C1: **null**
      -----
      R3C1: D
      R3C2: E
      R3C3: F
      -----

      PL/SQL procedure successfully completed.

