Parsing a CSV file in PL/SQL
The ability to parse a CSV file in PL/SQL seems like a simple requirement, and one would think that you could either a) easily implement it yourself, or b) find some examples of it on the web. Well, if you have tried option A, you probably realized it gets really tricky when your actual data contains the same commas and double quotes you are using as delimiters and optionally-enclosed-by characters. Plus, all that substr'ing and instr'ing can really hurt your head after a while. If you tried option B, then you probably discovered that there are some solutions out there, but they all seem to be either incomplete or just overly complex.
So I decided to write my own simple, yet complete, CSV parser in PL/SQL. It handles all data, whether optionally enclosed by some character or not, as well as both DOS (CR+LF) and UNIX (LF only) end-of-line file formats. And all this in less than 100 lines of code (with comments), with only three distinct calls to substr() and NO calls to instr().
I wanted to share this in hopes that others find it useful.
create or replace procedure parse_csv(
  p_clob                clob,
  p_delim               varchar2 default ',',
  p_optionally_enclosed varchar2 default '"' )
is
  --
  CARRIAGE_RETURN constant char(1) := chr(13);
  LINE_FEED       constant char(1) := chr(10);
  --
  l_char           char(1);
  l_lookahead      char(1);
  l_pos            number := 0;
  l_token          varchar2(32767) := null;
  l_token_complete boolean := false;
  l_line_complete  boolean := false;
  l_new_token      boolean := true;
  l_enclosed       boolean := false;
  --
  l_lineno   number := 1;
  l_columnno number := 1;
begin
  loop
    -- increment position index
    l_pos := l_pos + 1;
    -- get next character from clob
    l_char := dbms_lob.substr( p_clob, 1, l_pos );
    -- exit when no more characters to process
    exit when l_char is null or l_pos > dbms_lob.getLength( p_clob );
    -- if the first character of a new token is the optionally enclosed
    -- character, note that and skip it, then get the next character
    if l_new_token and l_char = p_optionally_enclosed then
      l_enclosed := true;
      l_pos := l_pos + 1;
      l_char := dbms_lob.substr( p_clob, 1, l_pos );
    end if;
    l_new_token := false;
    -- get look-ahead character
    l_lookahead := dbms_lob.substr( p_clob, 1, l_pos + 1 );
    -- inspect character (and lookahead) to determine what to do
    if l_char = p_optionally_enclosed and l_enclosed then
      if l_lookahead = p_optionally_enclosed then
        l_pos := l_pos + 1;
        l_token := l_token || l_lookahead;
      elsif l_lookahead = p_delim then
        l_pos := l_pos + 1;
        l_token_complete := true;
      else
        l_enclosed := false;
      end if;
    elsif l_char in ( CARRIAGE_RETURN, LINE_FEED ) and not l_enclosed then
      l_token_complete := true;
      l_line_complete := true;
      if l_lookahead in ( CARRIAGE_RETURN, LINE_FEED ) then
        l_pos := l_pos + 1;
      end if;
    elsif l_char = p_delim and not l_enclosed then
      l_token_complete := true;
    elsif l_pos = dbms_lob.getLength( p_clob ) then
      l_token := l_token || l_char;
      l_token_complete := true;
      l_line_complete := true;
    else
      l_token := l_token || l_char;
    end if;
    -- process a completed token
    if l_token_complete then
      dbms_output.put_line( 'R' || l_lineno || 'C' || l_columnno || ': ' ||
                            nvl( l_token, '**null**' ) );
      l_columnno := l_columnno + 1;
      l_token := null;
      l_enclosed := false;
      l_new_token := true;
      l_token_complete := false;
    end if;
    -- process end-of-line here
    if l_line_complete then
      dbms_output.put_line( '-----' );
      l_lineno := l_lineno + 1;
      l_columnno := 1;
      l_line_complete := false;
    end if;
  end loop;
end parse_csv;
/
And here is a little test block to show it working. I have made the end-of-line different for each line to demonstrate that this will work with all EOL terminators. In real life (I hope) your CSV file will have just one.
declare
  l_clob clob :=
    -- DOS EOL
    'A,B,C,D,E,F,G,H,I' || chr(13) || chr(10) ||
    -- Apple up to OS 9 EOL
    '1,"2,3","1""2","""4,",",5"' || chr(13) ||
    -- Acorn BBC and RISC OS EOL
    '6,"this is a ""test",""",8","9"",","10,"""' || chr(10) || chr(13) ||
    -- Unix and OS X EOL
    'normal,"commas,,,in the field","""enclosed""","random "" double "" quotes","commas,,, "" and double """" quotes"' || chr(10) ||
    -- Line with EOF only
    '",F""",,,,abcde';
begin
  parse_csv( l_clob );
end;
/
And when I run it I get…
R1C1: A
R1C2: B
R1C3: C
R1C4: D
R1C5: E
R1C6: F
R1C7: G
R1C8: H
R1C9: I
-----
R2C1: 1
R2C2: 2,3
R2C3: 1"2
R2C4: "4,
R2C5: ,5
-----
R3C1: 6
R3C2: this is a "test
R3C3: ",8
R3C4: 9",
R3C5: 10,"
-----
R4C1: normal
R4C2: commas,,,in the field
R4C3: "enclosed"
R4C4: random " double " quotes
R4C5: commas,,, " and double "" quotes
-----
R5C1: ,F"
R5C2: **null**
R5C3: **null**
R5C4: **null**
R5C5: abcde
-----
I think I have covered all the bases and possibilities for parsing a CSV file. You can easily modify the code to store the tokens as rows in a table or push them into an Apex collection for further processing later. I just used dbms_output.put_line() to show it working.
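For instance, one minimal variation is to swap the output call for an insert into a staging table (csv_data here is a hypothetical table name, just for illustration):

```sql
-- A hypothetical staging table for the parsed tokens.
create table csv_data (
  lineno   number,
  columnno number,
  token    varchar2(32767)
);

-- Then, inside parse_csv, the dbms_output.put_line() call in the
-- token-complete branch could be replaced with:
--
--   insert into csv_data ( lineno, columnno, token )
--   values ( l_lineno, l_columnno, l_token );
```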
Give it a try and let me know if you find a case that this code does not handle.
Enjoy.
Hi Chris,
Did you see that APEX 4.1 now has this built in? It's called Data Upload and it's awesome!
http://www.oracle.com/technetwork/developer-tools/apex/application-express/apex-41-new-features-459652.html#data_upload
Regards,
Dan
Hey Dan,
Yup, I saw it and have used it in the past. But what I am currently working on could not take advantage of it, so I needed to implement my own parser for non-APEX related projects. I know, *gasp*, non-APEX? Is there such a thing? 😉
chris.
Why is p_clob a varchar2 and not a clob? Especially since you use dbms_lob calls on it later. You'll be forcing the weight of implicit conversions for lob manipulation but still be restricted to 32K of input.
p_clob varchar2,
Oops. Nice catch. Just a typo on my part. It should be a clob. I'll fix it. Thanks for pointing that out.
glad to help, one thing I did for my clob parser was to pull out 32K varchar2 chunks and parse those using varchar2 functions rather than dbms_lob functions.
I had to code extra steps to look for delimiters within my chunk so I wouldn’t accidentally split a field; but, even with that extra work I got 20-25 times speed improvement.
So yes, it would add complexity and force additional instr/substr calls but you might want to consider it if you have to do lots of parsing on large targets.
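In outline, the chunking looks something like this sketch (my own reconstruction of the technique, not production code; it assumes no single line is longer than the chunk size):

```sql
declare
  c_chunk constant pls_integer := 32000;
  l_clob  clob := 'a,b,c' || chr(10) || 'd,e,f' || chr(10) || 'g,h,i';
  l_len   pls_integer := dbms_lob.getlength( l_clob );
  l_pos   pls_integer := 1;
  l_raw   varchar2(32767);
  l_buf   varchar2(32767);
  l_carry varchar2(32767);
  l_break pls_integer;
begin
  while l_pos <= l_len loop
    -- pull the next chunk, leaving room for any carried-over partial line
    l_raw := dbms_lob.substr( l_clob, c_chunk - nvl( length(l_carry), 0 ), l_pos );
    l_pos := l_pos + length( l_raw );
    l_buf := l_carry || l_raw;
    -- only parse up to the last complete line; carry the remainder forward
    l_break := instr( l_buf, chr(10), -1 );
    if l_break = 0 or l_pos > l_len then
      l_break := length( l_buf );   -- last chunk: take everything
      l_carry := null;
    else
      l_carry := substr( l_buf, l_break + 1 );
    end if;
    -- a varchar2-based line parser (hypothetical) would be called here:
    dbms_output.put_line( rtrim( substr( l_buf, 1, l_break ), chr(10) ) );
  end loop;
end;
/
```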
Hi Sean,
We used Chris's code for our development, but for the volume we're loading it's not quick enough. Do you have the code you made to pull it out in chunks that you could share?
Long shot but hope you can help.
Thanks!
Another way to skin this cat is to use External Tables, e.g.:
CREATE TABLE mytable
( FIELD01 VARCHAR2(1000 BYTE)
, FIELD02 VARCHAR2(1000 BYTE)
, FIELD03 VARCHAR2(1000 BYTE)
, FIELD04 VARCHAR2(1000 BYTE)
, FIELD05 VARCHAR2(1000 BYTE)
)
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
DEFAULT DIRECTORY MYDIR
ACCESS PARAMETERS
(RECORDS DELIMITED BY X'0D0A'
LOGFILE MYLOGDIR:'myfile.csv.log'
BADFILE MYLOGDIR:'myfile.csv.bad'
FIELDS LDRTRIM
MISSING FIELD VALUES ARE NULL
REJECT ROWS WITH ALL NULL FIELDS
)
LOCATION (MYDIR:'myfile.csv')
)
REJECT LIMIT UNLIMITED;
Mind you, this method requires you to know what the line endings will be ahead of time.
Jeffrey,
I love external tables and have used them plenty, but they are limited in that they are static. They need to be created prior to loading. You need the CREATE DIRECTORY priv to access the file. You need a way to get the file from the user to the filesystem of the database server. UTL_FILE could write it once it's in the database, sure, but in many cases that is not allowed by the customer. I don't think you could use external tables on apex.oracle.com. So I offered up this solution to serve as a generic solution in any case. If you have all the access to the server you need, and you know more about the CSV being parsed, then external tables may be the way to go. But in the generic case, they're not really feasible.
What would the advantage of this be over simply using an external table to process the csv file?
Rich,
See my response to Jeffrey above. External tables are fine if you have the access and privs to create them. In many cases, customers will not allow either.
Hi Chris, check out my blog post on this topic from a couple of years ago:
http://ora-00001.blogspot.com/2010/04/select-from-spreadsheet-or-how-to-parse.html
The latest version of the code can be found in the Alexandria Utility Library for PL/SQL:
http://code.google.com/p/plsql-utils/
– Morten
Morten,
I don't believe your code handles optionally enclosed values; it only handles comma-separated values. I needed a parser that accounted for CSV data that was optionally enclosed by some character, delimited by another, and where the data within each cell could contain either or both of those characters as well. So although I did find your code when looking for solutions on the web, the parser was just too limited for my needs.
Chris,
As I pointed out, the latest version of the csv_util_pkg package can be found in the Alexandria library, and this does indeed support optionally enclosed values.
I just tested it with your example data:
select *
from table(csv_util_pkg.clob_to_csv('normal,"commas,,,in the field","""enclosed""","random "" double "" quotes","commas,,, "" and double """" quotes"'))
And this splits the data into 5 columns:
c001 = normal
c002 = commas,,,in the field
c003 = "enclosed"
c004 = random " double " quotes
c005 = commas,,, " and double "" quotes
(I suppose I should remove the older code from the blog post and just direct people to download the latest library code.)
– Morten
Morten, yeah, that would have been better. The blog examples seem to show it only handling the simple cases. Wish I had realized. I probably would not have written my version then. Oh well. Thanks for pointing that out.
This was very helpful! Thanks Chris!
My file upload procedure creates a BLOB data type. Does this procedure require a CLOB, and if so, how would I convert it?
You could use a procedure like this to convert it.
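For example, a small wrapper around dbms_lob.converttoclob along these lines should do it (a sketch; adjust the character-set handling for your data as needed):

```sql
create or replace function blob_to_clob( p_blob blob ) return clob
is
  l_clob         clob;
  l_dest_offset  integer := 1;
  l_src_offset   integer := 1;
  l_lang_context integer := dbms_lob.default_lang_ctx;
  l_warning      integer;
begin
  -- create a session-duration temporary clob to hold the converted data
  dbms_lob.createtemporary( l_clob, true );
  dbms_lob.converttoclob(
    dest_lob     => l_clob,
    src_blob     => p_blob,
    amount       => dbms_lob.lobmaxsize,
    dest_offset  => l_dest_offset,
    src_offset   => l_src_offset,
    blob_csid    => dbms_lob.default_csid,
    lang_context => l_lang_context,
    warning      => l_warning );
  return l_clob;
end blob_to_clob;
/
```

Then parse_csv( blob_to_clob( your_blob ) ) should work directly.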
Hi Chris,
First, thanks for sharing this piece of code. It has come in useful for me. I have one remark.
In the definition of CSV it is possible to enclose CR and LF characters within a field. When I offer such content to the parser, it creates a new line at the point of the CR and LF.
Does your parser support this CSV feature?
Dick
Just fixed it. I modified line 56.
Hey Chris,
Awesome code! Though it doesn’t look like it correctly parses multi-line values. For instance:
l_clob :=
'a,b,c,"d' || chr(13) || chr(10) || 'more", e';
The values should be: (a), (b), (c), (dCRLFmore), (e)
But are instead: (a), (b), (c), (d), (more"), (e)
I believe that's legal, according to this pseudo-spec: http://tools.ietf.org/html/rfc4180#page-2
Let me know if you think I’m mistaken.
Thanks again!
– Danny
Good catch. Seems I missed that possibility. But it’s an easy fix. Just change line 56 to
elsif l_char in ( CARRIAGE_RETURN, LINE_FEED ) and NOT l_enclosed then
That will then handle it.
1 declare
2 l_clob clob :=
3 'a,b,c,"d' || chr(13) || chr(10) || 'more", e';
4 begin
5 parse_csv( l_clob );
6* end;
SQL> /
R1C1: a
R1C2: b
R1C3: c
R1C4: d
more
R1C5: e
-----
PL/SQL procedure successfully completed.
Not working when the last token is enclosed in double quotes; then it will not move to the next row, but the token will contain the EOL. See this example:
set serveroutput on
declare
p CSV_PARSER;
l_clob clob :=
'"Email","Phone",Manager Email'||chr(10)||
'"lucas.adam@oracle.com","+420221438013","fabio.x.gorbi@oracle.com"'||chr(10)||
'"dila.x.akyuz@oracle.com","+420220437022","anna.x.petrosyan@oracle.com"'||chr(10)||
'"sonia.albert@oracle.com","+40213678683","adina.mandaghios@oracle.com"'||chr(10)||
'"alice.alecu@oracle.com","+40213677118","mirela.popovici@oracle.com"'||chr(10)||
'"sofia.alexe@oracle.com","+40213678562","khaled.radwan@oracle.com"'||chr(10)||
'"danciu.anca.elena@oracle.com"," +40 21 367 8461","alexandra.popa@oracle.com"'||chr(10)
;
begin
parse_csv( l_clob );
end;
First line is ok, but then everything is on one line.
This helped to fix it:
-- add an additional branch after line 53 to handle enclosed tokens at end of line
elsif l_lookahead IN ( CARRIAGE_RETURN, LINE_FEED ) THEN
l_enclosed := false;
END IF;
Great catch. Not sure how this was not discovered before. But the check after line 53 should just be an else:
else
l_enclosed := false;
It doesn't matter what l_lookahead is at that point. If l_char = p_optionally_enclosed and l_lookahead is neither p_optionally_enclosed nor p_delim, then you are no longer enclosed, so l_enclosed := false.
Thanks
Thanks for the amazing code. Just one question: will this process a CSV file with a blank line between the records? Is that a valid scenario?
Thanks.
Yes.
SQL>declare
2 l_clob clob :=
3 'A,B,C' || chr(13) || chr(10) ||
4 chr(10) || chr(13) ||
5 'D,E,F' || chr(10) || chr(13);
6 begin
7 parse_csv( l_clob );
8 end;
9 /
R1C1: A
R1C2: B
R1C3: C
-----
R2C1: **null**
-----
R3C1: D
R3C2: E
R3C3: F
-----
PL/SQL procedure successfully completed.
Hi, I implemented this and it works great... my main problem however is speed. With a 20,000 line CSV file, it takes nearly 8 minutes to extract. What can I do to make this significantly faster? Someone mentioned varchar2 functions, but I'm not sure how you would do that.
Well, I know that in at least 12c, and I believe back in 11g as well, substr() and length() are overloaded to support clobs as well as varchar2. So instead of using dbms_lob.substr() and dbms_lob.getLength(), you could try the native built-ins and see if you get a performance increase. I'd be interested in what you find. I may give it a try myself if I can find some time.
Also, in reviewing the code, I could have been more efficient and determined the length of p_clob once, stored it in a local variable, and then referenced that variable instead of calling dbms_lob.getLength() over and over again.
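To put rough numbers on both suggestions, a throwaway timing block along these lines should work (a sketch; note that dbms_lob.substr() takes (lob, amount, offset) while the native substr() takes (string, offset, amount)):

```sql
declare
  l_clob  clob;
  l_len   pls_integer;
  l_char  varchar2(1 char);
  l_start number;
begin
  -- build a ~100K character test clob
  for i in 1 .. 1000 loop
    l_clob := l_clob || lpad( 'x', 100, 'x' );
  end loop;
  -- determine the length once, not once per character
  l_len := dbms_lob.getlength( l_clob );
  -- time dbms_lob.substr( lob, amount, offset )
  l_start := dbms_utility.get_time;
  for i in 1 .. l_len loop
    l_char := dbms_lob.substr( l_clob, 1, i );
  end loop;
  dbms_output.put_line( 'dbms_lob.substr: ' ||
                        (dbms_utility.get_time - l_start)/100 || ' secs' );
  -- time native substr( string, offset, amount )
  l_start := dbms_utility.get_time;
  for i in 1 .. l_len loop
    l_char := substr( l_clob, i, 1 );
  end loop;
  dbms_output.put_line( 'native substr:   ' ||
                        (dbms_utility.get_time - l_start)/100 || ' secs' );
end;
/
```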
Well, I just ran a test on the code as it stands above (I just commented out the dbms_output.put_line calls since I did not want to include that processing in the overall timing) for 20,000 rows, and it ran in 113 seconds on Oracle 12c, 12.1.0.2. And that is before making any of the changes.
I used…
declare
  l_clob clob;
  l_start_time number;
begin
  for i in 1 .. 20000 loop
    l_clob := l_clob || '11111,22222,333333,444444,555555,' ||
              '666666,777777,888888,999999,101010101010' ||
              'aaaaa,bbbbb,ccccc,ddddd,eeeee,fffff,ggggg' ||
              'hhhhh,iiiiiii,jj,kkkkkkkkkkkk,llllllllllll,m' || chr(10);
  end loop;
  l_start_time := dbms_utility.get_time;
  parse_csv( l_clob );
  dbms_output.put_line( (dbms_utility.get_time - l_start_time)/100 );
end;
/
Then I quickly made the changes I specified in the comments above and it ran in 69 seconds, so it looks like you will get a significant increase in performance using the built-ins. But the longer the line being processed, the longer it will take to parse, since you will perform that many more calls to substr().
Thanks for posting a very elegant solution. I can post the version of your code I updated to use a buffer, but wanted to get your approval first. I would really like your input on my buffered implementation: whether you find any potential problems, or whether you have suggestions for a better approach using a buffer.
I had to add a section to catch the case where the CR was found at the end of a buffer and the first character of the next buffer was the line feed.
running oracle 11g using virtualbox
original
converted blob - clob size: 14583797
--* line count: 88701
--* parse time: 689.42
original, changing dbms_lob.substr to substr
converted blob - clob size: 14583797
--* line count: 88701
--* parse time: 1451.97
buffered
converted blob - clob size: 14583797
--* line count: 88701
--* parse time: 13.9
Send me your code and I can take a look at it.
What email should I send it to?
christopher.beck@oracle.com
Chris-
Wondering if you had a chance to review the buffered solution Mark sent.
I’m trying to find a solution that can parse 30K rows.
Thanks,
I did and I’m working on the last of the testing and hope to have it published by end-of-week.
Chris – Can you please share the latest code that uses buffer approach?
I'm trying. I know it's been a while, but I've been swamped with work. Hoping to finish up testing and publish by end-of-week.
how wide is your data? I’ve written code to parse a clob and I’ve tested it on texts as large as the King James Bible and War and Peace. It’s not too hard to get decent performance if no single line is greater than 32K.
I just realized that’s kind of a tease. The code I’m referring to can be found here:
using the split_clob function there and str2tbl (Kyte’s or others) to parse each line into rows.
The queries below parse the bible (4,452,066 characters) in two ways: first into 100,222 lines, and then again into 851,334 words.
Of course, this is somewhat artificial because I'm not "doing" anything with the results; I'm simply pumping them as fast as sqlplus will read them.
But the point here was to illustrate the parsing speed; the performance of what is done with the data will of course vary with the app.
I hope it helps!
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 – 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> set timing on
SQL> set autotrace traceonly
SQL> set arraysize 5000
SQL> select x.* from etexts,table(split_clob(text)) x
  2  where etexts.name = 'King James Bible';
100222 rows selected.
Elapsed: 00:00:00.56
Execution Plan
----------------------------------------------------------
Plan hash value: 2639684787

-------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |            |  8168 |   845K|    32   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                      |            |  8168 |   845K|    32   (0)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL                | ETEXTS     |     1 |   104 |     3   (0)| 00:00:01 |
|   3 |   COLLECTION ITERATOR PICKLER FETCH| SPLIT_CLOB |  8168 | 16336 |    29   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("ETEXTS"."NAME"='King James Bible')
Statistics
----------------------------------------------------------
29 recursive calls
0 db block gets
219 consistent gets
1524 physical reads
0 redo size
4576710 bytes sent via SQL*Net to client
764 bytes received via SQL*Net from client
22 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100222 rows processed
SQL> select y.* from etexts,table(split_clob(text)) x,TABLE(str2tbl(x.COLUMN_VALUE,' ')) y
  2  where etexts.name = 'King James Bible';
851334 rows selected.
Elapsed: 00:00:02.76
Execution Plan
----------------------------------------------------------
Plan hash value: 3493471110

--------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |            |    66M|  6871M|   221K  (1)| 00:00:09 |
|   1 |  NESTED LOOPS                       |            |    66M|  6871M|   221K  (1)| 00:00:09 |
|   2 |   NESTED LOOPS                      |            |  8168 |   845K|    32   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL                | ETEXTS     |     1 |   104 |     3   (0)| 00:00:01 |
|   4 |    COLLECTION ITERATOR PICKLER FETCH| SPLIT_CLOB |  8168 | 16336 |    29   (0)| 00:00:01 |
|   5 |   COLLECTION ITERATOR PICKLER FETCH | STR2TBL    |  8168 | 16336 |    27   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ETEXTS"."NAME"='King James Bible')
Statistics
----------------------------------------------------------
23 recursive calls
2 db block gets
250 consistent gets
1524 physical reads
0 redo size
5356930 bytes sent via SQL*Net to client
2414 bytes received via SQL*Net from client
172 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
851334 rows processed
Thank you for this utility.
I have one remark. To make this work with UTF-8 data, you have to declare the char variables like this:
l_char char(1 char);
l_lookahead char(1 char);
Chris, your little utility here was just what I needed. I’ve adapted it to my needs, and it works great; thank you!
Hi Chris,
I tried this, but unfortunately I am only able to get the first line from my CSV file; I can't get the other lines to output. It seems it's not looping through the other lines within the CSV, and I'm not sure if I am missing something. The odd thing is that when I hard-code the lines it works, just not from the file.
John
John, I'm not sure what the issue could be. What is the newline character in your file?
Also, you may want to look into using the supplied package function APEX_DATA_PARSER.PARSE(). You can supply the function a blob that is your CSV file and use it as a table in a select statement. Far superior to my function here, which I wrote a decade ago. Just a thought.
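For example, something like this (a sketch; it assumes APEX 19.1 or later is installed, :csv_blob is a bind variable holding your uploaded file, and col001, col002, ... are the generic column names the parser generates):

```sql
select line_number, col001, col002, col003
  from table( apex_data_parser.parse(
                p_content   => :csv_blob,
                p_file_name => 'data.csv' ) );
```

The p_file_name extension is how the parser decides the file type (CSV, XLSX, JSON, XML), so it matters even when you pass the content directly.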
Hi Chris,
I am trying to accomplish this in PL/SQL; we don't have APEX as a plug-in. My newline character shown in Notepad++ is "CR LF".
APEX is implemented 100% in PL/SQL. You could download and install it via sqlplus. But regardless, can you send me an example of the file you are trying to parse so I can test?
Thanks I sent the file to your email address christopher.beck@oracle.com