forked from apache/commons-csv
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathoverview.html
More file actions
375 lines (373 loc) · 14.8 KB
/
Copy pathoverview.html
File metadata and controls
375 lines (373 loc) · 14.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<html>
<head>
<title>Apache Commons CSV Overview</title>
</head>
<body>
<img src="../images/commons-logo.png" alt="Apache Commons CSV">
<p>
You can find the Javadoc package list at the <a href="#all-packages-table">bottom of this page</a>.
</p>
<section>
<h1>Introducing Commons CSV</h1>
<p>Apache Commons CSV reads and writes files in variations of the Comma Separated Value (CSV) format.</p>
<p>
Common CSV formats are predefined in the <a href="org/apache/commons/csv/CSVFormat.html">CSVFormat</a> class:
<table>
<caption>CSV Formats</caption>
<thead>
<tr>
<th scope="col">CSVFormat</th>
<th scope="col">Description</th>
<th scope="col">Since Version</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#DEFAULT">DEFAULT</a></td>
<td>IO for the Standard Comma Separated Value format, like <a href="https://datatracker.ietf.org/doc/html/rfc4180">RFC 4180</a> but allowing
empty lines.
</td>
<td>1.0</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#EXCEL">EXCEL</a></td>
<td>IO for the <a href="https://support.microsoft.com/en-us/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba">Microsoft
Excel CSV.</a> format.
</td>
<td>1.0</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#INFORMIX_UNLOAD">INFORMIX_UNLOAD</a></td>
<td>IO for the <a href="https://www.ibm.com/docs/en/informix-servers/14.10?topic=statements-unload-statement">Informix UNLOAD TO file_name</a>
command.
</td>
<td>1.3</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#INFORMIX_UNLOAD_CSV">INFORMIX_UNLOAD_CSV</a></td>
<td>IO for the <a href="https://www.ibm.com/docs/en/informix-servers/14.10?topic=statements-unload-statement">Informix UNLOAD CSV TO
file_name</a> command with escaping disabled.
</td>
<td>1.3</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#MONGODB_CSV">MONGODB_CSV</a></td>
<td>IO for the <a href="https://docs.mongodb.com/manual/reference/program/mongoexport/">MongoDB CSV <code>mongoexport</code></a> command.
</td>
<td>1.7</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#MONGODB_TSV">MONGODB_TSV</a></td>
<td>IO for the <a href="https://docs.mongodb.com/manual/reference/program/mongoexport/">MongoDB Tab Separated Values (TSV)<code>mongoexport</code></a>
command.
</td>
<td>1.7</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#MYSQL">MYSQL</a></td>
<td>IO for the <a href="https://dev.mysql.com/doc/refman/8.0/en/mysqldump-delimited-text.html">MySQL CSV</a> format.
</td>
<td>1.0</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#ORACLE">ORACLE</a></td>
<td>IO for the <a href="https://docs.oracle.com/database/121/SUTIL/GUID-D1762699-8154-40F6-90DE-EFB8EB6A9AB0.htm#SUTIL4217">Oracle CSV</a> format
of the SQL*Loader utility.
</td>
<td>1.6</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#POSTGRESQL_CSV">POSTGRESQL_CSV</a></td>
<td>IO for the <a href="https://www.postgresql.org/docs/current/static/sql-copy.html">PostgreSQL CSV</a> format used by the <code>COPY</code>
operation.
</td>
<td>1.5</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#POSTGRESQL_TEXT">POSTGRESQL_TEXT</a></td>
<td>IO for the <a href="https://www.postgresql.org/docs/current/static/sql-copy.html">PostgreSQL Text</a> format used by the <code>COPY</code>
operation.
</td>
<td>1.5</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#RFC4180">RFC4180</a></td>
<td>IO for the RFC-4180 format defined by<a href="https://datatracker.ietf.org/doc/html/rfc4180">RFC 4180</a>.
</td>
<td>1.0</td>
</tr>
<tr>
<td><a href="org/apache/commons/csv/CSVFormat.html#TDF">TDF</a></td>
<td>IO for the <a href="https://en.wikipedia.org/wiki/Tab-separated_values">Tab Delimited Format</a> (also known as Tab Separated Values).
</td>
<td>1.0</td>
</tr>
</tbody>
</table>
<p>Custom formats can be created using a fluent style API.</p>
</section>
<section>
<h1>Parsing Standard CSV Files</h1>
<p>
Parsing files with Apache Commons CSV is relatively straight forward. Pick a
<code>CSVFormat</code>
and go from there.
</p>
<section>
<h2>Parsing an Excel CSV File</h2>
<p>To parse an Excel CSV file, write:</p>
<pre>
<code>
Reader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.EXCEL.parse(in);
for (CSVRecord record : records) {
String lastName = record.get("Last Name");
String firstName = record.get("First Name");
}
</code>
</pre>
</section>
</section>
<section>
<h1>Parsing Custom CSV Files</h1>
<p>
You can define your own using IO rules by building your own CSVFormat instance. Starting with
<code>CSVFormat.builder()</code>
lets you start from a predefined format and customize. For example:
</p>
<pre>
<code>
CSVFormat myFormat = CSVFormat.DEFAULT.builder()
.setCommentMarker('#')
.setEscape('+')
.setIgnoreSurroundingSpaces(true)
.setQuote('"')
.setQuoteMode(QuoteMode.ALL)
.get()
</code>
</pre>
</section>
<section>
<h1>Handling Byte Order Marks</h1>
<p>
To handle files that start with a Byte Order Mark (BOM), like some Excel CSV files, you need an extra step to deal with the optional BOM bytes. Using the
<a href="https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html"> BOMInputStream </a> class from <a
href="https://commons.apache.org/proper/commons-io/">Apache Commons IO</a> simplifies this task; for example:
</p>
<pre>
<code>
try (Reader reader = new InputStreamReader(BOMInputStream.builder()
.setPath(path)
.get(), "UTF-8");
CSVParser parser = CSVFormat.EXCEL.builder()
.setHeader()
.get()
.parse(reader)) {
for (CSVRecord record : parser) {
String string = record.get("ColumnA");
// ...
}
}
</code>
</pre>
<p>You might find it handy to create something like this:</p>
<pre>
<code>
/**
* Creates a reader capable of handling BOMs.
*
* @param path The path to read.
* @return a new InputStreamReader for UTF-8 bytes.
* @throws IOException if an I/O error occurs.
*/
public InputStreamReader newReader(final Path path) throws IOException {
return new InputStreamReader(BOMInputStream.builder()
.setPath(path)
.get(), StandardCharsets.UTF_8);
}
</code>
</pre>
</section>
<section>
<h1>Using Headers</h1>
<p>
Apache Commons CSV provides several ways to access record values. The simplest way is to access values by their index in the record. However, columns in
CSV files often have a name, for example: ID, CustomerNo, Birthday, etc. The CSVFormat class provides an API for specifying these <i>header</i> names and
CSVRecord on the other hand has methods to access values by their corresponding header name.
</p>
<section>
<h2>Accessing column values by index</h2>
<p>To access a record value by index, no special configuration of the CSVFormat is necessary:</p>
<pre>
<code>
Reader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.RFC4180.parse(in);
for (CSVRecord record : records) {
String columnOne = record.get(0);
String columnTwo = record.get(1);
}
</code>
</pre>
</section>
<section>
<h2>Defining a header manually</h2>
<p>Indices may not be the most intuitive way to access record values. For this reason it is possible to assign names to each column in the file:</p>
<pre>
<code>
Reader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.RFC4180.builder()
.setHeader("ID", "CustomerNo", "Name")
.build()
.parse(in);
for (CSVRecord record : records) {
String id = record.get("ID");
String customerNo = record.get("CustomerNo");
String name = record.get("Name");
}
</code>
</pre>
Note that column values can still be accessed using their index.
</section>
<section>
<h2>Using an enum to define a header</h2>
<p>Using String values all over the code to reference columns can be error prone. For this reason, it is possible to define an enum to specify header
names. Note that the enum constant names are used to access column values. This may lead to enums constant names which do not follow the Java coding
standard of defining constants in upper case with underscores:</p>
<pre>
<code>
public enum Headers {
ID, CustomerNo, Name
}
Reader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.RFC4180.builder()
.setHeader(Headers.class)
.build()
.parse(in);
for (CSVRecord record : records) {
String id = record.get(Headers.ID);
String customerNo = record.get(Headers.CustomerNo);
String name = record.get(Headers.Name);
}
</code>
</pre>
Again it is possible to access values by their index and by using a String (for example "CustomerNo").
</section>
<section>
<h2>Header auto detection</h2>
<p>Some CSV files define header names in their first record. If configured, Apache Commons CSV can parse the header names from the first record:</p>
<pre>
<code>
Reader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.RFC4180.builder()
.setHeader()
.setSkipHeaderRecord(true)
.build()
.parse(in);
for (CSVRecord record : records) {
String id = record.get("ID");
String customerNo = record.get("CustomerNo");
String name = record.get("Name");
}
</code>
</pre>
This will use the values from the first record as header names and skip the first record when iterating.
</section>
</section>
<section>
<h1>Printing with headers</h1>
<p>To print a CSV file with headers, you specify the headers in the format:</p>
<pre>
<code>
Appendable out = ...;
CSVPrinter printer = CSVFormat.DEFAULT.builder()
.setHeader("H1", "H2")
.build()
.print(out);
</code>
</pre>
<p>To print a CSV file with JDBC column labels, you specify the ResultSet in the format:</p>
<pre>
<code>
try (ResultSet resultSet = ...) {
CSVPrinter printer = CSVFormat.DEFAULT.builder()
.setHeader(resultSet)
.build()
.print(out);
}
</code>
</pre>
</section>
<section>
<h1>Working with JDBC</h1>
<section>
<h2>Exporting JDBC Result Sets</h2>
<p>
To export row data from a JDBC
<code>ResultSet</code>
, use <a href="org/apache/commons/csv/CSVPrinter.html#printRecords(java.sql.ResultSet)">CSVPrinter.printRecords(ResultSet)</a> :
</p>
<pre>
<code>
final StringWriter sw = new StringWriter();
final CSVFormat csvFormat = CSVFormat.DEFAULT;
try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:my_test;", "sa", "")) {
try (Statement stmt = connection.createStatement();
CSVPrinter printer = new CSVPrinter(sw, csvFormat);
ResultSet resultSet = stmt.executeQuery("select ID, NAME, TEXT, BIN_DATA from TEST")) {
printer.printRecords(resultSet);
}
}
final String csv = sw.toString();
System.out.println(csv);
</code>
</pre>
</section>
<section>
<h2>Limiting rows from JDBC Result Sets</h2>
<p>SQL lets you limit how many rows a SELECT statement returns with the LIMIT clause.</p>
<p>
When you can't or don't want to change the SQL used to generate rows, JDBC lets you limit how many rows a JDBC Statement returns with the <a
href="https://docs.oracle.com/en/java/javase/21/docs/api/java.sql/java/sql/Statement.html#setMaxRows(int)">Statement.setMaxRows(int)</a> method.
</p>
<p>
When you get a JDBC ResultSet from an API like <a
href="https://docs.oracle.com/en/java/javase/21/docs/api/java.sql/java/sql/DatabaseMetaData.html#getProcedures(java.lang.String,java.lang.String,java.lang.String)">
DatabaseMetaData.getProcedures(...)</a>, there is no SQL or JDBC Statement to use to set a limit, the ResultSet class does not have an API to limit rows.
</p>
<p>
To simplify limiting ResultSet rows, Commons CVS offers the <a href="org/apache/commons/csv/CSVFormat.Builder.html#setMaxRows(long)">CSVFormat.Builder.setMaxRows(long)</a>
method. For example:
</p>
<pre>
<code>
CSVFormat csvFormat = CSVFormat.DEFAULT
.setMaxRows(5_000)
.get();
try (ResultSet resultSet = ...) {
csvFormat.printer().printRecords(resultSet);
}
</code>
</pre>
<p>
Using the above, calling <a href="org/apache/commons/csv/CSVPrinter.html#printRecords(java.sql.ResultSet)">CSVPrinter.printRecords(ResultSet)</a> will
limit the row count to the maximum number of rows specified in setMaxRows().
</p>
<p>Note that setMaxRows() works with the other methods that print a sequence of records.</p>
</section>
</section>
</body>
</html>