I was a little confused when I first used the td CLI to import my own data into the Treasure Data service. For bulk import, there are a few concepts you should know. None of them are difficult, and once you understand the internals, you can make the import process more efficient. The details are below.
Steps
Bulk import with the td CLI is done in several steps. To begin with, I’d like to explain each of them.
create
td import has a concept called a session. A session lets you upload multiple chunks of data and import them transactionally. To create a session, you need a database name and a table name.
$ td import:create my_session my_db my_table
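To double-check that the session exists, you can list your sessions. As far as I know, import:list also shows each session’s status, so it is handy throughout this workflow:

$ td import:list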
prepare
Your original data is often huge, and uploading it as-is can be troublesome and inefficient. The prepare step converts the data into MessagePack format and compresses it. No data is transferred to the Treasure Data service in this phase; everything in the prepare step runs on your local machine.
$ td import:prepare ./mylogs_20151028.csv \
--format csv \
-o ./output_20151028
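After prepare finishes, the output directory contains compressed MessagePack part files. The exact file names depend on your input; the listing below is just illustrative of what you should see:

$ ls ./output_20151028
mylogs_20151028_0.msgpack.gz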
upload
Once the data is prepared, you can upload it into your session with the upload subcommand.
$ td import:upload my_session ./output_20151028/*
The data is uploaded over a secure connection into Treasure Data’s row-based storage system.
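You can check which parts have been uploaded into the session so far; import:show lists the uploaded parts by name.

$ td import:show my_session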
perform
Next, the uploaded data is transformed into our column-oriented data format using MapReduce, which converts it into a more efficient format for querying.
## Prevent other scripts from uploading
## more data into this session
$ td import:freeze my_session
$ td import:perform my_session
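import:perform kicks off a MapReduce job on the Treasure Data side and prints its job ID. You can check on that job with job:show; the job ID 123456 below is just a placeholder for whatever ID perform printed.

$ td job:show 123456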
Your data will then be stored in columnar storage. If you want to upload additional data within the same session, use the unfreeze command.
$ td import:unfreeze my_session
commit
After you confirm that the perform job has completed, you can commit the data into the target table of your database.
$ td import:commit my_session
That is everything. If you have no additional data to upload, you can delete the session used for this import.
$ td import:delete my_session
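To confirm the data actually landed, you can run a quick query against the target table; the -w flag waits for the query job to finish before printing the result.

$ td query -w -d my_db "SELECT COUNT(1) FROM my_table"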
Last but not least
Although it is important to understand the internals of the import process, it is tedious to run every step by hand each time. You can do the whole process with one command by using the --auto-XX options. This is the easiest way to import!
$ td import:upload \
--auto-create my_database.my_table \
--auto-perform \
--auto-commit \
--column-header \
--output output_today \
data_*.csv
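This single command covers all of the steps above: it creates a session for my_database.my_table, prepares and uploads data_*.csv (treating the first row of each file as column names), performs the MapReduce job, and commits the result into the table.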