A dataset is made up of one or more data sources, which can be files, database tables or queries, or data from an Excel workbook.
The input data can then be combined, cleaned, and augmented to fit your specific needs.
Datasets are easy to create, and XLFetch will also create one automatically if you specify a sheet name, table name, or named range name in a query.
This makes it easy to use data from your current workbook, provided the data is mostly clean.
By default XLFetch stores your data in a SQLite database locally on your PC. SQLite is reliable and used by millions of applications worldwide (i.e., it's not proprietary to XLFetch).
Advanced users can also save their data to an RDBMS such as MS SQL Server, Postgres, or Oracle MySQL.
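If you're curious what that local store looks like, here's a rough sketch in Python using plain SQLite (an illustration only, not XLFetch's internal code; the file name and table are hypothetical):

```python
import sqlite3

# Open (or create) a local SQLite database file on your PC.
# "xlfetch_demo.db" and the "bakery" table are hypothetical names.
con = sqlite3.connect("xlfetch_demo.db")
con.execute("CREATE TABLE IF NOT EXISTS bakery (sale_date TEXT, item TEXT, qty INTEGER)")
con.execute("INSERT INTO bakery VALUES ('2023-01-05', 'Baguette', 3)")
con.commit()

# Any SQLite client can read the same file back.
for row in con.execute("SELECT item, SUM(qty) FROM bakery GROUP BY item"):
    print(row)
con.close()
```

Because the data ends up in an ordinary SQLite file, you can inspect it with any standard SQLite tool.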
To create a dataset:
1. Choose a data source.
2. Set up your destination.
XLFetch will show you a small preview of each data source, and provide a way to clean your data with a few of the most frequently used transformation rules.
3. Add one or more data transformation rules (optional). In most cases you'll want to make sure the values in your data are unique within each column (see the sketch after this list).
4. Click 'Process' on the ribbon to process your data.
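The uniqueness check mentioned in step 3 is essentially a duplicate check on a column. Here's a rough sketch of the idea in plain SQLite via Python (not XLFetch's actual rule engine; the table and column names are hypothetical):

```python
import sqlite3

# Hypothetical "customers" table with a column that should be unique.
con = sqlite3.connect("xlfetch_demo.db")
con.execute("CREATE TABLE IF NOT EXISTS customers (customer_id INTEGER, email TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "a@example.com"), (2, "b@example.com"), (3, "a@example.com")])

# Report values that appear more than once in the column...
dupes = con.execute("""
    SELECT email, COUNT(*) AS n
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('a@example.com', 2)]

# ...or keep only the first row per value (one way to enforce uniqueness).
con.execute("""
    DELETE FROM customers
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM customers GROUP BY email)
""")
con.commit()
con.close()
```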
To tune retrieval performance:
1. Run 'Analyze Retrieve Sheet'.
2. Review the results, and consider adding one or more aggregates and/or indexes inside the 'Performance' ribbon item. This will increase processing time but decrease retrieval time, so there is a trade-off.
The results show which cells were slow, as well as other helpful statistics. If a small % of the table was queried, an index should help. However, if a large % of the rows was queried, an aggregate will probably be a better choice.
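To give a feel for the trade-off, here's a rough sketch of what an index versus an aggregate table looks like in plain SQLite (an illustration only, not the exact objects XLFetch creates; the table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect("xlfetch_demo.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, store TEXT, amount REAL)")

# An index helps when retrieval touches a small % of the rows
# (e.g. looking up one store), at the cost of extra processing time up front.
con.execute("CREATE INDEX IF NOT EXISTS ix_sales_store ON sales (store)")

# An aggregate (summary) table helps when retrieval scans a large % of the rows,
# because the heavy GROUP BY work is done once during processing.
con.execute("DROP TABLE IF EXISTS agg_sales_by_store")
con.execute("""
    CREATE TABLE agg_sales_by_store AS
    SELECT store, SUM(amount) AS total_amount, COUNT(*) AS n_rows
    FROM sales
    GROUP BY store
""")
con.commit()

# Retrieval then reads the small summary table instead of the full dataset.
print(con.execute("SELECT * FROM agg_sales_by_store").fetchall())
con.close()
```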
Processing time depends on several things, including your computer's speed, whether you have a traditional or solid-state hard drive, the types of transformations you need to perform, and of course how much data you have.
The 'bakery' dataset used in most of the demos has 1.8 million rows (a ~155 MB CSV file); with no transformations and 5 aggregate tables, it takes 58 seconds to process on my 8+ year old desktop (with a traditional hard drive).
I've also run much larger loads of 10-20 million records from 2-3 GB CSV files, which take 50-70 minutes.
Basic data cleaning, renaming columns, etc. are very quick, while merging multiple sources together can increase the runtime for very large datasets. It also turns out that SQLite is quite efficient at storing data, and is typically much faster than saving data to an external database like MS SQL Server, Postgres, or Oracle MySQL.
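For a sense of what a local load involves, here's a minimal sketch of streaming a CSV into SQLite with Python (again an illustration, not XLFetch's loader; the file and table names are hypothetical):

```python
import csv
import sqlite3
import time

# Hypothetical input file with a header row: sale_date,item,qty
start = time.time()
con = sqlite3.connect("xlfetch_demo.db")
con.execute("DROP TABLE IF EXISTS bakery")
con.execute("CREATE TABLE bakery (sale_date TEXT, item TEXT, qty INTEGER)")

with open("bakery.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    con.executemany("INSERT INTO bakery VALUES (?, ?, ?)", reader)

con.commit()
con.close()
print(f"Loaded in {time.time() - start:.1f}s")
```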
You can connect directly to a database as a source; however, in the vast majority of cases I recommend using the 'Create CSV from Sql' option when adding a source instead.
Downloading the data locally gives you control over it, lets you transform and clean it quickly and iteratively, and keeps it available when your database is down for maintenance or other issues.
XLFetch also has to gather statistics on the data you wish to query from an external database, and this can be slow to execute as well.
Therefore I only recommend trying direct database access for a dataset when the amount of data is so large that it cannot be downloaded locally due to space or runtime constraints.
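The idea behind 'Create CSV from Sql' is simply to run your query against the source database once and keep the result as a local file. Here's a minimal sketch of that pattern with Python's DB-API (the connection, query, and file name are all hypothetical; in practice you'd use your database's own driver rather than sqlite3):

```python
import csv
import sqlite3  # stand-in for any DB-API driver (pyodbc, psycopg2, etc.)

# Connect to the source database once; in practice this would be your
# MS SQL Server / Postgres / MySQL driver rather than sqlite3.
src = sqlite3.connect("source_system.db")
cur = src.execute("SELECT sale_date, item, qty FROM bakery_sales")  # hypothetical query

# Write the result to a local CSV that can then be used as a data source.
with open("bakery_extract.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)

src.close()
```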