Click below or use the site search to find answers to your questions
What skills do I need to use this service?
You need to be able to upload your file to our website and download your results file from our website.
Once you have your results file, you will need some basic Excel skills to work with it. We suggest filtering in Excel and sorting on the scores we have provided.
If you have chosen the high volume/high security option, you will need to be able to create a unique integer rownumber as the first column of your file. This can be done in Excel using the formula =ROW(). Then, when you receive your results file, you will need to merge the two files on rownumber. Since both files will be in the same order, this could be as easy as copying and pasting in Excel, or you could use the Excel function VLOOKUP().
If your file has more than 100,000 records it will always be processed using the high volume/high security method. This means that you must provide a unique integer rownumber as the first column of your file if your file is this large.
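If you prefer scripting to Excel, the two steps above (adding a rownumber, then merging the results back on rownumber) could be sketched in Python as follows. This is only an illustration under stated assumptions: the files are UTF-8 csv files with a header row, the results file has rownumber as its first column, and the function names and the column names used in the usage note are hypothetical.

```python
import csv


def add_rownumber(in_path, out_path):
    """Copy a csv file, adding a unique integer 'rownumber' as the first column."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)  # assumes the first line is a header row
        writer.writerow(["rownumber"] + header)
        for number, row in enumerate(reader, start=1):
            writer.writerow([number] + row)


def merge_on_rownumber(original_path, results_path, out_path):
    """Append the results columns to the original rows, matching on rownumber."""
    with open(results_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        result_header = next(reader)
        # Assumes rownumber is the first column of the results file.
        results = {row[0]: row[1:] for row in reader}

    blanks = [""] * (len(result_header) - 1)
    with open(original_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header + result_header[1:])
        for row in reader:
            # Rows with no match in the results file get blank result columns.
            writer.writerow(row + results.get(row[0], blanks))
```

For example, add_rownumber("mydata.csv", "upload.csv") would produce the file to upload, and merge_on_rownumber("upload.csv", "results.csv", "merged.csv") would attach the result columns to your original records.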
I am trying your service for the first time. How should I proceed?
First, upload one of our free trial datasets and go through the process. It will cost you nothing and you will gain a better understanding of how the service works and what to expect.
Next, upload your own data. After your file is uploaded successfully, read the messages produced and wait for the free data profiling report to be generated. Save this report to your computer and spend time looking at it. The report contains a profile of each column and you can click on “toggle details” to find out more. Does the profiling report accurately reflect your understanding of the data? If not, you might need to make changes and upload a new file.
Once you are satisfied with the file you have uploaded, you are ready to shop. Your uploaded file will create inventory, including one or more data retention options. If you are still unsure about placing an order at this stage, you could place a smaller order. To do this, go back and make a subset of your file and upload this smaller file. Then you can shop using the smaller file instead.
Are there any constraints on the file I can provide?
Yes, the file must be either .csv or .xlsx. If your file has more than 100,000 records, you must provide a unique integer rownumber as the first column of your file. The number of columns should not exceed 200. The overall filesize should not exceed 500 MB for csv files (50 MB for Excel). At approximately 100 million cells of data, this limit is enough for most situations.
If your Excel file is too large, save it as .csv instead.
If you are bumping up against the column limit, you could upload datasets containing a subset of the columns. The data profiling report will point out any columns that are potentially redundant. You could then amend your dataset and resubmit it.
If you are bumping up against the size limit (but not the column limit), you could divide your dataset into smaller pieces and submit them as separate datasets.
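The csv constraints in this answer can be checked on your own computer before you upload. The sketch below is an illustration only and is not the validation our website performs: the function name is hypothetical, it assumes a UTF-8 csv file with a header row, and it assumes 1 MB means 1024 × 1024 bytes.

```python
import csv
import os


def check_csv(path, max_columns=200, max_bytes=500 * 1024 * 1024,
              rownumber_threshold=100_000):
    """Return a list of likely problems with a csv file (empty if none found)."""
    problems = []
    if os.path.getsize(path) > max_bytes:
        problems.append("file size exceeds the limit")

    rows = 0
    seen = set()
    rownumbers_ok = True
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # assumes the first line is a header row
        if len(header) > max_columns:
            problems.append("too many columns")
        for row in reader:
            rows += 1
            if not rownumbers_ok:
                continue  # keep counting rows, but stop checking the first column
            try:
                value = int(row[0])
            except (ValueError, IndexError):
                rownumbers_ok = False
                continue
            if value in seen:
                rownumbers_ok = False
            seen.add(value)

    # A unique integer first column is only required for large files.
    if rows > rownumber_threshold and not rownumbers_ok:
        problems.append("first column is not a unique integer rownumber")
    return problems
```

An empty list from check_csv("mydata.csv") suggests the file is within the limits described above; the data profiling report remains the authoritative check.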
What will my results file look like?
The results file contains two new columns: an outlier score and a reason code. The outlier score is a rank: a rank of 1 indicates the biggest outlier in your dataset. The reason code gives you hints on why the record looks unusual. Ranks and reason codes are only provided for the top 3% of records in your dataset.
Worked examples are here.
I think my data contains more outliers than 3%. How can I identify these?
Outliers are, by definition, unusual and 3% should cover most cases. To obtain the next 3%, prepare a new input file from the old one and remove the 3% of records that have already been identified. Upload your new file and submit a new order.
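As an alternative to filtering in Excel, the new input file could be prepared with a short Python script. This is a sketch under assumptions: it starts from a results file that still contains your original columns, it assumes records outside the top 3% have a blank outlier score, and the column names outlier_score and reason_code are hypothetical placeholders for the two new columns in your results file.

```python
import csv


def remaining_records(results_path, out_path,
                      score_column="outlier_score",
                      reason_column="reason_code"):
    """Write the records that were NOT flagged as outliers to a new csv file.

    Returns the number of records written.
    """
    kept = 0
    with open(results_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # Keep only the original columns in the new input file.
        keep = [name for name in reader.fieldnames
                if name not in (score_column, reason_column)]
        writer = csv.DictWriter(dst, fieldnames=keep, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            # Assumption: a blank score means the record was not in the top 3%.
            if not (row[score_column] or "").strip():
                writer.writerow(row)
                kept += 1
    return kept
```

The file written to out_path can then be uploaded as a new dataset.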
File validation has given me a message about Unicode UTF-8. What does this mean?
Your file may contain unusual or foreign characters. For example, names may contain accented characters. Converting your data to UTF-8 will ensure that those characters are properly preserved. Several tools can be used to make the conversion. We recommend Notepad++. In Notepad++, open your csv file, go to encoding->convert to UTF-8, then save your file.
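If you would rather script the conversion, a minimal Python sketch is shown below. It assumes you know the file's current encoding; the default of cp1252 (a common Windows encoding) is only a guess, and the function name is hypothetical. If the guess is wrong, accented characters may come out incorrectly, so check the output in a text editor.

```python
def convert_to_utf8(in_path, out_path, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8, leaving line endings unchanged."""
    # newline="" stops Python from translating line endings on read or write.
    with open(in_path, "r", encoding=source_encoding, newline="") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        # Copy in 1 MB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: src.read(1 << 20), ""):
            dst.write(chunk)
```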
I have opened my results file in Excel and some of the characters are funny
The results file is encoded in Unicode UTF-8 and should preserve all characters. But if you just click to open the csv file in Excel, it may not display correctly if your file contains unusual or foreign characters. If instead you start with a blank workbook and do data ->from text->[results file name] and say the data origin is Unicode UTF-8, you should get more sensible characters. An example of this is the free trial dataset “Fifa19”. You may need to open the results file this way to get the names of the players and clubs to show up correctly.
Placing an order
Why do you collect my shipping address and geo location data?
This is to meet our tax obligations. We are based in Canada and collect sales taxes from Canadian taxpayers. If your physical address is outside Canada, your geo location is outside Canada, and you self-declare at checkout, then there will be no taxes on your order.
When will my results be ready?
Most results are ready within an hour of your order. The time to process a file varies and depends on the size and character of the file.
We live in Toronto, and remember the blackout of 2003 and the ice storm of 2013, both of which left us without power for days. For this reason, we set ourselves an overall service level of three Canadian business days. If your order is delayed, we will advise you as soon as possible. If we do not meet our service level, you will receive a full refund. Depending on the nature of the issue, we may or may not be able to deliver your results file. Either way, you will not pay for the work.
How are prices determined?
Pricing depends on two things. First, it depends on your data file: the number of rows, the number of columns and the type of data within it. These things have a direct impact on our processing costs. Second, it depends on the data retention option you choose: standard, high volume/high security or donation.
Clearly, you could save money by skipping rows or columns but you should be careful. We do have a minimum price, and you can check this by uploading a really small file. If you are at the minimum price, then the price is the price.
Assuming that you are above the minimum price, you need to understand that all rows are relevant since the algorithm needs to distinguish between “normal” data and “outlier” data. So, if you remove what looks like obviously “normal” data, this impairs the analysis. A better rule for removing rows may be based on time. Let’s say your data contains records from five years ago. Are the patterns in those old records still relevant? If you were to find outliers in those older records, would you be able to act on them? If not, then you can omit those older records.
With columns, if you see no reason why a column might distinguish an outlier from a normal record, then you can remove it. An example of such a column is one that has the same value for all records. Another example is a column that uniquely identifies the record. The data profiling report will point out any columns that are potentially redundant. Again, be careful about this because seemingly uninteresting columns like timestamps or customer IDs provide context to the algorithms. And, under the high volume/high security option, you must provide an integer rownumber as the first column of your file.
Tell me more about the file donation data retention option
File donation is a data retention option available to you if the dataset you are working with is either public or, if it is private, you have the authority to donate it to us. If you donate a file, it should not contain personal or sensitive information.
This option means that users such as students, researchers, journalists and small manufacturers are given a discount when they place an order.
Once a file is donated to us, it becomes our property. We may choose to use it for any purpose, or we may choose to delete it. We will send you an email acknowledging the donation and we do keep a record of each donation. Your name and organization will be associated with the dataset and we may publicly credit the donated dataset to you and your organization.
How safe is my data?
All data is stored on Amazon Web Services servers located in Canada. Amazon Web Services (AWS) is a trusted name in e-commerce and cloud computing.
Our website is encrypted to the typical standard for e-commerce. Other security measures include strong passwords, Google reCAPTCHA and optional two-factor authentication. This protects the modest amount of data that we collect about our customers and their orders. We do not store credit card numbers; instead, we use PayPal to securely collect payments.
When you upload a data file, it goes directly to an S3 bucket at AWS and it is encrypted both in transit and when it reaches its destination. Customer data files follow a data retention policy (see the next question).
What is your data retention policy?
|File type|Retention period|
|---|---|
|Invalid upload file|Five days. Can be deleted sooner from the tidy uploads menu|
|Valid upload file which is not part of an order|31 days. Can be deleted sooner from the tidy uploads menu|
|Upload feedback report|Indefinitely|
|Data profiling report|Indefinitely|
|Valid upload file which is part of an order|One day under the high volume/high security option, 31 days under the standard option, indefinitely under the donate my file option|
|Derived files and intermediate results files generated during order processing|One day under the high volume/high security option, 31 days under the standard option, indefinitely under the donate my file option|
|Final results files|31 days from order completion|
Customer data files follow a data retention policy. Uploaded customer data files are retained for one month or, under the high volume/high security option, your data is deleted a day after your order is completed. Any other copies of your data follow the same policy, as do intermediate results files produced during order processing.
Results files also follow a retention policy. Results files are retained for one month. Under the high volume/high security option, the results file only contains one column from your original data (rownumber).
The only exception to this is when you place an order using the donate my file option. Under this option, Penny Analytics may retain the uploaded customer data file and all its artifacts.
How can I prepare my file if it contains sensitive data?
If you have sensitive data, you should consider taking extra steps. Firstly, you can choose to simply omit certain columns (such as customer names) from your data file. Or data can be made anonymous using techniques such as masking. Columns can be given new names like COL1, COL2, etc. Data changes like these will not impact the outlier detection.
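The steps above could be scripted in Python along the following lines. This is a sketch, not a recommendation of a specific method: masking is done here with a salted SHA-256 hash, which is just one possible technique and is our assumption for the example. The function names are hypothetical, and the file is assumed to be a UTF-8 csv with a header row. Note that this sketch renames every column, including any rownumber column, and that hashing alone may not fully anonymize values that are easy to guess.

```python
import csv
import hashlib


def mask(value, salt):
    """Replace a value with a short salted hash; equal inputs give equal outputs."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def anonymize(in_path, out_path, mask_columns, salt):
    """Mask the named columns and rename all columns to COL1, COL2, etc.

    Returns a dict mapping new column names to the original names, which you
    should keep private so you can interpret the results later.
    """
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        mask_index = {header.index(name) for name in mask_columns}
        new_names = [f"COL{i}" for i in range(1, len(header) + 1)]
        writer.writerow(new_names)
        for row in reader:
            writer.writerow([mask(value, salt) if i in mask_index else value
                             for i, value in enumerate(row)])
    return dict(zip(new_names, header))
```

Because equal values receive the same masked value, a masked column still behaves like the original categorical column.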
Can I send you personal data about my customers?
Please avoid sending personal data. Firstly, you need the consent of your customer to share such data. Secondly, data like customer names rarely helps the outlier detection. If you still want to use sensitive data, the answer above suggests how to prepare it securely. Then, you should always choose the high volume/high security option, so that your uploaded file is deleted promptly.
Once I upload my data, what happens to it?
When you upload your data, it is sent, encrypted, directly to an Amazon Web Services (AWS) S3 bucket located in AWS’s Canada region. The bucket is private and the data within it remains encrypted “at rest”. After we receive your data, we store only high level statistics about your file. No further data is created until after you place an order.
After an order is placed and processing begins, derived copies of your data are also stored in secure AWS S3 buckets. The processing also generates intermediate results files, which contain a rownumber and scores only.
Once processing is complete, your results file is made available via a secure URL which has an expiry date. We do not send URLs in emails; instead, you need to sign in to the website and go to the downloads section. The URLs are updated every few days and are taken down from the website after a month. Your results file will be deleted one month after your order is completed.
All other data files (uploaded files, derived copies and intermediate results) are also permanently deleted after one month, but this is counted from the date of upload. These files are deleted even more quickly under the high volume/high security option, which you can choose when you place your order. Also, if an uploaded file is not part of an order, you can choose to delete it earlier using the tidy uploads menu.
What data do you retain?
We keep your customer information and a history of your orders, and you can view these details when you sign in to the website.
We also keep high level statistics about uploaded customer files. For example, we know the name of the file, when it was uploaded, the file size, the number of rows and the number of columns. The information in the upload feedback report provided to you at the time of data upload is retained and this contains some information about the dataset e.g. is it a time series, does it contain dates. The information in the data profiling report provided to you if your upload was successful is also retained and this includes a summary of each column, such as the mean, min and max values.
Time series and dates
My data is a time series. Can you process time series?
Yes, our system can recognize time series data files and apply the relevant algorithms to them.
Time series files include readings that are taken at a regular interval. In order to have your file recognized as a time series, it should contain just one date/datetime column and be sorted in that order. There should be no missing values in the date/datetime column and the time interval between each row should always be the same. During file upload, if your file has been recognized as a time series, we will advise you.
Example: Let’s say we have a data file from the Bank of Canada. It contains daily exchange rates for three currencies USD-CAD, GBP-CAD and EUR-CAD. But because forex exchanges are not open on weekends or holidays, there are gaps in the series. The time interval between rows is not the same. This data file needs to be fixed so that it includes rows for the missing dates. (You can choose to keep the values in those rows as blank or instead carry forward the values from the previous row – we can handle either scenario).
How to check for time series gaps in Excel: Copy your date/datetime column to a new sheet. Next to your date/datetime column, write a formula which subtracts the first date from the second date. Copy this formula down. Filter on the formula column and check that there are only two possible values. Select one, then the other. All but one row should contain the same value; the exception is typically the last row, which has no following date to subtract from.
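The same gap check, and the fix described in the Bank of Canada example, could also be sketched in Python. This is an illustration only: the function names are hypothetical, the gap filler assumes a daily series already sorted by date, and blank values are represented as None.

```python
from collections import Counter
from datetime import date, timedelta


def interval_counts(stamps):
    """Count each distinct interval between consecutive dates or datetimes."""
    return Counter(later - earlier for earlier, later in zip(stamps, stamps[1:]))


def fill_daily_gaps(rows, carry_forward=False):
    """Insert a row for every missing calendar day.

    rows is a list of (date, values) pairs sorted by date. Inserted rows hold
    either None values (blank) or a copy of the previous row's values.
    """
    filled = []
    for day, values in rows:
        if filled:
            previous_day, previous_values = filled[-1]
            missing = previous_day + timedelta(days=1)
            while missing < day:
                if carry_forward:
                    filler = list(previous_values)
                else:
                    filler = [None] * len(previous_values)
                filled.append((missing, filler))
                missing += timedelta(days=1)
        filled.append((day, list(values)))
    return filled
```

A series with no gaps gives a single entry in interval_counts. More than one entry means the interval between rows is not constant, and the file would need fixing before it can be recognized as a time series.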
Finally, we recommend you avoid ambiguous dates – see next question.
What are ambiguous dates and why should I avoid them?
There are many possible date formats and some of them are ambiguous. To avoid any confusion, we recommend your dates include all four digits of the year (2001, 2011 etc.) and spell out the month (Jan, Feb, Mar etc.).
Ambiguous dates in a csv file can be fixed by reading the file into Excel. Select the column, go to number formats, choose an unambiguous format, and save the file again as csv. An example of an unambiguous format is:
“02-Feb-2008 1:00:00 AM”
This can be achieved by using the following custom format: “dd-mmm-yyyy h:mm:ss AM/PM”. If your date does not include time you can use “dd-mmm-yyyy” instead. You can check that this worked correctly by opening up your new csv file in a text editor such as Notepad. (If you open it up again in Excel, it may look like nothing changed).
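If you are preparing your file with a script rather than Excel, the same reformatting could be sketched in Python. This is an illustration under assumptions: the function name is hypothetical, you must tell the parser how your existing dates are laid out (the default here is day-month-two-digit-year with a 24-hour clock), and month names and AM/PM are produced in English only under Python's default locale.

```python
from datetime import datetime


def to_unambiguous(text, input_format="%d-%m-%y %H:%M:%S"):
    """Rewrite a date string with a four-digit year and a spelled-out month."""
    parsed = datetime.strptime(text, input_format)
    # %b is the abbreviated month name, %I the 12-hour clock hour, %p AM or PM.
    return parsed.strftime("%d-%b-%Y %I:%M:%S %p")


print(to_unambiguous("01-02-08 01:00:00"))  # 01-Feb-2008 01:00:00 AM
```

The important design choice is that input_format is stated explicitly, so the interpretation of the day, month and year is decided by you and not guessed.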
If your uploaded data file contains dates, please review the data profiling report we provide. Look at variables->your date variable->toggle details->histogram and check that the distribution of dates is what you expect.
Dates are important to outlier detection because they provide context e.g. there is more shopping in December than other months. Dates also help us identify time series, which require their own set of algorithms.
If you provide dates that are ambiguous, there is a risk that we will misinterpret them. An example is “01-02-08 01:00:00”. We will assume this is 1 February 2008 at 1:00 AM; that is, we assume that the first number of the date is the day and the last number is the year. If AM or PM is not specified, we assume a 24-hour clock. Please remove this risk by formatting your dates unambiguously.