MX and mysql data indexes

jlmr731 · Post by **jlmr731** » Mon 11 Nov 2019 3:06 am

Would like to know if anyone has done any indexing to tables, like monthly or realtime ? I only have a few years on it and about 1.7 million records so in time it will grow even larger, most query's are quick and a few taking less than half to almost a full second to do, as it has been increasing over time. I do one pivot table that takes 2.7 secs and this has me starting to read on optimization. I see there are no indexes in either of these tables and if anyone has something would like to hear about it. Or any one that is a sql guru that can help, with work and life its going to take time to study (yes im looking for a short cut) but also to help others out.

Thanks to all

HansR · Post by **HansR** » Mon 11 Nov 2019 7:00 am

My question would be: what would you like to query ?
In meteorology it is time series which are most interesting and all logfiles are nicely chronological . Personally and through experience, I don' see much optimisation here. Reading ascii files is pretty fast I can tell you and databases have overhead both in searching as in building indexes.

Did you do some measurement and calculations?

Cheers, Hans

mcrossley · Post by **mcrossley** » Mon 11 Nov 2019 12:02 pm

My experience is that *nearly all* of the queries I use invoke full table scans, so indexes are of no use there. I have tried adding some indexes, but saw little performance improvement.

As Hans says you need to analyse how your queries are interrogating the database, then decide if an index could help that query. Over indexing can hit your performance too (at work I once had one Oracle DB query that was taking over 2 hours to process - I removed one index and it took ~20 seconds!) , it is not a simple question, nor a one size fits all answer.

HansR · Post by **HansR** » Mon 11 Nov 2019 1:50 pm

mcrossley wrote: ↑Mon 11 Nov 2019 12:02 pm Over indexing can hit your performance too (at work I once had one Oracle DB query that was taking over 2 hours to process - I removed one index and it took ~20 seconds!) , it is not a simple question, nor a one size fits all answer.

Right! Been there, done that

Always be careful for overkill and idealising databases. Dedicated algorithm may cost less time, than building and maintaining a database. SQL is not free (in resources).

ExperiMentor · Post by **ExperiMentor** » Mon 11 Nov 2019 11:49 pm

I have at the back of my mind a bit of a plan to see if I can import the archive files into Tibco Spotfire.
That'll not be running on my Raspberry Pi of course, but I think it wouldd be a good package to do the heavy lifting without needing dedicated programming.

Any thoughts?

jlmr731 · Post by **jlmr731** » Tue 12 Nov 2019 1:22 am

Ok makes sense , Like i said im not close to an expert I like to read about how things work and saw indexing helps for certain data type and figured this was going to grow over the years so maybe start now while its young. Now hans you made perfect point that it is chronological and just a bunch of numbers and yes mark full scans is what i see take the longest but still very fast. guess its the one query i made that does take up the most time, and im sure i could make it better but it works and gives me the output i want. seems like a simple pivot type (or maybe its case statements) query and doing math averaging does take a bit to do, it was faster with out the pivot but then the php would of been harder (for me) to format it the way i want. Guess ill go back to the basics and relearn problem is I dont do any of this as work, but I do it for fun.
Thanks you for the input and here is the one query that started me down this rat hole

Code: Select all

SELECT LogDateTime, Temp as temp , round(avg(CASE WHEN month = 1 THEN temp END),2) January , round(avg(CASE WHEN month = 2 THEN temp END),2) Febuary , round(avg(CASE WHEN month = 3 THEN temp END),2) March , round(avg(CASE WHEN month = 4 THEN temp END),2) April , round(avg(CASE WHEN month = 5 THEN temp END),2) May , round(avg(CASE WHEN month = 6 THEN temp END),2) June , round(avg(CASE WHEN month = 7 THEN temp END),2) July , round(avg(CASE WHEN month = 8 THEN temp END),2) August , round(avg(CASE WHEN month = 9 THEN temp END),2) September , round(avg(CASE WHEN month = 10 THEN temp END),2) October , round(avg(CASE WHEN month = 11 THEN temp END),2) November , round(avg(CASE WHEN month = 12 THEN temp END),2) December FROM (SELECT Monthly.* , EXTRACT(YEAR FROM LogDateTime) year , EXTRACT(MONTH FROM LogDateTime) month FROM Monthly ) Monthly GROUP BY year

Also i do this on data i get from noaa ids and it does work better but it has separate columns for year, mouth, day
It seemed like a nice way to get average monthly temps

HansR · Post by **HansR** » Tue 12 Nov 2019 6:11 am

@experimentor: you got me to a tool which is new to me. Interesting! But very suspicious that you do not see meteorology among their fields of solutions. I am looking forward to your results of the experiment.

@jlmr731: thanks for the query, not sure on which data you run this, but if you run it only on dayfile and not on monthly log files - use those only if really necessary - you have a lot less records and probably the same results. The query itself ,I think, is going to determine the month first and then starts to loop over the month. Difficult to say from a distance, but I assume more than one (probably more) tablescan will be involved. You might start by simply entering the months as literals. There are only twelve and they are known. Why calculate with a query? And for analytical purposes you might want to extract only one (known) year and do some math on the speed (try to find a timing technique).

mcrossley · Post by **mcrossley** » Tue 12 Nov 2019 10:14 am

Jeff, I'm not 100% sure what you wanted from that query, but as Hans said, definitely use the day data table where ever possible. You already have an average temp per day in there, so you can average those values over a month.

So in MySQL something like this...

Code: Select all

SELECT
    YEAR(LogDate),
    MONTHNAME(LogDate),
    AVG(AvgTemp)
FROM daydata
GROUP BY YEAR(LogDate), MONTH(LogDate)

Which takes 23 ms on my Raspberry Pi with data since January 2009

jlmr731 · Post by **jlmr731** » Wed 13 Nov 2019 1:26 am

hans yes I run that on monthly and yes i know dayfile will give me same results but I had to see if it you work just a little slower than i thought but sometimes i do stuff cause its hard and i will learn something.

Mark yes and thanks

Both of you have been a bit of help so tahnks a bunch
Now if there was only a way to automatically record snowfall Yes we had a few inches today

ExperiMentor · Post by **ExperiMentor** » Thu 14 Nov 2019 12:49 am

HansR wrote: ↑Tue 12 Nov 2019 6:11 am @experimentor: you got me to a tool which is new to me. Interesting! But very suspicious that you do not see meteorology among their fields of solutions. I am looking forward to your results of the experiment.

Well, I had to rise to my own challenge!

Ignoring the warnings above, I started with 6 years of monthLog.txt files.Mainly at 10 minute intervals, but I seem to have a few months in 2019 with 1-minute intervals (someone must have fiddled

) and March 2016 is an empty file ... real world data this!

So I concatenated them (any order as the filenames aren't the cleverest - the rows are time stamped so Spotfire can sort that out

). I's 61MB and has 0.5 million rows.

Run Spotfire, 'Add data tables'. Spotfire noted European text format, comma separated, first column a date, second a time, and the rest a mix of integer and real numbers. Click OK
and it imported the whole lot in 5 seconds (as a contrast, my complicated multi-table data sets in my day job take several minutes to load). Manually adding sensible column labels took a little while. Now I can play.

From the date field, I'm automatically given options like Year, Quarter, Month, Day of month, Day of year. Weather varies a lot from start of a month to the end, and Day of year is too granular so I chose 'week of year' to please our Swedish friends. Drag and drop to plot a histogram of Max daily rain by week of year. Immediately see that some weeks have silly maxima (300 mm/day? Not here!). Knowing that high winds rattle the pole and cause false highs, it's easy to 'colour code by max wind speed', and YES that seem to explain some of the high values as shown below. Changed graphs plot instantly.

Weather plot.jpg

Then if I click on the bar for Week 7, select 'Details-on-Demand' a spreadsheet of the values plotted in that bar appears. Click on the column header to sort and find the highest values and I scan start to sort things out. I also notice That I haven't got the column headings right ... things to fix for another day ....

Proper work may follow ...

HansR · Post by **HansR** » Thu 14 Nov 2019 5:08 am

Interesting!

ExperiMentor wrote: ↑Thu 14 Nov 2019 12:49 am [...] and it imported the whole lot in 5 seconds (as a contrast, my complicated multi-table data sets in my day job take several minutes to load). [...]

I can hardly imagine Spotfire making a difference that big (5 secs to 3(?) minutes is 36 times as fast). There is fysical limit to reading a fysical file from disk / SD-card. I actually assume Spotfire reads a small part to satisfy you with some data and while you are busy looking at the first results it reads the rest on the background. Simple response trick to make customers believe.

Now that I write this, @mcrossley: that could be a solution while reading the monthly files for the edit functions

You could time the real read yourself by simply making a very short program opening the file and do a File.ReadAllLines. Time before/after and there you are. If that is done in 5 seconds, then Spotfire is as fast as can be. It also means your own method of reading is queued for improvement (is that SQL?)

ExperiMentor wrote: ↑Thu 14 Nov 2019 12:49 am Proper work may follow ...

Looking forward.
I may try something myself if I find the time.

ExperiMentor · Post by **ExperiMentor** » Thu 14 Nov 2019 6:25 am

HansR wrote: ↑Thu 14 Nov 2019 5:08 am
ExperiMentor wrote: ↑Thu 14 Nov 2019 12:49 am [...] and it imported the whole lot in 5 seconds (as a contrast, my complicated multi-table data sets in my day job take several minutes to load). [...]
I can hardly imagine Spotfire making a difference that big (5 secs to 3(?) minutes is 36 times as fast). There is fysical limit to reading a fysical file from disk / SD-card. I actually assume Spotfire reads a small part to satisfy you with some data and while you are busy looking at the first results it reads the rest on the background. Simple response trick to make customers believe.

Sorry, you have misunderstood me!
I was making an aside comment that my 0.5 million record single file (14 million datapoints) is handled by Spotfire much faster than the incredibly complicated database I use it on in my day-job. On the larger dataset (relational database) it's definitely doing pre-processing as it loads - linking items from different tables etc.

Our huge 'weather' dataset seems complicated to us, but it's trivial compared to a high-res photo. And yet it's not until I try to process every frame of a video that my PC refuses to even try to do it in real time.

Selectively pre-loading data definitely has its uses (eg, only need the first few pages of a Word document, or the low-then-increasing resolution as sequential passes of a JPG are loaded, but that is not the explanation here. I said later that the graph drawing is instant - and that histogram involves data from 4 columns and all 0.5 million rows. All rows x 27 columns must be larger than my 16GB RAM, so this must be being read then swapped back to disk. These Operating Systems do clever things under the bonnet.

mcrossley · Post by **mcrossley** » Thu 14 Nov 2019 1:17 pm

@hans, nice idea, but the records data needs to read all the data to get any results. The process is CPU bound on a rPi rather than IO bound. On my laptop the import and processing only takes a couple of seconds.

I tried importing the data into an in-memory sqlite db, but so far the import performance is even worse than then whole import and processing using flat files. If the data is already in MySQL (as it is on my rPi) then the processing is v. fast even on pi.

HansR · Post by **HansR** » Thu 14 Nov 2019 2:01 pm

mcrossley wrote: ↑Thu 14 Nov 2019 1:17 pm but the records data needs to read all the data to get any results.

True, should have thought about that.
And interesting SQL experiment.

Cumulus Support

MX and mysql data indexes

MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes

Re: MX and mysql data indexes