SQL (3) Research questions, databases, and analytics; Importing data, exporting data, using other tools Information Structures and Implications 2015 Bettina Berendt Last updated: 2015-10-30 1 Where are we? 2 Agenda 1. 2. 3. 4. 5. 6. Our goal: answer interesting questions Changing databases – a design view Importing, and more on combining data Creating analytics and storing their values Exporting data Putting it all together: From goal to flowchart of data and processing steps 7. Preview: Database connectivity – (Python and other) programs and databases 3 How many parliamentarians does each country have? Number of Parliamentarians 500 450 400 350 300 250 200 150 100 50 0 4 How long are political functions held, on average? Average time the position was held, in years lp:Role/chair_of_the_bureau lp:Role/vice-chair/member_of_the_bureau lp:Role/secretary_to_the_bureau lp:Role/deputy_chair lp:Role/substitute lp:Role/co-chair lp:Role/vice-chair lp:Role/chair lp:Role/member lp:Role/treasurer lp:Role/deputy_treasurer 0 1 2 3 4 5 6 5 How often do countries vote for/against things? (Note: artificial data!) 100% 90% 80% 70% 60% no 50% yes 40% 30% 20% 10% 0% IE PT GR DE ES FR GB BE DK IT LU NL 6 Is there a relation between length of time in office and age? 35000 Age at start of term (in days) 30000 25000 20000 15000 10000 5000 0 0 1000 2000 3000 4000 5000 6000 Time in office (in days) 7 Agenda 1. 2. 3. 4. 5. 6. Our goal: answer interesting questions Changing databases – a design view Importing, and more on combining data Creating analytics and storing their values Exporting data Putting it all together: From goal to flowchart of data and processing steps 7. Preview: Database connectivity – (Python and other) programs and databases 8 Incremental changes to databases • Let us see how we can add information to an existing database. • Let us modify – The conceptual model (EER) – The logical model (relations) – The physical model (database) in turn 9 The diagram 10 Assume we have voting data • Just some examples of real EU voting data – http://www.elprg.eu/data.htm – http://personal.lse.ac.uk/hix/ (overview, link to the next one) – http://www.votewatch.eu/ – http://www.itsyourparliament.eu/api/ • For simplicity, assume we have a CSV file – If it‘s a different format, need some more transformation • For simplicity, I generated random data 11 Artifical voting data (votes2.csv: 504 votes) 12 Agenda 1. 2. 3. 4. 5. 6. Our goal: answer interesting questions Changing databases – a design view Importing, and more on combining data Creating analytics and storing their values Exporting data Putting it all together: From goal to flowchart of data and processing steps 7. Preview: Database connectivity – (Python and other) programs and databases 13 Adding these data to the database: (1) Creating a new table Table a_votes (Missing: primary and foreign keys) 14 Adding these data to the database: (2) Importing the data into the table LOAD DATA INFILE 'C:\\Users\\kurt\\Documents\\Lehre\\ISI15\\Session 7 - SQL3\\votes2.csv' INTO TABLE a_votes FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n' (Note: The file path specification is different on Mac.) 15 Note • LOAD DATA INFILE is of course not only useful for adding data to an existing database. • You could also build a database from scratch in this way. 16 Linking the new to the old data (just another join) 17 Scenario 2 (more common in real life): Our new data do not have the same key information as the old data 18 New table & data import for scenario 2 Table a_votes2 LOAD DATA INFILE 'C:\\Users\\kurt\\Documents\\Lehr e\\ISI15\\Session 7 SQL3\\artificial_votes2.txt' INTO TABLE a_votes2 FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' (Missing: primary and foreign keys) 19 Sample data for scenario 2 20 Linking the new to the old data (record linkage – not necessarily via the primary keys) 21 What are the risks and opportunities of scenario 2? 22 Agenda 1. 2. 3. 4. 5. 6. Our goal: answer interesting questions Changing databases – a design view Importing, and more on combining data Creating analytics and storing their values Exporting data Putting it all together: From goal to flowchart of data and processing steps 7. Preview: Database connectivity – (Python and other) programs and databases 23 How many parliamentarians does each country have? (Order the result by country name) 24 For how long do parliamentarians hold a political function? (in days) 21,852 rows … http://dev.mysql.com/doc/refman/5.7/en/date-and-time-functions.html 25 How long are positions held, on average? 26 OK, minus and AVG and COUNTs are fine, but what about more complex measures? • For example, is there a relation between – length of time in office – and age? (do older parliamentarians stay longer in office than younger people, or vice versa)? • You could investigate this hypothesis with the help of the Pearson correlation coefficient 27 Pearson correlation in SQL (1) https://www.vanheusden.com/misc/pearson.php 28 Pearson correlation in SQL (2) SELECT user1, user2, ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r, n FROM (SELECT n1.user AS user1, n2.user AS user2, SUM(n1.rating) AS sum1, SUM(n2.rating) AS sum2, SUM(n1.rating * n1.rating) AS sum1sq, SUM(n2.rating * n2.rating) AS sum2sq, SUM(n1.rating * n2.rating) AS psum, COUNT(*) AS n FROM testdata AS n1 LEFT JOIN testdata AS n2 ON n1.movie = n2.movie WHERE n1.user > n2.user GROUP BY n1.user, n2.user) AS step1 ORDER BY r DESC, n DESC Don‘t worry, you will probably never have to do such a thing ... https://www.vanheusden.com/misc/pearson.php 29 A general question: Can you compute anything in SQL? = Can you compute anything that can be computed (by a programming language such as python)? • In principle, yes (Theoretical result about Turing equivalence: cf. http://stackoverflow.com/questions/900055/is-sql-or-even-tsql-turing-complete) • So what do you need (e.g. python) programs and other software for? 30 Answer: For example, to calculate your analytics in more comfortable ways Excel makes it very easy to calculate a correlation 1. Create (in SQL) one or more tables with the information 2. Export to CSV 3. Import/Load into Excel 4. Calculate the correlation coefficient there 31 Answer (2): or for generating a chart 1. Create (in SQL) one or more tables with the information 2. Export to CSV 3. Import/Load into Excel 4. Create a chart 32 Agenda 1. 2. 3. 4. 5. 6. Our goal: answer interesting questions Changing databases – a design view Importing, and more on combining data Creating analytics and storing their values Exporting data Putting it all together: From goal to flowchart of data and processing steps 7. Preview: Database connectivity – (Python and other) programs and databases 33 How many parliamentarians does each country have? (1) SELECT name, count( * ) INTO OUTFILE 'C:\\Users\\kurt\\Documents\\Lehre\\ISI15\\Session 7 - SQL3\\countries_parliamentarians.csv' FIELDS TERMINATED BY ‘;' LINES TERMINATED BY '\n' FROM represents, country WHERE represents.countryacronym = country.acronym GROUP BY countryacronym ORDER BY name 34 How many parliamentarians does each country have? (2) Number of Parliamentarians 500 450 400 350 300 250 200 150 100 50 0 35 How long are political functions held, on average? Average time the position was held, in years lp:Role/chair_of_the_bureau lp:Role/vice-chair/member_of_the_bureau lp:Role/secretary_to_the_bureau lp:Role/deputy_chair lp:Role/substitute lp:Role/co-chair lp:Role/vice-chair lp:Role/chair lp:Role/member lp:Role/treasurer lp:Role/deputy_treasurer 0 1 2 3 4 5 6 36 Is there a relation between length of time in office and age? 37 TimeInOffice / age (1): Option 1: export the new table directly SELECT datediff( End_date , Start_date) , datediff( Start_date, date_of_birth ) INTO OUTFILE 'C:\\Users\\kurt\\Documents\\Lehre\\ISI15\\Sessio n 7 - SQL3\\time2age.csv' FIELDS TERMINATED BY ‘;' LINES TERMINATED BY '\n' FROM parliament_member, in_political_function WHERE parliament_member.MEP_ID= in_political_function.MEP_ID 38 TimeInOffice / age (2): And then compute the correlation with Excel... 39 TimeInOffice / age (3): Option 2: Create a new table in the database (which you can later export) CREATE TABLE time_in_office2age SELECT datediff( End_date , Start_date`) , datediff( Start_date`, date_of_birth ) FROM parliament_member, `in_political_function` WHERE parliament_member.`MEP_ID` = `in_political_function`.`MEP_ID` 40 How often do countries vote for/against things? (1) Basic queries (combining these into one query is a bit tricky, so I recommend to query and export this separately): number of YESs grouped by country, number of Nos grouped by country SELECT countryacronym, count( * ) FROM parliament_member, represents, a_votes WHERE parliament_member.MEP_ID = represents.MEP_ID AND parliament_member.MEP_ID = a_votes.MEP_ID AND member_vote LIKE ‘yes%' GROUP BY countryacronym ORDER BY countryacronym SELECT countryacronym, count( * ) FROM parliament_member, represents, a_votes WHERE parliament_member.MEP_ID = represents.MEP_ID AND parliament_member.MEP_ID = a_votes.MEP_ID AND member_vote LIKE 'no%' GROUP BY countryacronym ORDER BY countryacronym 41 How often do countries vote for/against things? (2) 100% 90% 80% 70% 60% no 50% yes 40% 30% 20% 10% 0% IE PT GR DE ES FR GB BE DK IT LU NL 42 Agenda 1. 2. 3. 4. 5. 6. Our goal: answer interesting questions Changing databases – a design view Importing, and more on combining data Creating analytics and storing their values Exporting data Putting it all together: From goal to flowchart of data and processing steps 7. Preview: Database connectivity – (Python and other) programs and databases 43 What data and operations to answer our research question? EUP database Role to duration (CSV) Import SQL query + export Role to duration (XLS) Excel command Average time the position was… lp:Role/secreta… lp:Role/vice-chair lp:Role/deputy… 0 2 4 6 44 What data and operations to answer our research question? EUP database Import Voting data (CSV) Y/N Votes by Country (CSV) Import SQL query + export Y/N Votes by Country (XLS) Excel command 100% 80% 60% no 40% yes 20% 0% IE GR ES GBDK LU 45 Agenda 1. 2. 3. 4. 5. 6. Our goal: answer interesting questions Changing databases – a design view Importing, and more on combining data Creating analytics and storing their values Exporting data Putting it all together: From goal to flowchart of data and processing steps 7. Preview: Database connectivity – (Python and other) programs and databases 46 Python and other programs can ... ... access databases: • “import“ data from the database while the program is running • compute something with it • “export“ (write something) to the database • Show selected database content to users • Ask for their input • Do something accordingly • Examples? E.g. Web search engines, e-Commerce sites, ... • Mechanics? See later in the term, Scripting Languages! 47 Next 3 weeks • Continuing this • Bringing in text analytics – How long are speeches on average, by country? – Do people from different countries use certain words/terms more often than others? – ... 48 Reading • For details of all commands, see the MySQL documentation: http://dev.mysql.com/doc/refman/5.7/en/ 49