More servicesWindows Live
HomeHotmailSpacesOneCare
 
MSN
Sign in
 
 
Spaces home  Denny Lee's Windows Live...PhotosProfileFriendsMore Tools Explore the Spaces community

Denny Lee

View spaceSend a message
Occupation:
Location:
I am a PM for the Best Practices group in Microsoft's SQL Customer Advisor Team specializing in enterprise data warehousing, analysis services, and data mining. The postings are provided "AS IS" with no warranties, and confers no rights. (Sorry, gotta add the legal stuff, eh?!)
Links to DW reference links
Updated 9/24/2007
Updated 12/27/2006
Updated 10/20/2006
Updated 5/13/2006
Updated 12/3/2006
Updated 8/30/2006
Updated 12/29/2006
Updated 6/21/2005
Updated 6/23/2005

Denny Lee's Windows Live Space

SQL, Analysis Services, Data Mining, Data Warehousing, and Medical Informatics
July 01

Reporting Services Performance and Scalability Technical Notes

Over the next few months, I will be publishing a series of technical notes on sqlcat.com that provide guidance on how to build and deploy large scale SQL Server Reporting Services environments.  This series provides general guidance on how to set up, implement, and optimize an enterprise scale-out architecture for your SQL Server Reporting Services (SSRS) environment; these notes provide guidance for both SQL Server 2005 and 2008 Reporting Services.
 
  • Report Catalog Best Practices: Provides guidance and best practices on the report server catalogs—the underlying databases that provide metadata (parameters, snapshots, history, etc.) used by Reporting Services to provide your reports.
  • SSRS Scale-Out Deployment Best Practices [link provided when published]: Provides guidance and best practices on deployment details for scaling out your Reporting Services environment including configurations and the use of File System snapshots.
  • SSRS Performance Optimization Configurations [link provided when published]: Provides guidance and best practices on using specific Reporting Services features and configurations to optimize the performance of your Reporting Services environment.
  • SSRS Troubleshooting Tips [link provided when published]: Provides various troubleshooting tips that are helpful to better understand issues that may occur within your Reporting Services environment
June 02

Great Data Mining resource

To learn more information about  Data Mining, check out the Rafal Lukawiecki's TechNet Spotlight at: http://www.microsoft.com/emea/spotlight/event.aspx?id=99
 
Data Mining and Business Intelligence for Enterprises
In this one-day seminar, Rafal Lukawiecki aims to show IT Professionals how data mining can be used in IT infrastructure to support real business scenarios demystifying the perception that Data Mining is complex, untested or only for specialists. This has become possible since Microsoft has taken the technology to new levels making it accessible to all. In four easy-to-understand yet packed with practical information sessions you will learn about what Data Mining and Business Intelligence can do for you, how to deploy and manage it, how to use it, and how to make it available to other parts of your IT environment.
 
Enjoy!
April 17

Analysis Services Distinct Count Optimization

Finally, after about 7 months of analysis, customer validation, and just general randomization - I have finally propped up the Analysis Services Distinct Count Optimization whitepaper which is available here: http://sqlcat.com/whitepapers/archive/2008/04/17/analysis-services-distinct-count-optimization.aspx
 
Quick summary of this whitepaper:
Distinct count (such as unique visitor counts on a Web site) calculations provide valuable information but come with a number of performance challenges. This white paper describes tests that were performed to determine how best to optimize these calculations and includes best practices based on the test results.
 
 
February 03

Introducing SQLCAT.com

The SQL Customer Advisory Team has put the final touches on the site http://sqlcat.com/ which is our new hub for all things that are part of the SQLCAT!  Our new site is a portal to all SQLCAT resources including:

  • Top 10 Lists: Summary list of Best Practices and Recommendations
  • Technical Notes: Deep level technical short papers
  • Technical Spotlights: Technical end-to-end customer case studies
  • Searchable and Tagger friendly; post comments and provide feedback!
  • Search all of our SQLCAT Best Practices Whitepapers
  • Easily find our SQLCAT Blogs and other materials

The site is regularly updated with deep technical information on enterprise SQL implementations.

 

December 11

SQL 2008 Case Study on Auditing, DMF, Encryption, Performance, Resource Governor, and Reporting Services

One of the first case studies for SQL Server 2008 was developed by Caregroup Healthcare Group using various features of SQL Server 2008 for their 2TB patient information system.  The case study goes into their use of Advanced Data Auditing, Transparent Data Encryption, Declarative Management Framework, Performance Data Collection, Resource Governor, and Reporting Services to provide a high performance Patient Billing and Lab Results database.  For more information, please refer to the Caregroup SQL Server 2008 case study.
 
 
November 07

Correction to our PASS Reporting Services Slides

As some of you may or may not know, Lukasz Pawlowski and I had presented the deck Building & Deploying Large Scale SSRS farms using Lessons Learned from Customer Deployments (DBA-416-M) at the 2007 SQL PASS conference.  In one of our slides, I had indicated inexplictly that "RS runs only on x64 only".  I must have been guilty of copying/pasting from some other slides because RS 2005 indeed does work on IA64 (there are many customers that can attest to that).  In general, x64 is a more common deployment for RS because it is often less expensive (and IA64 being reserved for SQL).  Saying this, yes, RS2005 does work on IA64.
 
If you want to know more information about Reporting Services, please don't forget to head to Lukasz's excellent blog Musings on Reporting Services and Notification Services.
 
 
November 02

Analysis Services: For Distinct Count measure NULL = 0

If you are to look at the table of values where you want to do a distinct count on the ID column

ID ColA
NULL blah
NULL blah
0 blah
3 blah
1 blah
1 blah
2 blah
2 blah

Within SQL, you will get a value of 4 in which there are four distinct values, 0, 1, 2, 3 and NULL is not counted.

But within Analysis Services, if you were to put a distinct count measure on top of the ID column, you would also get a value of 4 for four distinct values, but those distinct values are 1, 2, 3, and 0.  In this situation, NULL and 0 are the same value.  This is a little bit more apparent if your table had the values of

ID ColA
1 blah
2 blah
2 blah
3 blah
NULL blah

Within SQL, you will get a value of 3 (distinct values of 1, 2, and 3) but within AS, you will get a value of 4 (distinct values of 1, 2, 3, and NULL or 0).  There are various philosophical reasons why this is happening ranging from the fact that there are actually quite a few definitions for what NULL is (I really don't want to get into that) and within AS we're looking at things from a multi-dimensional point of view (i.e. from the dimension members) hence the exclusion of a NULL value implies exclusion of the rows.  Regardless of the philosophical point of view on this, one should just be aware of this when working with their distinct count measures.

Oh note, thanks to John Lam for reminding me that if you turn on the NullProcessing property it will preserve the NULL value - i.e. the first table will have 5 distinct values of 0, 1, 2, 3, and NULL.

 

October 22

CMU/MSR Mindswap on Privacy

As some of you may have noticed in some of my more recent blog postings, I have started writing about my interest in the field of privacy.  With all that is going on these days with tonnes of information floating on the web and the ability to join disparate data sources together to reveal interesting patterns (or individuals) - it will become imperative that we find ways to guarantee privacy of the individual if we are ever to find interesting patterns within our data.
 
The example that I had used with my most recent blog on the subject of privacy and Channel 8 interview was that of data mining in medical research.  In this specific example, I had noted that I had performed data mining (using Analysis Services of course) against phenotypic (i.e. not genetic) information of Asthmatic patients.  An interesting tid bit I like quoting from this research is that if an asthmatic patient misses their regular scheduled appointments, s/he may have 2x times greater chance of suffering from an asthmatic adverse event (death, hospitalization, ER visit).  Now, to give context, the number of adverse events were quite small to begin with and the subsequent analysis can explain that this most likely has to do with the fact that people who didn't skip their visits were typically more proactive to their health instead of reactive (i.e. would not wait until they were really sick before getting a checkup or visit to prevent their asthma from getting worse). 
 
One of the issues that I had while I was doing this analysis was that of the issues revolving around privacy.  I (rightly so) had to go through a lot of hoops in order to make sure that I had the data within a safe environment and the analysis I was doing would not reveal who an individual was from the medical records.  Saying this, the Analyzing Data while Protecting Privacy - A Case Study blog is one of the various postings that I have written (and will continue to write) on the subject.  
 
But what it comes down to is these theme: It will not be possible for us to perform interesting analytics (e.g. data mining, machine learning, etc.) to help patients (e.g. build a concept of prognostics) unless we can guarantee the privacy of the patient. 
 
Saying this, what does this have to do CMU/MSR Mindswap on Privacy as per the title of this blog?  Well, if you refer to the linked blogs, you'll notice that the concepts of differential privacy (also termed in the blog as privacy preserving data analysis) are based on the research from MSR (Microsoft Research).  As well, you may notice that I reference some examples of privacy from Latanya Sweeney whom is from CMU (Carnagie Mellon University).  Last week, the Center for Computational Thinking at CMU had setup a mindswap event which combines some of the top researchers from both institutions to get together and swap information on the advancing field of research into privacy.  You can find out more information by referring to the link with powerpoint slide references at: CMU/MSR Mindswap on Privacy.
 

Scalable Shared Databases

There have been some questions that have come to my attention concerning the concept of Scalable Shared Databases and comparing it within the context of Oracle RAC.    If you want to read the SSD vs. Oracle RAC comparison, you can refer to the link Comparing SQL Server 2005 and Oracle RAC.  As well, you can also refer to the SQL Server Books Online content on the subject of Scalable Shared Databases on the subject.
 
Saying this, the purpose of Scalable Shared Databases is to make use of SAN technology in order to allow you to attach a single read-only database to multiple server instances so that way you can have scale out many concurrent queries of your SQL database to the aforementioned multiple server instances.    We have versions of this approach for Analysis Services one specific to using SANs (Scale-Out Querying with Analysis Services using SAN Snapshots) and one not specific to the use of SANs (Scale-Out Querying with Analysis Services).  Saying this, you could always rename the concept of Scalable Shared Databases as Scale-Out Querying with SQL Server
 
But I digress, extracted directly from the technical (hence most important) link is the Scalable Shared Databases are supported by SQL Server 2005 from Microsoft Support:
Scalable shared databases
Scalable shared databases let you attach a read-only reporting database to multiple server instances over a storage area network (SAN). A reporting database is a read-only database that is built from one or more production databases that are used exclusively for reporting purposes. To be made into a scalable shared database, a reporting database must reside on one or more dedicated read-only volumes. The primary purpose of these read-only volumes is to host the reporting database or a coordinated set of reporting databases. These volumes are known as reporting volumes.

Benefits
Scalable shared databases offer the following benefits: • Provide workload scale-out of reporting databases by using commodity servers. A scalable shared database is a cost-effective way of making read-only data marts or data warehouses available to multiple server instances for reporting purposes, such as running queries or using SQL Server 2005 Reporting Services. 
• Provide workload isolation. Each server uses its own memory, CPU, and tempdb database. 
• Guarantee the same view of reporting data from all servers if all the server instances are configured identically. For example, all servers would use a single collation.
Note Optionally, you can update the reporting database on a second reporting volume. For more information, see the "Maximize the availability of a scalable shared database" section.
 
Restrictions
The following restrictions exist for a scalable shared database: • The database must be on a read-only volume.
• The data files can be accessed over an SAN. 
• Scalable shared databases are supported only on Microsoft Windows Server 2003 Service Pack 1 (SP1) or a later version of Windows Server 2003.
For the technical details, please refer to Scalable Shared Databases are supported by SQL Server 2005 
 
October 19

Using ByAttribute or ByTable Processing Group Property with Analysis Services 2005

For any of you whom are considering using the ByTable (vs. ByAttribute) Processing Group property, I've propped up a blog called Using ByAttribute or ByTable Processing Group Property with Analysis Services 2005 with help from Richard Tkachuk, Akshai Mirchandani, and Eric Jacobsen.  Basically, while using ByTable can at times be helpful, you really need to be careful on how you use it.  Read the blog for more info, eh?! 
 
 
October 12

Incorrect Impersonation Information may crash or hang Analysis Services

If you are processing your AS cube using impersonation mode, if you incorrectly place the impersonation information as MYDOMAIN/UserName instead of MYDOMAIN\UserName, you may crash or hang Analysis Services.  A quick cursory examination noted that with Sybase and DB2 relational data sources, AS ended up crashing.  If SQL Server or Oracle is your relational data source, then the processing is exceedingly slow to the point where the server was hanging; in the end eventually failing.

 

Thanks to John Lam, Nicholas Birke, and Wei Zhang

 

 

October 09

Precision Considerations for Analysis Services Users Whitepaper now available

For those of you whom want to know a little more about precision considerations for Analysis Services users, please check out this recently released whitepaper at http://www.microsoft.com/downloads/details.aspx?familyid=bae8beec-9892-4ecd-a9db-292254895f9c&displaylang=en.

This white paper covers accuracy and precision considerations in SQL Server 2005 Analysis Services. For example, it is possible to query Analysis Services with similar queries and obtain two different answers. While this appears to be a bug, it actually is due to the fact that Analysis Services caches query results and the imprecision that is associated with approximate data types. This white paper discusses how these issues manifest themselves, why they occur, and best practices to minimize their effect.

Authors: Denny Lee and Eric Jacobsen

 

September 28

Data Mining applied to Medical Research

As a follow up to my blog below concerning analyzing medical data and privacy, check out my channel8.msdn.com interview concerning the application of data mining to medical research and the reason why privacy is so important.
 
 
September 24

Analyzing Data while Protecting Privacy - A Case Study

Abstract

Analysis and sharing of aggregate data (e.g., number of users whose favorite color is blue) is crucial to understanding patterns of the population being studied. The problem is that even summary data can expose the individuals who make up this information. Therefore, we are studying the use of privacy preserving data analysis techniques that will protect the individual and allow analysts to understand general trends. We believe that while there are unique perturbations of the data with these techniques, analysts will be able to use the data and protect the users at the same time. The case study involves applying these algorithms to MSN reporting data. One group focused on event usage (e.g., number of page views on Moneycentral.com) while another group focused on usage data (e.g., the number and frequency of users who use MSN Messenger). After four months of analysis, the first group did not like the effects of privacy preserving data analysis. In contract, the second group is prepared to use this type of reporting in production. Further discussions and inquiries revealed that the first group had perceived data trust issues that conflicted with use of the algorithms. But, the second group had no trust issues with the data and was very familiar with privacy issues, hence the successful outcome. Therefore, if users trust the data it is possible to use this privacy preserving algorithm to allow the analysis of data while protecting privacy.

 

Introduction

Based on an analysis of the 1990 US Census, 87% of the United States population is uniquely identifiable by the three attributes of zip code, date of birth, and gender [1]. Latanya Sweeney’s breakthrough research with the Governor William Weld case clearly depicts that the masking or hiding of publicly available information does not adequately protect the privacy of an individual. Sweeney was able to obtain masked medical data from the Group Insurance Commission of Massachusetts whom is responsible for state employees’ health insurance. The masked medical data contained only non-identifiable information such as ethnicity, visit date, diagnosis, procedure, medication, total charge, zip code, date of birth, and gender. For a cost of $20, Sweeney then obtained the publicly available Cambridge, MA voter list. Referring to the Venn diagram below, each circle graphically represents the two data sources and by itself reveals little information. The masked medical data on the left provides only medical information where only gender, date of birth, and zip code were identified. The voter list on the right provides personal information but limited only to voting information, address, date of birth, and gender. But recall that 87% of the US population is uniquely identifiable by only the three attributes of zip code, date of birth, and gender. In the case of the Massachusetts governor, there were only six people in Cambridge, MA whom had the same date of birth and only three of them were men. Of the men remaining, he was the only person to live in his particular zip code. Through these two pieces of information joined only by gender, date of birth, and zip code; Sweeney was able to identify and reveal the medical records of then Governor William Weld.

image

Figure 1: Unmasking of masked medical data

As one can see, this particular scenario stresses the inadequacies of ensuring privacy by masking medical data; it will require us to change our perceptions of what data privacy really entails. It takes only a small amount of information to reveal the individual underneath the data.

Many current privacy solutions involve the concept of masking the data such that personally identifiable information remains hidden. Regulatory bodies such as HIPAA provide very explicit instructions on what information can and cannot be revealed in different scenarios [2]. Businesses such as Microsoft® provide detailed statements (e.g., “Microsoft Online Privacy Statement”) detailing the personal information collected, uses of the information, and the choice to opt-out [3]. Academic institutions, such as Oregon Health & Science University, have requirements in addition to HIPAA standards staging the permitted uses and disclosures associated with protected health information [4]. Yet, as one can see from the William Weld case, the masking and explicit statements of use and disclosures may not necessarily insure privacy. There are many other masking solutions and/or perturbations of data as can be seen in J.C. Cannon and A. Cavoukian’s book “Privacy: What Developers and IT Professionals Should Know” [5]. But the potential weakness of these approaches is that the solutions may end up being too complicated to implement or that the modification of data will result in changing the meaning of the data. This latter point is especially important because of its affect on analysis and research. If statistical analysis ranging from mean/median to advanced data mining techniques result in inconsistent values, the results and conclusions based on these statistics may be incorrect.

How is it then possible to protect an individual’s privacy in a relatively simple manner while ensuring consistent statistical analysis for research purposes? This is the purpose of “Privacy Preserving Data Analysis” which is based on the work of Microsoft Research (MSR) researchers Cynthia Dwork and Frank McSherry (as well as S Chawla, K Talwar, A Blum, K Nissim, and A Smith) [6, 7, 8]. The basic premise of their research is that the addition of noise based on the exponential distribution to the data will be able to protect the individuals underneath the aggregate data. Recall that to attack aggregate data, one need only to ask enough questions to drill down to a specific individual with very distinct attributes. In the Governor Weld case, the first question would be the number of people who had his date of birth (answer is 6). The next question is number of those people who were male (answer is 3). The final question is the number of those people who lived in his 5-digit zip code (answer is 1). In this example, it required only three questions for the author, Sweeney, to drill down to Governor Weld and his medical records [6]. But, if we were to add exponential noise (some integer value between –∞ to +∞) to the consistently change the above values, it would be impossible to drill down to this one person. For example, if the noise values to be applied to these questions were that of (-2, +1, +6), then the results of the above questions would be:

 

Question

Original Answer

Noise Values

PPH-applied Answer

How many users with Governor Weld’s date of birth

6

-2

4

How many of those users are male

3

+1

4

How many of those users are in Governor Weld’s Zip Code

1

+6

7

When looking at the PPH-applied answers, the additional noise has made it impossible to drill down to the single individual, therefore the data is “privacy preserving”. In this particular example, the noise is too large but in the methods section we describe how to better calibrate the noise. Ultimately, the key is to add enough noise to the system so that the answer changes but the overall statistics do not. To prevent attacks to the noise algorithm itself, more noise needs to be added as more questions are asked. The background section of this paper provides more details on how this works. The efficacy of this methodology is not in question for this case study as this has already been amply researched and verified. Therefore, the purpose of this case study is to determine the applicability of this noise algorithm in an actual reporting environment.

Background

The privacy preserving data analysis concepts involve the addition of noise to the data in order to protect the individuals underneath the data. This case study involved the use of the privacy preserving histogram (PPH), which uses exponentially-distributed noise (the algorithm is better described in the methods section). These concepts are based on the research in “Practical Privacy: The SuLQ Framework” [7]. The next few background sections describe both the sanitization concept and the privacy mechanism theorem associated with it.

 

Sanitization Concept

To protect the individuals comprising the data, we will mask these individuals within the data by creating a sanitization point between the user interface and the data.

clip_image004

Figure 2: Sanitization Concept

The left green shape on the left represents the database while the right-most circle represents the answer or result set provided to the user interface. The middle section is the interactive sanitizer, defined К, where it introduces random noise to produce uncertainty, hence privacy. Note the magnitude of the noise is given by the theorem:

If many queries f1, f2, … are to be made, noise proportional to Σ­iΔfi suffices. For many sequences, we can often use less noise than Σ­iΔfi . Note that Δ Histogram = 1, independent of number of cells

The sanitizer requires the creation of a carefully detailed algorithm and will be discussed in more detail in the differential privacy section below. A safe answer on the amount of noise to apply is that of standard deviation equal to the total number of queries. If the number of queries is not known, then the standard deviation can be proportional to the square of the queries asked so far. As for the noise itself, it should be newly seeded each time a query is applied.

By doing this, this algorithm will be able to address all attacks. Consequently, for each person, the increase in probability of the individual being attacked (or anyone else for that matter) due to the contribution of their data is nominal. The example given is foiled for two reasons: a) the addition of noise will (formally) complicate the polynomial reconstruction and b) the number of queries is limited by the degree of privacy guaranteed, and N is generally going to be way too many queries.

Mathematically, at this sanitization point, the PPH will apply of a little bit of noise within each cell of the histogram. Descriptively, if one’s result set from the database is a report with a set of rows and columns, for each value within each row and column cell a small bit of error is added to the original number. Provided that the sanitization point can limit the number of questions being asked and/or add more noise as more questions are being asked, then the PPH can guarantee the privacy of the individuals that make up these aggregates.

 

Differential Privacy

Our privacy mechanism, К, gives ε-differential privacy for all transcripts t, all databases DB, and all data items (rows in the DB) Me, the ratio

clip_image006

Figure 3: Differential Privacy Formulae

The differential privacy theorem is a simple mathematical formula that describes the fact that nothing more can be learned about an individual when her information is in the DB (DB+Me) than when it is not in the DB (DB-Me). This is important from the context of joining one set of data that is in the DB (e.g., the masked medical data) with another set of data that is not in the DB (e.g., Cambridge, MA voter list). If there is no noticeable difference between f(DB-Me) and f(DB+Me) then there is no perceptible risk by joining the two data sets together. Therefore, to achieve this differential privacy, we will need to add scaled symmetric noise exp(-|x| ε/ Δf).