Data science is revolutionizing the world around us. It is an interdisciplinary field, combining computer science, mathematics and statistics, and domain knowledge, that seeks to derive insight from data; data science sits at the intersection of these three disciplines. Another way to think about it is as the intersection of data engineering and the scientific method: with data science, we use large-scale data systems to drive the scientific method. The goal of data science is to transform data into knowledge, knowledge that can be used to make rational decisions so that we can take actions that help us achieve our goals. We refer to this process as transforming data into actionable insight.
Data Science: Solving Problems in Various Sectors
Data Science methods and tools can solve some of the world's greatest challenges in sectors including:
- Defense and national security
- Medicine and health
- Imaging and optics
- Energy and the environment
- Food and agriculture
- Economics and finance
Research Challenges in Data Science
- Storing and processing the terabytes and petabytes of data generated each day;
- Almost every discipline faces big data analysis problems, including the medical sciences, life sciences, bioinformatics, law, civil engineering, and government;
- Data comes in different forms, such as free text, structured data, audio/video, images;
- Analysis tasks performed over the data are becoming more and more sophisticated;
- High-performance computing platforms are advancing fast (e.g., cloud computing, parallel computing, multi-core machines, GPUs, mobile computing);
- Communication and feedback need to be established between machines, algorithms, and people.
Skills of Data Science
In general, the skills commonly associated with data science are:
- programming with languages such as SQL, Python, and R;
- working with data: collecting, cleaning, and transforming it;
- creating and interpreting descriptive statistics, that is, numerically analyzing data;
- creating and interpreting data visualizations, that is, visually analyzing data;
- creating statistical models and using them for statistical inference, hypothesis testing, and prediction;
- handling big data: data sets whose volume, velocity, or variety exceed the limits of conventional computing architectures;
- automating decision-making processes with machine learning algorithms;
- deploying data science solutions into production and communicating results to a wider audience.
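As a small illustration of the descriptive-statistics skill mentioned above, here is a minimal Python sketch; the visit counts are made-up data:

```python
import statistics

# A small, hypothetical sample of daily website visits.
visits = [120, 135, 128, 150, 90, 160, 145, 132, 138, 125]

# Basic descriptive statistics: numerically summarizing the data.
mean = statistics.mean(visits)      # central tendency
median = statistics.median(visits)  # robust central tendency
stdev = statistics.stdev(visits)    # sample standard deviation (spread)

print(f"mean={mean:.1f}, median={median:.1f}, stdev={stdev:.1f}")
```

Comparing the mean and median already tells us something: the low outlier (90) pulls the mean slightly below the median, a first hint about the shape of the data.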
Data Science Ecosystem
Algorithms for Data Science
- Methods for organizing data, e.g. hashing, trees, queues, lists, priority queues.
- Streaming algorithms for computing statistics on the data. Sorting and searching.
- Basic graph models and algorithms for searching, shortest paths, and matching. Dynamic programming. Linear and convex programming.
- Floating-point arithmetic, stability of numerical algorithms, eigenvalues, singular values, PCA, gradient descent, stochastic gradient descent, and block coordinate descent. Conjugate gradient, Newton, and quasi-Newton methods.
- Large-scale applications from signal processing, collaborative filtering, recommendation systems, etc.
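As an illustration of the optimization methods listed above, here is a minimal pure-Python sketch of gradient descent fitting a least-squares line; the data, learning rate, and iteration count are illustrative choices:

```python
# Gradient descent for simple least squares: fit y ≈ a*x + b
# by repeatedly stepping against the gradient of the mean squared error.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # synthetic data: exactly y = 2x + 1

a, b = 0.0, 0.0   # initial parameters
lr = 0.02         # learning rate (step size)
n = len(xs)

for _ in range(5000):
    # Gradient of the mean squared error with respect to a and b.
    grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
    a -= lr * grad_a
    b -= lr * grad_b

print(f"a ≈ {a:.3f}, b ≈ {b:.3f}")  # converges toward a = 2, b = 1
```

Stochastic gradient descent differs only in that each step uses the gradient on a single (or small batch of) randomly chosen data point(s) rather than the whole data set, which is what makes it practical at large scale.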
The past decade has witnessed the emergence of a new era of interdisciplinary sciences with statistics playing a pivotal role, especially in the biological sciences. This trend will continue into the foreseeable future, and it is therefore extremely important for Indian institutes to train and develop the next generation of statisticians, mathematicians, and computer scientists with a focus on applications to the biological sciences. It is most fitting that an institute has been established in honour of Professor C. R. Rao, who is regarded worldwide as one of the greatest statisticians and scientists of our time and whose contributions to statistical theory and practice are fundamental to modern science and technology. It is very timely to create the Centre for Computational Genomics at this institute. Such a centre would provide much-needed impetus for cutting-edge research and HRD in Indian biosciences in general and the statistical/mathematical study of genomics in particular.
In simple terms, genomics is the study of the genome of an organism. In recent years, the genomes of several important species, including humans, have been sequenced. This has led to a flood of research worldwide to address key problems in disease, public health, plant/animal/micro-biology, and the environment using the vast pool of genomic information. As a result, several important areas of high-throughput ‘omic’ research have emerged, such as functional genomics, which includes the study of genome-wide expression in an organism with microarray platforms, as well as other fields like proteomics, metabolomics, cancer genomics, etc. A key feature shared by these new fields is the high-throughput nature of the experimental investigation in each, and therefore their central dependence on statistical data analysis.
These new areas have now developed into distinct disciplines in their own right. The centre would therefore develop expertise in these core fields. To begin with, at the very minimum, each field would require one senior faculty member (at the rank of tenure-track Assistant Professor) with the potential for an excellent publication record. The main criterion for tenure and promotion would be a strong record of research publications in high-impact journals.
The center would also provide adjunct appointments to outstanding researchers in other institutions of academic excellence and to interested faculty members from other academic units within University of Hyderabad. These adjunct professors shall co-supervise and mentor doctoral and post-doctoral fellows. Such adjunct appointments together with the permanent core faculty, described above, would provide the desired initial intellectual critical mass which can be expanded by adding more positions as the disciplines evolve over time. Although the center will not pay salaries to any of the adjunct faculty, where necessary, it will cover all local and travel expenses to meet with the students and permanent faculty. Adjunct appointments are not only cost effective but they also help foster cross-pollination of ideas across disciplines and institutions.
Distinguished visiting scholars
The centre would encourage researchers (within India and abroad) to spend their sabbaticals at the centre, thus fostering greater communication and collaboration with international experts. Guest-house facilities would be provided to all visiting scholars. Further, on a case-by-case basis, some of the visiting scholars would be financially supported.
Given the fast-paced and highly inter-disciplinary nature of present genomics research, the contribution of adjunct faculty members and distinguished visiting scholars towards the excellence of the centre is likely to be of great importance.
Ph.D. students and post-doctoral fellows
Each year the centre would admit 4 to 6 Ph.D. students and 2 to 3 post-doctoral fellows on the basis of national competitive exams and interviews, using a process similar to those at elite institutions such as ISI, IISc, IIT, etc. Qualified foreign nationals with doctorates and a proven international publication record may also be considered for postdoctoral fellowships. The admitted Ph.D. students and post-doctoral fellows would receive a suitable monthly stipend. The students and post-doctoral fellows admitted to the centre would be required to take suitable courses in statistics, computer science, and biology (offered by AIMSCS and the University of Hyderabad). Further, they would be required to submit, on average, four papers per year to international peer-reviewed journals.
Modern Computer lab and other facilities
The proposed centre would have a modern computer lab equipped with a desktop computer for each student (with access to color printers) and a high-speed internet connection. Further, since the analysis of genomic data requires processing very large data sets with computationally intensive methods, the centre would seek access to supercomputing facilities at the University of Hyderabad and other institutes in the city, and have at least one Linux cluster with a sufficient number of nodes with parallel processing capabilities. The computer lab must be staffed by at least two IT professionals who shall maintain and update all relevant hardware, software, and databases. The computer lab and the seminar/class rooms should be equipped with state-of-the-art multimedia useful for lecturing and conferencing purposes. Each faculty member would have a high-speed desktop computer in their office with access to color printers.
Library and on-line publications
Students and faculty would have access to the latest issues of high-profile national and international journals and periodicals (and the corresponding on-line subscriptions). Major book titles and conference proceedings should be available. Further, the centre should team up with other institutions in the country to form an inter-library loan system, as is done in the US.
A great amount of planning and design is needed to develop a centre for a subject that is multi-disciplinary and rich in potential. Attempts will be made to foster collaboration with some of the leading genomics institutes in India and abroad and to take advantage of genomic research that is already well developed, in order to prepare a group of young researchers and graduates in state-of-the-art technology and science within a relatively short time (perhaps five years).
With the availability of research personnel with the requisite qualifications, newer fields will be pursued, such as energy bioinformatics (e.g. biofuel genomics), environmental bioinformatics (e.g. metagenomics), synthetic biology, plant bioinformatics, computational immunology, and so on.
P. C. Mahalanobis visualized statistics as a key technology for national development. He made a significant contribution to building up the statistical system in the country and to using the economic statistics generated by that system for planning national development. In recent years, however, there has been a marked deterioration in the quality of economic statistics generated by the Indian statistical system, and serious concern regarding the reliability and timeliness of the statistics it produces. One must note that the Indian economy has gone through a paradigm shift from a planned economy, with the public sector taking a commanding role, to a free-enterprise economy in which the state’s role is only to ensure competitiveness and justice through regulation. With this change, a great deal of economic activity is carried out by the private sector. Much of the information traditionally collected may now be of little relevance, while some new types of information are becoming relevant. In view of these developments, the government constituted a National Statistical Commission with Dr. C. Rangarajan as Chairman. The Commission submitted its report in 2001, and many of its recommendations are being implemented through the new National Statistical Commission headed by Professor S. Tendulkar.
The structure of the industrial economy at the macro level is now more a concern of the private sector and the regulators than of the central government or the Planning Commission. The private sector’s need for economic data differs from the state’s need for such data. Hence there must now be a reform in this part of the statistical system, a reform in which centralized data collection through a National Sample Survey may partly be replaced or supplemented by special-purpose sample surveys conducted by the government and industry. Large sample surveys conducted five years apart in a slow-growth economy may have to be replaced by smaller surveys canvassed more frequently. The private sector must actively participate in collecting independent information on the industrial structure through industry associations and chambers of commerce and industry. The private sector must also have a larger say than before in data collection by the government and its agencies. This calls for a large number of trained persons in the private sector with knowledge of sample survey design and analysis.
Once we have such data on industrial structure collected by two or more independent agencies, two important questions emerge. How should one combine the information from independent sources? How do we standardize concepts and methods so as to make such pooling of information more useful? Even at the individual firm level, large amounts of data are becoming available, raising statistical and computing questions of data warehousing, data mining, and the analysis of large volumes of information. When firms and industries collect independent data and perform independent data analyses, a further question arises: how to perform a meta-analysis that gathers combined evidence from the individual, independent pieces of evidence.
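One standard way to combine estimates from independent sources is a fixed-effect meta-analysis, which pools them by inverse-variance weighting. The sketch below uses hypothetical estimates and standard errors:

```python
# Fixed-effect meta-analysis: combine independent estimates of the
# same quantity by inverse-variance weighting.

# Each tuple: (estimate, standard error) from an independent agency.
studies = [(10.2, 0.5), (9.6, 0.8), (10.5, 0.4)]

# More precise estimates (smaller standard errors) get larger weights.
weights = [1 / se**2 for _, se in studies]
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"pooled estimate = {pooled:.2f} ± {pooled_se:.2f}")
```

The pooled standard error is smaller than any individual one, which is precisely the gain from combining independent sources; standardized concepts and methods are what make the estimates comparable enough for such pooling to be valid.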
With the introduction of the 73rd and 74th Amendments to the Indian constitution there is a gradual shift from centralized planning to decentralized planning for social development. Public policies now focus on the socio-economic structure prevailing in small local areas, as decentralized planning for social development requires that type of data. There are special problems associated with sample surveys for small areas, called small area statistics. Some of the interesting issues in this field are:
1. Rare events become rarer
2. Small samples create large variances for sample estimates
3. Confidentiality of respondents, since with small-area statistics it becomes easier to identify an individual respondent
In addition to sample survey data, other forms of data, such as officially recorded data and data gathered by government and quasi-government organizations, may be needed for decentralized planning. Again the question arises of how one should combine different data obtained through different methods with different degrees of precision and reliability. Information collected by methods other than random sampling may be treated as prior or conditional information, and one may use Bayesian methods and regression models to combine such information.
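The Bayesian combination just described can be sketched with a conjugate normal-normal update, where non-survey information plays the role of the prior. All numbers below are hypothetical, and the variances are assumed known:

```python
# Combining prior information (e.g. administrative records) with a
# small-area survey estimate via a conjugate normal-normal update.

prior_mean, prior_var = 50.0, 16.0    # from administrative data
survey_mean, survey_var = 42.0, 4.0   # from a small random sample

# Posterior precision is the sum of the precisions; the posterior mean
# is a precision-weighted average of the prior and survey estimates.
post_var = 1 / (1 / prior_var + 1 / survey_var)
post_mean = post_var * (prior_mean / prior_var + survey_mean / survey_var)

print(f"posterior mean = {post_mean:.1f}, variance = {post_var:.1f}")
```

The more precise survey estimate dominates, but the prior still pulls the answer toward the administrative figure, exactly the behaviour one wants when a small-area sample alone would have a large variance.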
Thus the newly emerging economic scenario poses new data collection and data analysis issues that have to be examined by sample survey experts and other theoretical and applied statisticians. One of the reasons for the deterioration of the Indian statistical system is the de-linking of the data collection system from its users, academic researchers in particular. As a result, there has been very little interaction between the academic community that uses the data and the government that collects it. It is proposed that the statistics and computer science groups of AIMSCS provide the needed academic interaction with data collecting agencies through periodic workshops and seminars.
AIMSCS plans to organize a series of workshops and conferences on the following topics of interest to the public sector and private sector:
- Planning for knowledge economy
- Data Users’ conference
- Database of the Indian economy
- Conference on recent advances in sample survey methods
- Small area statistics
- Data warehousing and data mining
- Meta analysis
The basic idea of conducting such workshops and seminars is to direct advanced research in econometrics and statistics towards practical problems faced by business, industry, and government.
We also see a tremendous scope for advanced research in econometrics with special emphasis on the econometric modeling useful for the financial sector. This research effort could be directed towards:
- Exploration of better economic theories by harnessing the information content of large bodies of data becoming increasingly available
- Application of Statistical Design of Experiments to Economic Experiments in a laboratory setting
- Better choice of theories among alternatives available
- Analysis of risk and Value at Risk (VaR)
- Improving the financial models for pricing of financial assets
Econometric research was carried out for more than five decades in an environment where only a limited amount of data was available, with the assumption of either a static model or a dynamic model with a stable structure. This environment has changed significantly in recent years. Micro data at the firm level have become available on the amount of sales to different groups of customers, at different locations, and at different points in time. The institutional environment is constantly changing, so that treating the firm as a unit of analysis is becoming meaningless as a result of mergers and acquisitions.
Econometric research has been carried out with emphasis on a particular mathematical theory forming the a priori basis for understanding empirical reality. The selection of a particular theory is often not based on rigorous scientific principles that keep a check between the theory and observed reality. This state of economic science has been well recognized by the economics profession, as reflected in presidential addresses before the American Economic Association. There is a need for intensive econometric research leading to a better choice among theories. Such a choice must be based on an extensive database and extensive experimentation. One may use one half of the data for model selection and the other half for drawing inferences. There is also a need to define and extract “information” about the underlying economic structure from data generated in a non-experimental setup.
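The split-sample idea above (one half of the data for model selection, the other half for inference) can be sketched as follows; the synthetic data, candidate models, and error criterion are illustrative assumptions:

```python
# Split-sample model selection: choose a model on one half of the data,
# then assess it honestly on the untouched other half.
import random

random.seed(0)
# Synthetic data: y = 3x + Gaussian noise.
data = [(x, 3 * x + random.gauss(0, 0.5)) for x in range(40)]
random.shuffle(data)
selection, inference = data[:20], data[20:]

def mse(slope, sample):
    """Mean squared error of the model y = slope * x on a sample."""
    return sum((y - slope * x) ** 2 for x, y in sample) / len(sample)

# Candidate theories: a few candidate slopes for y = slope * x.
candidates = [2.0, 2.5, 3.0, 3.5]
best = min(candidates, key=lambda s: mse(s, selection))

# The inference half was never used for selection, so its error
# estimate is not biased by the selection step.
print(f"chosen slope = {best}, held-out MSE = {mse(best, inference):.2f}")
```

Because the inference half played no part in choosing the model, statistics computed on it are free of the selection bias that would arise from selecting and testing on the same data.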
There is a new branch of economics that is gaining importance in lending scientific credibility to economics as a science: experimental economics. Economists do most of the research in experimental economics with only rudimentary attention to the statistical design of experiments; in many research papers on experimental economics, the statistical design of the experiment is not even described. There is very wide scope for experimental economics to build and test economic theories. Economic experiments require payments to the subjects of the experiment to induce them to play as per the design of the experiment. Here India has a tremendous advantage over other industrial countries, as the cost of conducting experiments is much lower in India.
Experimental economics is widely used in conducting experiments on different types of market organization. In the newly evolving economic environment, where markets replace state’s role and where regulation is deemed mandatory to curb market power and safeguard the consumer interests, it is quite useful to design the appropriate market mechanisms for such things as the allocation of bandwidth by TRAI, auctioning of government land for special economic zones, creating markets for public utilities such as electric power and water. One of the major fields of application of experimental economics is that of examining the effects of introducing different regulatory rules on market efficiency and protection of investor interests in capital markets.
Uncertainty confronts economics, business, and our personal lives every day. Making decisions in such a risky environment calls for understanding what is meant by “risk” and how to protect against its severe consequences. Globalization and an increasingly competitive environment expose our commodity and financial markets to a good amount of risk. The adverse impact of such risk is measured by what financial economists call “Value at Risk”. This concept is closely associated with the probability of large deviations. Only recently have theoretical contributions been made to define and refine this concept of the probability of large deviations. There is immense scope for further advanced research on “Risk Analysis”.
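One simple, common way to estimate Value at Risk is historical simulation, sketched below with a hypothetical series of daily portfolio returns; this is an illustrative method, not the only way the concept is operationalized:

```python
# Historical-simulation Value at Risk (VaR): the loss threshold that
# observed losses stayed below a given fraction of the time.

# Hypothetical daily portfolio returns, in percent.
returns = [-2.1, 0.4, 1.2, -0.8, 0.3, -3.5, 0.9, -1.4, 2.0, -0.2,
           0.7, -2.8, 1.1, 0.5, -0.6, -4.2, 0.8, 1.5, -1.9, 0.1]

confidence = 0.95
losses = sorted(-r for r in returns)       # losses, ascending
idx = round(confidence * len(losses)) - 1  # 95th-percentile loss
var_95 = losses[idx]

print(f"1-day 95% VaR ≈ {var_95:.1f}% of portfolio value")
```

Here the estimate says that on 95% of the observed days the loss did not exceed this threshold; the remaining 5% of days, the large deviations, are exactly the ones the probability theory of large deviations tries to characterize, since with so few tail observations the empirical quantile alone is a fragile estimate.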
As rare events are unlikely to be observed, one must estimate the probability of their occurrence by observing some regularity between such rare phenomena and not-so-rare phenomena within the same data-generating process. When one takes observations from a stochastic environment, each observation contains information not only about its own occurrence but about its occurrence relative to the other observations in the ensemble. Hence the sample provides information on the variable itself, its deviation from the average, its rate of decline or increase at any level, and so on. One uses factor analysis or Chernoff’s faces to characterize the pattern of multi-dimensional data. One can likewise think of characterizing the probability distribution of a one-dimensional random variable by the probability distribution of several measures derived from the given sample of observations. Information on such derived measures as positive deviations, negative deviations, and rates of increase or decrease, along with their probabilities, can be used to obtain the joint distribution of these “dimensions”, or “faces”, of the data. It then becomes possible to estimate the conditional probability of rare events, or the probability of large deviations that are very unlikely, from information on the probability of small deviations that are more likely. When it is difficult to determine theoretically the form of such a joint probability distribution of the different dimensions of the data, it is possible to obtain the joint distribution empirically by computer simulation.
Risk analysis and option pricing models are based on combining some basic economic principles with observed stochastic movements of prices. The basic economic principles are rationality and no possibility for profiting through arbitrage over time and space. The principle of rationality is contingent on the assumptions one makes about people’s expectations. Such expectations are functionals of the underlying probability distributions, possibly subjective probability distributions. The influence of large deviations of expectations on price could be substantial. Hence probability of large deviations becomes quite important even in modeling the stochastic prices of financial assets.
In our opinion there is an immense scope for advanced research on large deviations with applications to the financial sector. The recent introduction of special provisioning, in Basel-II norms on capital adequacy, for an organization’s efforts at measuring value at risk will perhaps enhance research in this area. AIMSCS is ideally situated to bring together a fusion of efforts from economists (financial or managerial), statisticians (probability theorists), and computer scientists in this vital applied area of research useful to the entire financial sector.
(The prestigious Abel Prize, an equivalent of the Nobel Prize, was recently awarded to Professor S. R. S. Varadhan of the Courant Institute of Mathematical Sciences for his work on the probability theory of large deviations. Professor Varadhan is a member of the International Advisory Committee of AIMSCS.)
As mentioned earlier, these are the areas in which AIMSCS plans to initiate activities; other statistical research will be based on the interests and expertise of the highly competent researchers AIMSCS will recruit. AIMSCS can and will collaborate with universities, institutions, organizations, postgraduate colleges, and selected undergraduate colleges in this venture, in research and training activities useful for planning and national development.