Qualitative data is subjective, rich, in-depth information normally presented in the form of words. In undergraduate dissertations, the most common form of qualitative data is derived from semi-structured or unstructured interviews, although other sources can include observations, life histories, journals, and documents of all kinds, including newspapers.
Qualitative data from interviews can be analysed for content (content analysis) or for the language used (discourse analysis). Qualitative data is difficult to analyse and often opportunities to achieve high marks are lost because the data is treated casually and without rigour. Here we concentrate on the content analysis of data from interviews.
Theory
When using a quantitative methodology, you are normally testing theory through the testing of a hypothesis. In qualitative research, you are either exploring the application of a theory or model in a different context or are hoping for a theory or a model to emerge from the data. In other words, although you may have some ideas about your topic, you are also looking for ideas, concepts and attitudes often from experts or practitioners in the field.
Collecting and organising data
The means of collecting and recording data through interviews and the possible pitfalls are well documented elsewhere but in terms of subsequent analysis, it is essential that you have a complete and accurate record of what was said. Do not rely on your memory (it can be very selective!) and either tape record the conversation (preferably) or take copious notes. If you are taking notes, write them up straight after the interview so that you can elaborate and clarify. If you are using a tape recorder, transcribe the exact words onto paper.
However you record the data, you should end up with a hard copy of either exactly what was said (transcript of tape recording) or nearly exactly what was said (comprehensive notes). It may be that parts of the interview are irrelevant or are more in the nature of background material, in which case you need not put these into your transcript but do make sure that they are indeed unnecessary. You should indicate omissions in the text with short statements.
You should transcribe exactly what is said, with grammatical errors and so on. It does not look very authentic if all your respondents speak with perfect grammar and BBC English! You may also want to indicate other things that happen such as laughter.
Each transcript or set of notes should be clearly marked with the name of the interviewee, the date and place and any other relevant details and, where appropriate, cross-referenced to clearly labelled tapes. These transcripts and notes are not normally required to be included in your dissertation but they should be available to show your supervisor and the second marker if required.
You may wonder why you should go to all the bother of transcribing your audiotapes. It is certainly a time-consuming business, although much easier if you can get access to a transcription machine that enables you to start and stop the tape with your feet while carrying on typing. It is even easier if you have access to an audio-typist who will do this labour intensive part for you. The advantage of having the interviews etc in hard copy is that you can refer to them very quickly, make notes in the margins, re-organise them for analysis, make coding notations in the margins and so on. It is much slower in the long run to have to continually listen to the tapes. You can read much faster than the tape will play! It also has the advantage, especially if you do the transcription yourself, of ensuring that you are very familiar with the material.
Content analysis
Analysis of qualitative data is not simple, and although it does not require the complicated statistical techniques of quantitative analysis, it is nonetheless difficult to handle the usually large amounts of data in a thorough, systematic and relevant manner. Marshall and Rossman offer this graphic description:
"Data analysis is the process of bringing order, structure and meaning to the mass of collected data. It is a messy, ambiguous, time-consuming, creative, and fascinating process. It does not proceed in a linear fashion; it is not neat. Qualitative data analysis is a search for general statements about relationships among categories of data."
Marshall and Rossman, 1990:111
Hitchcock and Hughes take this one step further:
"…the ways in which the researcher moves from a description of what is the case to an explanation of why what is the case is the case."
Hitchcock and Hughes 1995:295
Content analysis consists of reading and re-reading the transcripts looking for similarities and differences in order to find themes and to develop categories. Having the full transcript is essential to make sure that you do not leave out anything of importance by only selecting material that fits your own ideas. There are various ways that you can mark the text:
Coding paragraphs – This is where you label each paragraph with a topic/theme/category by writing an appropriate word in the margin.
Highlighting paragraphs/sentences/phrases – This is where you use highlighter pens of different colours or different coloured pens to mark the bits about the different themes. For example, you could mark the bits relating to childcare in one colour and those relating to pay in another, and so on. The use of coloured pens will help you find the relevant bits you need when you are writing up.
With both the above methods you may find that your categories change and develop as you do the analysis. What is important is that you can see that by analysing the text in such a way, you pick up all the references to a given topic and don’t leave anything out. This increases the objectivity and reduces the risk of you only selecting bits that conform to your own preconceptions.
You then need to arrange the data so that all the pieces on one theme are together. There are several ways of doing this:
• Cut and put in folders approach
Make several copies of each transcript (keeping the master safe) and cut up each one according to what is being discussed (your themes or categories). Then sort them into folders, one for each category, so that you have all together what each interviewee said about a given theme. You can then compare and look for similarities/differences/conclusions etc. Do not forget to mark each slip of paper with the respondent’s name, initials or some sort of code or you won’t be able to remember who said what. Several copies may be needed in case one paragraph contains more than one theme or category. This is time consuming and messy at first, but easier in the long run especially if you have a lot of data and categories.
• Card index system
Each transcript must be marked with line numbers for cross-referencing purposes. You have a card for each theme or category and cross-reference each card with each transcript so that you can find what everyone has said about a certain topic. This is quicker initially but involves a lot of referring back to the original transcripts when you write up your results and is usually only suitable for small amounts of data.
• Computer analysis
If you have access to a computer package that analyses qualitative data (e.g. NUDIST) then you can use this. Such packages vary in the way they work, but they share some basic principles. You can upload your transcripts created in a compatible word-processing package, and the software then allows you to mark different sections with various headings/themes. It will then sort all those sections marked with a particular heading and print them off together. This is the electronic version of the folders approach! It is also possible to use a word-processing package to cut and paste comments and to search for particular words.
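For illustration, here is a minimal sketch, in Python, of the electronic "folders" idea described above: each coded segment of a transcript carries the respondent's initials and one or more theme codes, and the script collects all segments per theme. The respondents, themes and quotes are invented for the example.

```python
# Group coded transcript segments into one "folder" per theme.
# All respondents, themes and quotes here are invented for illustration.
from collections import defaultdict

segments = [
    {"respondent": "AB", "themes": ["childcare"], "text": "Finding a nursery place was the hardest part."},
    {"respondent": "CD", "themes": ["pay", "childcare"], "text": "The pay barely covers what I spend on childcare."},
    {"respondent": "EF", "themes": ["pay"], "text": "I haven't had a rise in three years."},
]

folders = defaultdict(list)          # one "folder" per theme
for seg in segments:
    for theme in seg["themes"]:      # a segment can belong to more than one theme
        folders[theme].append(f'{seg["respondent"]}: {seg["text"]}')

for theme, quotes in folders.items():
    print(f"--- {theme} ---")
    for quote in quotes:
        print(quote)
```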
There is a great danger of subjective interpretation. You must accurately reflect the views of the interviewees and be thorough and methodical. You need to become familiar with your data. You may find this a daunting and stressful task or you may really enjoy it – sometimes so much that you can delay getting down to the next stage which is interpreting and writing up!
Presenting qualitative data in your dissertation
This would normally follow the topics, themes and categories that you have developed in the analysis; these, in turn, are likely to have been themes that came out in the literature and may have formed the basis for your interview questions. It is usually a mistake to go through each interviewee in turn, reporting what they said on each topic. This is cumbersome and does not give you the scope to compare and contrast their ideas with the ideas of others.
Do not analyse the data on a question-by-question basis. You should summarise the key themes that emerge from the data and may give selected quotes if these are particularly appropriate.
Make a point and then illustrate it with an appropriate quote. Quotes make the whole text much more interesting and enjoyable to read, but be wary of including too many. You should also evaluate your own findings against the literature and refer to it where appropriate. Remember the two concepts of presenting and discussing your findings. By presenting we mean a factual description/summary of what you found. The discussion element is your interpretation of what these findings mean and how they confirm or contradict what you wrote about in your literature section.
If you are trying to test a model then this will have been explored in your literature review and your methodology section will explain how you intend to test it. Your methodology should include who was interviewed with a clear rationale for your choices to explain how this fits into your research questions, how you ensured that the data was unbiased and as accurate as possible, and how the data was analysed. If you have been able to present an adapted model appropriate to your particular context then this should come towards the end of your findings section.
It may be desirable to put a small number of transcripts in the appendices but discuss this with your supervisor. Remember you have to present accurately what was said and what you think it means.
In order to write up your methodology section, you are strongly recommended to do some reading in research textbooks on interview techniques and the analysis of qualitative data. There are some suggested texts in the Further Reading section at the end of this pack.
Wednesday, 11 February 2009
SEM, a Combination of Factor Analysis - Syndi Octakomala D S (15406098)
The History of SEM
SEM is a very general statistical modelling technique that is widely used across many fields of science. SEM can be viewed as a combination of factor analysis (confirmatory factor analysis) and regression or path analysis.
The subject matter of SEM is the set of theoretical constructs represented by latent factors. The relationships between these theoretical constructs are represented by regression or path coefficients between the factors. SEM imposes a structure on the covariances between the observed variables, which is why it is also often called covariance structure modelling.
The model can, however, be extended to include the means of the observed variables or factors. Many researchers refer to covariance structure modelling as the LISREL model, but this usage is not quite accurate: LISREL stands for LInear Structural RELationships and is also one of the programs commonly used by researchers for SEM analysis. The name LISREL was given by Jöreskog to the statistical program for SEM analysis that he developed under that name (J.J. Hox and T.M. Bechger, 2004).
Today SEM is no longer restricted to linear relationships, and its possible extensions go beyond the original LISREL program. SEM provides a convenient and very general framework for statistical analysis that includes several traditional multivariate procedures, for example factor analysis, regression analysis, discriminant analysis and canonical correlation, as special cases.
SEM is often depicted by a path diagram. The model is based on a system of linear equations first developed by Sewall Wright, a geneticist, in 1921 in his phylogenetic studies (Stoelting, 1992; Golob, 2001). Path analysis was subsequently adopted by the social sciences during the 1960s and early 1970s.
Sociologists in particular recognised the potential of path analysis in connection with partial correlation. Path analysis was later superseded by SEM as developed by Jöreskog (1970, 1973), Keesling (1972) and Wiley (1973), which Bentler (1980) called the JKW model. The Jöreskog-Keesling-Wiley (JKW) model came to be regarded as modern SEM and became popular under the name LISREL (Linear Structural Relationships), the program developed by Jöreskog (1970), Jöreskog, Gruvaeus and van Thillo (1970), and Jöreskog and Sörbom (1979), as mentioned above.
As a statistical model, SEM is usually expressed as a set of matrix equations. In the early 1970s, when the LISREL software was first introduced into research, it required the model to be specified in terms of these matrices. Researchers therefore had to derive the matrix representation from the path diagram and supply the software with a series of matrices for the different sets of parameters, such as factor loadings and regression coefficients. More recent software allows researchers to specify the model directly as a path diagram, such as the software developed by James L. Arbuckle (1995) known as AMOS (Analysis of Moment Structures) (RumahStatistik).
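To show how a measurement (factor) model and a structural (path) model are combined, here is a minimal sketch assuming the Python package semopy, which accepts lavaan-style model syntax; the data are simulated and all variable names are invented for the example.

```python
# A minimal SEM sketch: two latent factors, each measured by three observed
# indicators, plus a structural regression between the factors.
# Assumes the semopy package; the data are simulated for illustration only.
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(0)
n = 500
ksi = rng.normal(size=n)                            # exogenous latent factor
eta = 0.6 * ksi + rng.normal(scale=0.8, size=n)     # endogenous latent factor

# Each latent factor is reflected by three noisy indicators (the factor-analysis part).
data = pd.DataFrame({f"x{i}": 0.8 * ksi + rng.normal(scale=0.5, size=n) for i in range(1, 4)})
for i in range(1, 4):
    data[f"y{i}"] = 0.8 * eta + rng.normal(scale=0.5, size=n)

# Measurement model (=~) plus structural regression (~): the path-analysis part.
desc = """
KSI =~ x1 + x2 + x3
ETA =~ y1 + y2 + y3
ETA ~ KSI
"""
model = semopy.Model(desc)
model.fit(data)
print(model.inspect())   # factor loadings and the structural path coefficient
```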
Tuesday, 10 February 2009
Quantitative Methods in Decision Making -- Dimas Hartanto E 15406044
In general, there are two approaches to decision making: the qualitative approach and the quantitative approach.
Put simply, the qualitative approach relies on subjective judgement of a problem, whereas the quantitative approach bases the decision on an objective assessment derived from a mathematical model. If you forecast the weather based on experience, the approach is qualitative; if the forecast is based on a mathematical model, the approach is quantitative. Deciding whether to hire an applicant on the basis of an entrance test score is another example of the quantitative approach, whereas basing the decision on an interview to assess personality and motivation is a qualitative approach.
The quantitative approach to decision making generally uses mathematical models. Mathematics was developed by humans thousands of years ago and has been applied in many fields, one of which is decision making. As a simple example, consider how to arrange 50 chairs of a given size in a room of a given size. Given the dimensions of the chairs and the room, the best arrangement can be determined, whether 5 rows of 10 columns or the other way round, depending entirely on the available space.
More complex cases naturally require more complicated mathematical models, and many quantitative analysis models have been developed for decision making.
What does the process look like?
In general, every quantitative method converts raw data into information that is useful for decision making:
RAW DATA -> QUANTITATIVE ANALYSIS -> USEFUL INFORMATION.
For example, suppose products A and B are made from raw materials X, Y and Z, and the profit on selling each product is known. The figures for how much of each material is available and the profit per product are the raw data. Quantitative analysis processes these data to produce the production mix (how many units of A and B to make) that yields the optimal profit. This result is the useful information for decision making.
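A hedged sketch of the product-mix example above: all numbers (profits, material requirements and availabilities) are invented for illustration, and linear programming via SciPy finds the production mix that maximises profit.

```python
# Choose production quantities of products A and B, subject to limited
# raw materials X, Y and Z, so that total profit is maximised.
# All figures are hypothetical.
from scipy.optimize import linprog

profit = [40, 30]                    # profit per unit of A and B
c = [-p for p in profit]             # linprog minimises, so negate to maximise

# Rows: materials X, Y, Z; columns: usage per unit of A and B.
A_ub = [[2, 1],
        [1, 3],
        [1, 1]]
b_ub = [100, 90, 50]                 # units of X, Y, Z available

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
qty_a, qty_b = res.x
print(f"Produce {qty_a:.1f} of A and {qty_b:.1f} of B for a profit of {-res.fun:.1f}")
```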
Steps in decision making
Define the problem. Put simply, a problem is the gap between the desired situation and the actual one. If a student wants an A but the result falls short of that, the student has a problem. Essentially, all the decision-making steps are carried out to eliminate or reduce the gap between what is expected and what actually happens.
Develop a model. A model is a representation of a real situation. Models can take various forms: physical, logical or mathematical. A miniature car or a scale model of a house is a physical model; an electric current through a particular circuit, or water flowing through a particular pattern of channels, can serve as a logical model of traffic flow. An economic model stating that income is a function of consumption and saving is an example of a mathematical model.
Model development introduces the notion of variables, whose values influence the decision to be taken. In real cases, some of these variables can be controlled and others cannot. The length of the red phase of a traffic light can easily be controlled, but the speed and number of vehicles passing along a road cannot.
Collect data. Accurate data are essential if the quantitative analysis is to produce the desired output. Data sources for testing the model can include company reports such as financial statements and other corporate documents, interviews, direct field measurements and statistical sampling.
Generate a solution. In the quantitative approach, a solution is obtained by manipulating the model with the data produced in the previous step. Many methods can be used to generate a solution, such as solving the equations of the mathematical model developed earlier, using a trial-and-error approach with different input data to produce the "best" solution, or applying an algorithm, a specific set of detailed solution steps, developed for the problem.
Whatever method is used, the resulting solution must be practical and implementable. The "best" solution must not be overly complicated and must actually solve the problem at hand.
Test the solution. To ensure that the solution obtained is really the best, it must be tested, both on the model and on the input data. This testing checks the accuracy and completeness of the model and the data used. To check the accuracy and completeness of the data, data obtained from different sources can be fed into the model and the results compared. An accurate and complete model and data set should produce consistent results. This testing is important before the results are analysed.
Analyse the results. The results are analysed to understand the steps that must be taken once a decision has been chosen, and the implications of those steps must also be analysed. Sensitivity analysis is very important at this stage: it is carried out by varying the model's input values and observing how the results change. Sensitivity analysis therefore helps in understanding the problem at hand and the possible answers to it.
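A minimal sketch of the sensitivity-analysis idea, using a deliberately simple hypothetical model (a single product with a fixed capacity): the input price is varied and the recommended decision and resulting profit are observed.

```python
# Re-solve a simple decision model while varying one input (the unit price)
# and watch how the recommended decision and outcome change.
# The model and all numbers are hypothetical.
def best_profit(unit_price, unit_cost=6.0, capacity=100):
    # Produce at full capacity only if each unit is profitable.
    qty = capacity if unit_price > unit_cost else 0
    return qty, qty * (unit_price - unit_cost)

for price in [5.0, 6.0, 7.0, 8.0, 10.0]:
    qty, profit = best_profit(price)
    print(f"unit price {price:4.1f} -> produce {qty:3d}, profit {profit:6.1f}")
```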
Implement the results. Implementation means applying the results of the analysis to the processes within the organisation. Just as important at this stage is monitoring the results of applying the solution. Be aware, however, that implementing the solution is not without obstacles. One possible obstacle is convincing management that the proposed solution is the best one and will solve the problem. In that case, sensitivity analysis of the resulting model can again be used to "sell" the solution to management.
Time Series Analysis - Indah Dwi Kartika 15406012
In the following topics, we will first review techniques used to identify patterns in time series data (such as smoothing and curve fitting techniques and autocorrelations), then we will introduce a general class of models that can be used to represent time series data and generate predictions (autoregressive and moving average models). Finally, we will review some simple but commonly used modeling and forecasting techniques based on linear regression. For more information, see the topics that follow.
General Introduction
In the following topics, we will review techniques that are useful for analyzing time series data, that is, sequences of measurements that follow non-random orders. Unlike the analyses of random samples of observations that are discussed in the context of most other statistics, the analysis of time series is based on the assumption that successive values in the data file represent consecutive measurements taken at equally spaced time intervals.
Detailed discussions of the methods described in this section can be found in Anderson (1976), Box and Jenkins (1976), Kendall (1984), Kendall and Ord (1990), Montgomery, Johnson, and Gardiner (1990), Pankratz (1983), Shumway (1988), Vandaele (1983), Walker (1991), and Wei (1989).
Two Main Goals
There are two main goals of time series analysis: (a) identifying the nature of the phenomenon represented by the sequence of observations, and (b) forecasting (predicting future values of the time series variable). Both of these goals require that the pattern of observed time series data is identified and more or less formally described. Once the pattern is established, we can interpret and integrate it with other data (i.e., use it in our theory of the investigated phenomenon, e.g., seasonal commodity prices). Regardless of the depth of our understanding and the validity of our interpretation (theory) of the phenomenon, we can extrapolate the identified pattern to predict future events.
Identifying Patterns in Time Series Data :
- Systematic pattern and random noise
As in most other analyses, in time series analysis it is assumed that the data consist of a systematic pattern (usually a set of identifiable components) and random noise (error) which usually makes the pattern difficult to identify. Most time series analysis techniques involve some form of filtering out noise in order to make the pattern more salient.
- Two general aspects of time series patterns
Most time series patterns can be described in terms of two basic classes of components: trend and seasonality. The former represents a general systematic linear or (most often) nonlinear component that changes over time and does not repeat or at least does not repeat within the time range captured by our data (e.g., a plateau followed by a period of exponential growth). The latter may have a formally similar nature (e.g., a plateau followed by a period of exponential growth), however, it repeats itself in systematic intervals over time. Those two general classes of time series components may coexist in real-life data. For example, sales of a company can rapidly grow over years but they still follow consistent seasonal patterns (e.g., as much as 25% of yearly sales each year are made in December, whereas only 4% in August).
This general pattern is well illustrated in a "classic" Series G data set (Box and Jenkins, 1976, p. 531) representing monthly international airline passenger totals (measured in thousands) in twelve consecutive years from 1949 to 1960 (see the example data file G.sta). If you plot the successive observations (months) of airline passenger totals, a clear, almost linear trend emerges, indicating that the airline industry enjoyed steady growth over the years (approximately 4 times more passengers traveled in 1960 than in 1949). At the same time, the monthly figures follow an almost identical pattern each year (e.g., more people travel during holidays than during any other time of the year). This example data file also illustrates a very common general type of pattern in time series data, where the amplitude of the seasonal changes increases with the overall trend (i.e., the variance is correlated with the mean over the segments of the series). This pattern, which is called multiplicative seasonality, indicates that the relative amplitude of seasonal changes is constant over time and is thus related to the trend.
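The trend/multiplicative-seasonality decomposition can be sketched as follows; a simulated series in the spirit of the airline data is used here rather than the actual Series G file.

```python
# Separate trend and multiplicative seasonality with a classical decomposition.
# The monthly series is simulated (trend x seasonal factor x noise) for illustration.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(1)
months = pd.date_range("1949-01", periods=144, freq="MS")
trend = np.linspace(100, 450, 144)
seasonal = np.tile(1 + 0.25 * np.sin(2 * np.pi * np.arange(12) / 12), 12)
series = pd.Series(trend * seasonal * rng.normal(1, 0.02, 144), index=months)

result = seasonal_decompose(series, model="multiplicative", period=12)
print(result.trend.dropna().head())     # estimated trend component
print(result.seasonal[:12])             # the repeating seasonal factors
```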
- Trend Analysis
There are no proven "automatic" techniques to identify trend components in time series data; however, as long as the trend is monotonic (consistently increasing or decreasing), that part of the data analysis is typically not very difficult. If the time series data contain considerable error, then the first step in the process of trend identification is smoothing.
Smoothing. Smoothing always involves some form of local averaging of data such that the nonsystematic components of individual observations cancel each other out. The most common technique is moving average smoothing which replaces each element of the series by either the simple or weighted average of n surrounding elements, where n is the width of the smoothing "window" (see Box & Jenkins, 1976; Velleman & Hoaglin, 1981). Medians can be used instead of means. The main advantage of median as compared to moving average smoothing is that its results are less biased by outliers (within the smoothing window). Thus, if there are outliers in the data (e.g., due to measurement errors), median smoothing typically produces smoother or at least more "reliable" curves than moving average based on the same window width. The main disadvantage of median smoothing is that in the absence of clear outliers it may produce more "jagged" curves than moving average and it does not allow for weighting.
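A brief sketch contrasting moving-average and running-median smoothing on a synthetic noisy series with a single outlier:

```python
# Compare moving-average and running-median smoothing near an outlier.
# The series is synthetic and used only to illustrate the difference.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
y = pd.Series(np.linspace(0, 10, 60) + rng.normal(0, 0.5, 60))
y.iloc[30] += 15                         # an outlier, e.g. a measurement error

window = 5
mean_smooth = y.rolling(window, center=True).mean()      # moving average
median_smooth = y.rolling(window, center=True).median()  # running median

# Around the outlier the median-smoothed curve is barely disturbed, while the
# moving average is pulled upwards across the whole window.
print(pd.DataFrame({"mean": mean_smooth, "median": median_smooth}).iloc[27:34])
```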
In the relatively less common cases (in time series data), when the measurement error is very large, the distance weighted least squares smoothing or negative exponentially weighted smoothing techniques can be used. All those methods will filter out the noise and convert the data into a smooth curve that is relatively unbiased by outliers (see the respective sections on each of those methods for more details). Series with relatively few and systematically distributed points can be smoothed with bicubic splines.
Fitting a function. Many monotonic time series can be adequately approximated by a linear function; if there is a clear monotonic nonlinear component, the data first need to be transformed to remove the nonlinearity. Usually a logarithmic, exponential, or (less often) polynomial function can be used.
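A minimal sketch of the transform-then-fit idea on a synthetic exponential trend: after a log transform the trend is approximately linear, so a straight line can be fitted to the logged data.

```python
# Fit a straight line to the log of an exponential-looking series.
# The series is synthetic; the true growth rate is 0.03 per step.
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(100)
y = 50 * np.exp(0.03 * t) * rng.normal(1, 0.02, 100)    # exponential growth plus noise

slope, intercept = np.polyfit(t, np.log(y), deg=1)      # linear fit on the log scale
print(f"estimated growth rate per step: {slope:.4f}")   # close to 0.03
print(f"fitted level at t=0: {np.exp(intercept):.1f}")  # close to 50
```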
- Analysis of Seasonality
Seasonal dependency (seasonality) is another general component of the time series pattern. The concept was illustrated in the example of the airline passengers data above. It is formally defined as correlational dependency of order k between each i'th element of the series and the (i-k)'th element (Kendall, 1976) and measured by autocorrelation (i.e., a correlation between the two terms); k is usually called the lag. If the measurement error is not too large, seasonality can be visually identified in the series as a pattern that repeats every k elements.
Autocorrelation correlogram. Seasonal patterns of time series can be examined via correlograms. The correlogram (autocorrelogram) displays graphically and numerically the autocorrelation function (ACF), that is, serial correlation coefficients (and their standard errors) for consecutive lags in a specified range of lags (e.g., 1 through 30). Ranges of two standard errors for each lag are usually marked in correlograms, but typically the size of the autocorrelation is of more interest than its reliability (see Elementary Concepts) because we are usually interested only in very strong (and thus highly significant) autocorrelations.
Examining correlograms. While examining correlograms, one should keep in mind that autocorrelations for consecutive lags are formally dependent. Consider the following example. If the first element is closely related to the second, and the second to the third, then the first element must also be somewhat related to the third one, etc. This implies that the pattern of serial dependencies can change considerably after removing the first order autocorrelation (i.e., after differencing the series with a lag of 1).
Partial autocorrelations. Another useful method to examine serial dependencies is to examine the partial autocorrelation function (PACF) - an extension of autocorrelation, where the dependence on the intermediate elements (those within the lag) is removed. In other words, the partial autocorrelation is similar to the autocorrelation, except that when calculating it, the (auto)correlations with all the elements within the lag are partialled out (Box & Jenkins, 1976; see also McDowall, McCleary, Meidinger, & Hay, 1980). If a lag of 1 is specified (i.e., there are no intermediate elements within the lag), then the partial autocorrelation is equivalent to the autocorrelation. In a sense, the partial autocorrelation provides a "cleaner" picture of serial dependencies for individual lags (not confounded by other serial dependencies).
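A short sketch computing the ACF and PACF for a synthetic monthly-style series with a period of 12, using statsmodels: the autocorrelation at lag 12 stands out, while the PACF removes the dependence carried through intermediate lags.

```python
# Compute ACF and PACF values for a synthetic seasonal series (period 12).
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(4)
n = 300
seasonal = np.tile(np.sin(2 * np.pi * np.arange(12) / 12), n // 12)
y = seasonal + rng.normal(0, 0.3, n)

lags = 24
print("ACF :", np.round(acf(y, nlags=lags)[[1, 6, 12, 24]], 2))   # strong at lags 12 and 24
print("PACF:", np.round(pacf(y, nlags=lags)[[1, 6, 12, 24]], 2))  # intermediate lags partialled out
```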
Removing serial dependency. Serial dependency for a particular lag k can be removed by differencing the series, that is, converting each i'th element of the series into its difference from the (i-k)'th element. There are two major reasons for such transformations.
First, one can identify the hidden nature of seasonal dependencies in the series. Remember that, as mentioned in the previous paragraph, autocorrelations for consecutive lags are interdependent. Therefore, removing some of the autocorrelations will change other autocorrelations, that is, it may eliminate them or it may make some other seasonalities more apparent.
The other reason for removing seasonal dependencies is to make the series stationary which is necessary for ARIMA and other techniques.
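A small sketch of differencing with pandas: a lag-1 difference removes a linear trend, and a lag-12 difference removes a yearly seasonal pattern from synthetic monthly data.

```python
# Remove trend and seasonality by differencing a synthetic monthly series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 120
trend = 0.5 * np.arange(n)
seasonal = np.tile(10 * np.sin(2 * np.pi * np.arange(12) / 12), n // 12)
y = pd.Series(trend + seasonal + rng.normal(0, 1, n))

detrended = y.diff(1)      # each element minus the previous one (removes the trend)
deseasoned = y.diff(12)    # each element minus the value 12 months earlier (removes the seasonal pattern)

print("variance of the original series:", round(y.var(), 1))
print("after lag-1 differencing:  ", round(detrended.dropna().var(), 1))
print("after lag-12 differencing: ", round(deseasoned.dropna().var(), 1))
```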
Forecasting Analysis Models - Said Wahyudhi Berry Murja 15406028
FORECASTING ANALYSIS MODELS
The forecasting methods most often used are shown in the figure below (Makridakis, 1999).
[Figure: Forecasting system methods (Makridakis, 1999)]
1. Time Series Methods
These forecasting methods use a time series as the basis of the forecast. Actual historical data for the quantity to be forecast are needed in order to identify the data pattern, which then determines the appropriate forecasting method. Some time series methods are as follows:
• ARIMA (Autoregressive Integrated Moving Average) essentially uses a time series function; the method requires a model identification step and initial estimates of its parameters. Examples: forecasting foreign exchange rates or movements of the IHSG stock index (a sketch follows this list).
• The Kalman filter is widely used in systems engineering to separate a signal from the noise entering a system. The method uses a state-space model and assumes Gaussian white noise.
• Bayesian methods use a state space based on a dynamic linear model. Examples: diagnosing a disease from symptom data (hypertension or heart disease), recognising colours from RGB colour index features, or detecting skin (skin detection) from chrominance colour features.
• Smoothing methods are used to reduce seasonal irregularities in the data by averaging out past data.
• Regression uses dummy variables in its mathematical formulation. Example: forecasting the sales of a product from its price.
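As flagged in the ARIMA entry above, here is a minimal sketch using statsmodels; the "exchange rate" series is simulated random-walk data, not real market data.

```python
# Fit a small ARIMA(1,1,1) model to a simulated exchange-rate-like series
# and forecast a few steps ahead. The data are simulated for illustration only.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(6)
rate = pd.Series(10000 + np.cumsum(rng.normal(0, 25, 300)))  # random-walk-like series

model = ARIMA(rate, order=(1, 1, 1))   # p=1 AR term, d=1 difference, q=1 MA term
fitted = model.fit()
print(fitted.params)                   # estimated AR, MA and variance parameters
print(fitted.forecast(steps=5))        # forecast the next five values
```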
2. Causal Methods
These methods take a cause-and-effect approach and aim to forecast future conditions by identifying and measuring several important independent variables and their influence on the dependent variable to be forecast. Three groups of causal methods are commonly used:
• Regression and correlation methods use the least squares technique. These methods are often used for short-term prediction. Example: forecasting the relationship between the amount of credit extended and current accounts, deposits and public savings (a least-squares sketch follows this list).
• Econometric methods are based on regression equations estimated simultaneously. They are often used for national economic planning in the short and long term. Example: forecasting the values of monetary indicators several years ahead, as Bank Indonesia does each year.
• Input-output methods are usually used for long-term national economic planning. Example: forecasting economic growth, such as gross domestic product (GDP), for several periods 5-10 years ahead.
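As flagged in the regression entry above, here is a least-squares sketch with statsmodels; all the bank figures are simulated for illustration only.

```python
# Regress total credit extended on deposits and savings by ordinary least squares.
# All figures are simulated; only the mechanics of the method are illustrated.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 60
deposits = rng.normal(500, 50, n)
savings = rng.normal(300, 30, n)
credit = 0.6 * deposits + 0.3 * savings + rng.normal(0, 20, n)

X = sm.add_constant(pd.DataFrame({"deposits": deposits, "savings": savings}))
model = sm.OLS(credit, X).fit()        # least squares fit
print(model.params)                    # intercept and the two slope estimates
```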
In brief, three stages must be followed when designing a forecasting method:
a. Analyse the historical data. This step aims to obtain a picture of the pattern of the data in question;
b. Choose the method to be used. Various methods are available for different purposes. Different methods will produce different predictions for the same data. In general, a successful method is one that produces the smallest possible error between the prediction and what actually happens;
c. Transform the historical data using the chosen method and, where necessary, adjust it to the requirements. According to John E. Hanke and Arthur G. Reitch (1995), forecasting methods can be divided into two groups:
Qualitative or subjective forecasting methods:
"Qualitative forecasting techniques relied on human judgments and intuition more than manipulation of past historical data," that is, methods based solely on judgement and intuition rather than on the processing of historical data.
Quantitative forecasting methods
Quantitative forecasting, on the other hand, is described as: "quantitative techniques that need no input of judgments; they are mechanical procedures that produce quantitative result and some quantitative procedures require a much more sophisticated manipulation of data than do other, of course". De Lurgio (1998) illustrates the types of forecasting methods as in the figure below:
[Figure: Types of forecasting methods (De Lurgio, 1998)]
(source: http://www.ittelkom.ac.id/library/index.php?view=article&catid=25%3Aindustri&id=258%3Ametode-peramalan-forecasting-method&option=com_content&Itemid=15)
Metode sistem peramalan yang sering digunakan dapat dilihat pada gambar di bawah ini (Makrdakis, 1999)
METODE SISTEM PERAMALAN (Maksdakis, 1999)
1. Metode Deret Waktu (Time series Method)
Metode peramalan ini menggunakan deret waktu (time series) sebagai dasar peramalan.perlukan data aktual lalu yang akan diramalkan untuk mengetahui pola data yang diperlukan untuk menentukan metode peramalan yang sesuai. Beberapa metode dalam time series yaitu sebagai berikut:
• ARIMA (Autoregressive Integrated Moving Average) pada dasarnya menggunakan fungsi deret waktu, metode ini memerlukan pendekatan model identification serta penaksiran awal dari paramaternya. Sebagai contoh: peramalan nilai tukar mata uang asing, pergerakan nilai IHSG.
• Kalman Filter banyak digunakan pada bidang rekayasa sistem untuk memisahkan sinyal dari noise yang masuk ke sistem. Metoda ini menggunakan pendekatan model state space dengan asumsi white noise memiliki distribusi Gaussian.
• Bayesian merupakan metode yang menggunakan state space berdasarkan model dinamis linear (dynamical linear model). Sebagai contoh: menentukan diagnosa suatu penyakit berdasarkan data-data gejala (hipertensi atau sakit jantung), mengenali warna berdasarkan fitur indeks warna RGB, mendeteksi warna kulit (skin detection) berdasarkan fitur warna chrominant.
• Metode smoothing dipakai untuk mengurangi ketidakteraturan data yang bersifat musiman dengan cara membuat keseimbangan rata-rata dari data masa lampau.
• Regresi menggunakan dummy variabel dalam formulasi matematisnya. Sebagai contoh: kemampuan dalam meramal sales suatu produk berdasarkan harganya.
2. Metode Kausal
Metode ini menggunakan pendekatan sebab-akibat, dan bertujuan untuk meramalkan keadaan di masa yang akan datang dengan menemukan dan mengukur beberapa variabel bebas (independen) yang penting beserta pengaruhnya terhadap variabel tidak bebas yang akan diramalkan. Pada metode kausal terdapat tiga kelompok metode yang sering dipakai yaitu :
• Regression and correlation methods use the least squares technique and are often applied to short-term prediction. Example: forecasting the relationship between the amount of credit extended and the public's current accounts, deposits and savings (see the least squares sketch after this list).
• Econometric methods are based on systems of regression equations estimated simultaneously, and are often used for national economic planning over both the short and the long term. Example: forecasting monetary indicators several years ahead, as Bank Indonesia (BI) does each year.
• Input-output methods are typically used for long-term national economic planning. Example: forecasting economic growth, such as gross domestic product (GDP), for the next 5-10 years.
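The least squares technique behind the regression and correlation methods can be sketched in a few lines of Python. The deposit and credit figures below are hypothetical, as are the function and variable names; the sketch only illustrates the mechanics of fitting y = a + b·x by ordinary least squares.

```python
# Minimal sketch: ordinary least squares fit of y = a + b*x (hypothetical data).

def least_squares(x, y):
    """Return intercept a and slope b minimising the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
    a = mean_y - b * mean_x
    return a, b

if __name__ == "__main__":
    deposits = [10, 12, 15, 18, 22, 25]       # hypothetical public deposits (independent variable)
    credit   = [8, 10, 13, 15, 19, 21]        # hypothetical credit extended (dependent variable)
    a, b = least_squares(deposits, credit)
    print(f"credit ≈ {a:.2f} + {b:.2f} * deposits")
    print("forecast for deposits = 30:", a + b * 30)
```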
Monday, 09 February 2009
factor analysis by Novian_15407025
Factor analysis is used to uncover the latent structure (dimensions) of a set of variables. It reduces attribute space from a larger number of variables to a smaller number of factors and as such is a "non-dependent" procedure (that is, it does not assume a dependent variable is specified). Factor analysis could be used for any of the following purposes:
To reduce a large number of variables to a smaller number of factors for modeling purposes, where the large number of variables precludes modeling all the measures individually. As such, factor analysis is integrated in structural equation modeling (SEM), helping confirm the latent variables modeled by SEM. However, factor analysis can be and is often used on a stand-alone basis for similar purposes.
To establish that multiple tests measure the same factor, thereby giving justification for administering fewer tests. Factor analysis originated a century ago with Charles Spearman's attempts to show that a wide variety of mental tests could be explained by a single underlying intelligence factor (a notion now rejected, by the way).
To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items which cross-load on more than one factor.
To select a subset of variables from a larger set, based on which original variables have the highest correlations with the principal component factors.
To create a set of factors to be treated as uncorrelated variables as one approach to handling multicollinearity in such procedures as multiple regression
To identify clusters of cases and/or outliers.
To determine network groups by determining which sets of people cluster together (using Q-mode factor analysis, discussed below)
A non-technical analogy: A mother sees various bumps and shapes under a blanket at the bottom of a bed. When one shape moves toward the top of the bed, all the other bumps and shapes move toward the top also, so the mother concludes that what is under the blanket is a single thing, most likely her child. Similarly, factor analysis takes as input a number of measures and tests, analogous to the bumps and shapes. Those that move together are considered a single thing, which it labels a factor. That is, in factor analysis the researcher is assuming that there is a "child" out there in the form of an underlying factor, and he or she takes simultaneous movement (correlation) as evidence of its existence. If correlation is spurious for some reason, this inference will be mistaken, of course, so it is important when conducting factor analysis that possible variables which might introduce spuriousness, such as anteceding causes, be included in the analysis and taken into account.
Factor analysis is part of the general linear model (GLM) family of procedures and makes many of the same assumptions as multiple regression: linear relationships, interval or near-interval data, untruncated variables, proper specification (relevant variables included, extraneous ones excluded), lack of high multicollinearity, and multivariate normality for purposes of significance testing. Factor analysis generates a table in which the rows are the observed raw indicator variables and the columns are the factors or latent variables which explain as much of the variance in these variables as possible. The cells in this table are factor loadings, and the meaning of the factors must be induced from seeing which variables are most heavily loaded on which factors. This inferential labeling process can be fraught with subjectivity as diverse researchers impute different labels.
There are several different types of factor analysis, with the most common being principal components analysis (PCA), which is preferred for purposes of data reduction. However, common factor analysis is preferred for purposes of causal analysis and for confirmatory factor analysis in structural equation modeling, among other settings.
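As a rough, stand-alone illustration of how factor loadings arise, the following Python/NumPy sketch simulates four test scores driven by two latent factors and extracts principal-component loadings from their correlation matrix. The variable names and the two-factor structure are assumptions made for the example, not part of the text above.

```python
# Minimal sketch: principal-component factor loadings from a correlation matrix.
# The simulated "test scores" and the two-factor structure are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 500
verbal = rng.normal(size=n)                 # latent "verbal" factor
spatial = rng.normal(size=n)                # latent "spatial" factor
X = np.column_stack([
    verbal + 0.3 * rng.normal(size=n),      # vocabulary test
    verbal + 0.3 * rng.normal(size=n),      # reading test
    spatial + 0.3 * rng.normal(size=n),     # rotation test
    spatial + 0.3 * rng.normal(size=n),     # maze test
])

R = np.corrcoef(X, rowvar=False)            # correlation matrix of the observed variables
eigvals, eigvecs = np.linalg.eigh(R)        # eigen-decomposition (ascending order)
order = np.argsort(eigvals)[::-1]           # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                       # number of factors to retain
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])   # loading = eigenvector * sqrt(eigenvalue)
print("variance explained:", np.round(eigvals[:k] / eigvals.sum(), 2))
print("factor loadings (rows = variables, columns = factors):")
print(np.round(loadings, 2))
```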
quantitative analyst by maylano_15407109
A quantitative analyst is a person who works in finance using numerical or quantitative techniques. Similar work is done in most other modern industries, but the work is not called quantitative analysis. In the investment industry, people who perform quantitative analysis are frequently called quants.
Although the original quants were concerned with risk management and derivatives pricing, the meaning of the term has expanded over time to include those individuals involved in almost any application of mathematics in finance. An example is statistical arbitrage.
History
Robert C. Merton, a pioneer of quantitative analysis, introduced stochastic calculus into the study of finance.
Quantitative finance started in the U.S. in the 1930s as some astute investors began using mathematical formulae to price stocks and bonds.
Harry Markowitz's 1952 Ph.D thesis "Portfolio Selection" was one of the first papers to formally adapt mathematical concepts to finance. Markowitz formalized a notion of mean return and covariances for common stocks which allowed him to quantify the concept of "diversification" in a market. He showed how to compute the mean return and variance for a given portfolio and argued that investors should hold only those portfolios whose variance is minimal among all portfolios with a given mean return. Although the language of finance now involves Itō calculus, minimization of risk in a quantifiable manner underlies much of the modern theory.
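A minimal sketch of the bookkeeping Markowitz formalised, computing a portfolio's mean return and variance from a mean-return vector and a covariance matrix, follows; all of the numbers are hypothetical.

```python
# Minimal sketch: mean return and variance of a portfolio (Markowitz-style bookkeeping).
# Expected returns, covariances and weights are hypothetical.
import numpy as np

mu = np.array([0.08, 0.12, 0.10])               # expected annual returns of three stocks
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.06]])            # covariance matrix of returns
w = np.array([0.5, 0.3, 0.2])                   # portfolio weights (sum to 1)

portfolio_return = w @ mu                       # w' mu
portfolio_variance = w @ cov @ w                # w' Sigma w
print(f"expected return: {portfolio_return:.4f}")
print(f"variance: {portfolio_variance:.4f}, std dev: {portfolio_variance ** 0.5:.4f}")
```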
In 1969 Robert Merton introduced stochastic calculus into the study of finance. Merton was motivated by the desire to understand how prices are set in financial markets, which is the classical economics question of "equilibrium," and in later papers he used the machinery of stochastic calculus to begin investigation of this issue.
At the same time as Merton's work and with Merton's assistance, Fischer Black and Myron Scholes were developing their option pricing formula, work for which Scholes and Merton were awarded the 1997 Nobel Memorial Prize in Economics. It provided a solution for a practical problem, that of finding a fair price for a European call option, i.e., the right to buy one share of a given stock at a specified price and time. Such options are frequently purchased by investors as a risk-hedging device. In 1981, Harrison and Pliska used the general theory of continuous-time stochastic processes to put the Black-Scholes option pricing formula on a solid theoretical basis, and as a result, showed how to price numerous other "derivative" securities.
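The Black-Scholes price of a European call can be written down directly from the published formula; the short Python sketch below uses only the standard library (math.erf for the normal CDF), and the spot, strike, rate and volatility inputs are hypothetical.

```python
# Minimal sketch: Black-Scholes price of a European call option.
# C = S*N(d1) - K*exp(-r*T)*N(d2); the inputs below are hypothetical.
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """European call: spot S, strike K, maturity T (years), risk-free rate r, volatility sigma."""
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

if __name__ == "__main__":
    print(f"call price: {black_scholes_call(S=100, K=105, T=0.5, r=0.03, sigma=0.25):.4f}")
```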
Education
Quants often come from physics or mathematics backgrounds rather than finance related fields, and quants are a major source of employment for people with physics and mathematics Ph.D's. Typically, a quant will also need extensive skills in computer programming.
This demand for quants has led to the creation of specialized Masters and PhD courses in mathematical finance, computational finance, and/or financial reinsurance. In particular, Masters degrees in financial engineering and financial analysis are becoming more popular with students and with employers. London's Cass Business School was the pioneer of quantitative finance programs in Europe, with its MSc Quantitative Finance as well as the MSc Financial Mathematics and MSc Mathematical Trading and Finance programs providing some leading global research. Carnegie Mellon's Tepper School of Business, which created the Masters degree in financial engineering, reported a 21% increase in applicants to their MS in Computational Finance program, which is on top of a 48% increase in the year before.[1] These Masters level programs are generally one year in length and more focused than the broader MBA degree.
Front Office Quant
Within Banking, quants are employed to support trading and sales functions. At the very simple level Banks buy and sell investment products known as Stocks (Equity) and Bonds (Debt). They can gain a good idea of a fair price to charge for these because they are liquid instruments (many people are buying and selling them) and thus they are governed by the market principles of supply and demand – the lower your price the more people will buy from you, the higher your price the more people will sell to you. Over the last 30 years a massive industry in derivative securities has developed as the risk preferences and profiles of customers have matured. The idiosyncratic, customised nature of many of these products can make them relatively illiquid and hence there are no handy market prices available. The products are managed, that is, actualised, priced and hedged, by means of financial models. The models are implemented as software and then embedded in front-office risk management systems. The role of the quant is to develop these models.
Mathematical and statistical approaches
According to Fund of Funds analyst Fred Gehm, "There are two types of quantitative analysis and, therefore, two types of quants. One type works primarily with mathematical models and the other primarily with statistical models. While there is no logical reason why one person can't do both kinds of work, this doesn’t seem to happen, perhaps because these types demand different skill sets and, much more important, different psychologies.[2]"
A typical problem for a numerically oriented quantitative analyst would be to develop a model for pricing and managing a complex derivative product.
A typical problem for a statistically oriented quantitative analyst would be to develop a model for deciding which stocks are relatively expensive and which stocks are relatively cheap. The model might include a company's book value to price ratio, its trailing earnings to price ratio and other accounting factors. An investment manager might implement this analysis by buying the underpriced stocks, selling the overpriced stocks or both.
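The following toy sketch mimics the statistical quant's problem just described: rank stocks on an equal-weight composite of book-to-price and earnings-to-price and flag the extremes as candidate buys and sells. The tickers, ratios and the 50/50 weighting are invented for illustration; a real model would be far richer.

```python
# Minimal sketch: rank stocks on a composite value score (hypothetical tickers and ratios).

stocks = {
    #  ticker: (book value / price, trailing earnings / price)
    "AAA": (0.9, 0.07),
    "BBB": (0.4, 0.03),
    "CCC": (1.2, 0.10),
    "DDD": (0.6, 0.05),
}

def value_score(book_to_price, earnings_to_price):
    """Equal-weight composite: a higher score means 'cheaper' on these accounting factors."""
    return 0.5 * book_to_price + 0.5 * earnings_to_price

ranked = sorted(stocks, key=lambda t: value_score(*stocks[t]), reverse=True)
print("candidate buys (look cheap):", ranked[:2])
print("candidate sells (look expensive):", ranked[-2:])
```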
One of the principal mathematical tools of quantitative finance is stochastic calculus.
According to a July 2008 Aite Group report, today quants often use alpha generation platforms to help them develop financial models. These software solutions enable quants to centralize and streamline the alpha generation process.[3]
Cluster Analysis - Destri Ayu Pratiwi 15406070
source : http://www.statsoft.com/textbook/stcluan.html
Cluster Analysis
* General Purpose
* Statistical Significance Testing
* Area of Application
* Joining (Tree Clustering)
o Hierarchical Tree
o Distance Measures
o Amalgamation or Linkage Rules
* Two-way Joining
o Introductory Overview
o Two-way Joining
* k-Means Clustering
o Example
o Computations
o Interpretation of results
* EM (Expectation Maximization) Clustering
o Introductory Overview
o The EM Algorithm
* Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation
General Purpose
The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation/interpretation. In other words, cluster analysis simply discovers structures in data without explaining why they exist.
We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores, items of similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations. There is a countless number of examples in which clustering plays an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation the less similar are the members in the respective class. Man has more in common with all other primates (e.g., apes) than he does with the more "distant" members of the mammals (e.g., dogs), etc. For a review of the general categories of cluster analysis methods, see Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering. In short, whatever the nature of your business is, sooner or later you will run into a clustering problem of one form or another.
Statistical Significance Testing
Note that the above discussions refer to clustering algorithms and do not mention anything about statistical significance testing. In fact, cluster analysis is not as much a typical statistical test as it is a "collection" of different algorithms that "put objects into clusters according to well defined similarity rules." The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here, even in cases when p-levels are reported (as in k-means clustering).
Area of Application
Clustering techniques have been applied to a wide variety of research problems. Hartigan (1975) provides an excellent summary of the many published studies reporting the results of cluster analyses. For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever one needs to classify a "mountain" of information into manageable meaningful piles, cluster analysis is of great utility.
Joining (Tree Clustering)
* Hierarchical Tree
* Distance Measures
* Amalgamation or Linkage Rules
General Logic
The example in the General Purpose Introduction illustrates the goal of the joining or tree clustering algorithm. The purpose of this algorithm is to join together objects (e.g., animals) into successively larger clusters, using some measure of similarity or distance. A typical result of this type of clustering is the hierarchical tree.
Hierarchical Tree
Consider a Horizontal Hierarchical Tree Plot (see graph below). On the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster.
As a result we link more and more objects together and aggregate (amalgamate) larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together. In these plots, the horizontal axis denotes the linkage distance (in Vertical Icicle Plots, the vertical axis denotes the linkage distance). Thus, for each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. When the data contain a clear "structure" in terms of clusters of objects that are similar to each other, then this structure will often be reflected in the hierarchical tree as distinct branches. As the result of a successful analysis with the joining method, one is able to detect clusters (branches) and interpret those branches.
Distance Measures
The joining or tree clustering method uses the dissimilarities (similarities) or distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items. In the previous example the rule for grouping a number of diners was whether they shared the same table or not. These distances (similarities) can be based on a single dimension or multiple dimensions, with each dimension representing a rule or condition for grouping objects. For example, if we were to cluster fast foods, we could take into account the number of calories they contain, their price, subjective ratings of taste, etc. The most straightforward way of computing distances between objects in a multi-dimensional space is to compute Euclidean distances. If we had a two- or three-dimensional space this measure is the actual geometric distance between objects in the space (i.e., as if measured with a ruler). However, the joining algorithm does not "care" whether the distances that are "fed" to it are actual real distances, or some other derived measure of distance that is more meaningful to the researcher; and it is up to the researcher to select the right method for his/her specific application.
Euclidean distance. This is probably the most commonly chosen type of distance. It simply is the geometric distance in the multidimensional space. It is computed as:
distance(x,y) = { Σi (xi - yi)² }½
Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected (i.e., biased by those dimensions which have a larger scale), and consequently, the results of cluster analyses may be very different. Generally, it is good practice to transform the dimensions so they have similar scales.
Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as (see also the note in the previous paragraph):
distance(x,y) = Σi (xi - yi)²
City-block (Manhattan) distance. This distance is simply the average difference across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:
distance(x,y) = Σi |xi - yi|
Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as:
distance(x,y) = Maximum|xi - yi|
Power distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance. The power distance is computed as:
distance(x,y) = ( Σi |xi - yi|^p )^(1/r)
where r and p are user-defined parameters. A few example calculations may demonstrate how this measure "behaves." Parameter p controls the progressive weight that is placed on differences on individual dimensions, parameter r controls the progressive weight that is placed on larger differences between objects. If r and p are equal to 2, then this distance is equal to the Euclidean distance.
Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as:
distance(x,y) = (number of xi ≠ yi) / i
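The distance measures above translate almost one-for-one into code; the plain-Python sketch below implements each of them for two equal-length observation vectors and is not taken from any particular statistics package.

```python
# Minimal sketch: the distance measures described above, for two equal-length vectors.

def euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def squared_euclidean(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def city_block(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def chebychev(x, y):
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def power_distance(x, y, p, r):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / r)

def percent_disagreement(x, y):
    return sum(xi != yi for xi, yi in zip(x, y)) / len(x)

if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 3.0]
    print(euclidean(a, b), squared_euclidean(a, b), city_block(a, b), chebychev(a, b))
    print(power_distance(a, b, p=2, r=2))            # with p = r = 2 this equals the Euclidean distance
    print(percent_disagreement(["red", "blue"], ["red", "green"]))
```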
Amalgamation or Linkage Rules
At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters? In other words, we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar to be linked together. There are various possibilities: for example, we could link two clusters together when any two objects in the two clusters are closer together than the respective linkage distance. Put another way, we use the "nearest neighbors" across clusters to determine the distances between clusters; this method is called single linkage. This rule produces "stringy" types of clusters, that is, clusters "chained together" by only single objects that happen to be close together. Alternatively, we may use the neighbors across clusters that are furthest away from each other; this method is called complete linkage. There are numerous other linkage rules such as these that have been proposed.
Single linkage (nearest neighbor). As described above, in this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."
Complete linkage (furthest neighbor). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.
Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps," however, it performs equally well with elongated, "chain" type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages.
Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations, the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous method) should be used when the cluster sizes are suspected to be greatly uneven. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.
Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In a sense, it is the center of gravity for the respective cluster. In this method, the distance between two clusters is determined as the difference between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.
Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.
Ward's method. This method is distinct from all other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step. Refer to Ward (1963) for details concerning this method. In general, this method is regarded as very efficient, however, it tends to create clusters of small size.
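If SciPy is available, the joining method can be run with each of the amalgamation rules just described. The sketch below assumes scipy.cluster.hierarchy's linkage and fcluster functions and uses simulated two-dimensional data with two obvious clumps.

```python
# Minimal sketch: tree clustering with different linkage (amalgamation) rules.
# Data are hypothetical; assumes SciPy's scipy.cluster.hierarchy module is available.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two loose "clumps" of points in two dimensions.
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(3.0, 0.5, size=(10, 2))])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                      # build the hierarchical tree
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, labels)
```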
For an overview of the other two methods of clustering, see Two-way Joining and k-Means Clustering.
Two-way Joining
* Introductory Overview
* Two-way Joining
Introductory Overview
Previously, we have discussed this method in terms of "objects" that are to be clustered (see Joining (Tree Clustering)). In all other types of analyses the research question of interest is usually expressed in terms of cases (observations) or variables. It turns out that the clustering of both may yield useful results. For example, imagine a study where a medical researcher has gathered data on different measures of physical fitness (variables) for a sample of heart patients (cases). The researcher may want to cluster cases (patients) to detect clusters of patients with similar syndromes. At the same time, the researcher may want to cluster variables (fitness measures) to detect clusters of measures that appear to tap similar physical abilities.
Two-way Joining
Given the discussion in the paragraph above concerning whether to cluster cases or variables, one may wonder why not cluster both simultaneously? Two-way joining is useful in (the relatively rare) circumstances when one expects that both cases and variables will simultaneously contribute to the uncovering of meaningful patterns of clusters.
For example, returning to the example above, the medical researcher may want to identify clusters of patients that are similar with regard to particular clusters of similar measures of physical fitness. The difficulty with interpreting these results may arise from the fact that the similarities between different clusters may pertain to (or be caused by) somewhat different subsets of variables. Thus, the resulting structure (clusters) is by nature not homogeneous. This may seem a bit confusing at first, and, indeed, compared to the other clustering methods described (see Joining (Tree Clustering) and k-Means Clustering), two-way joining is probably the one least commonly used. However, some researchers believe that this method offers a powerful exploratory data analysis tool (for more information you may want to refer to the detailed description of this method in Hartigan, 1975).
k-Means Clustering
* Example
* Computations
* Interpretation of results
General logic
This method of clustering is very different from the Joining (Tree Clustering) and Two-way Joining methods. Suppose that you already have hypotheses concerning the number of clusters in your cases or variables. You may want to "tell" the computer to form exactly 3 clusters that are to be as distinct as possible. This is the type of research question that can be addressed by the k-means clustering algorithm. In general, the k-means method will produce exactly k different clusters of greatest possible distinction. It should be mentioned that the best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data (see Finding the Right Number of Clusters).
Example
In the physical fitness example (see Two-way Joining), the medical researcher may have a "hunch" from clinical experience that her heart patients fall basically into three different categories with regard to physical fitness. She might wonder whether this intuition can be quantified, that is, whether a k-means cluster analysis of the physical fitness measures would indeed produce the three clusters of patients as expected. If so, the means on the different measures of physical fitness for each cluster would represent a quantitative way of expressing the researcher's hypothesis or intuition (i.e., patients in cluster 1 are high on measure 1, low on measure 2, etc.).
Computations
Computationally, you may think of this method as analysis of variance (ANOVA) "in reverse." The program will start with k random clusters, and then move objects between those clusters with the goal to 1) minimize variability within clusters and 2) maximize variability between clusters. In other words, the similarity rules will apply maximally to the members of one cluster and minimally to members belonging to the rest of the clusters. This is analogous to "ANOVA in reverse" in the sense that the significance test in ANOVA evaluates the between group variability against the within-group variability when computing the significance test for the hypothesis that the means in the groups are different from each other. In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.
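A minimal NumPy sketch of the iteration just described, assigning each case to its nearest centre and then recomputing the centres, is shown below. The simulated data and the function name k_means are assumptions of the example; the code mirrors the general logic rather than any particular program.

```python
# Minimal sketch: k-means (Lloyd's algorithm) on simulated 2-D data.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # start with k random cases as centres
    for _ in range(n_iter):
        # Assignment step: each case joins its nearest centre (reduces within-cluster variability).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centre as the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(m, 0.4, size=(30, 2)) for m in (0.0, 3.0, 6.0)])
    labels, centers = k_means(X, k=3)
    print("cluster means:\n", np.round(centers, 2))
```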
Interpretation of results
Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all dimensions, used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.
EM (Expectation Maximization) Clustering
* Introductory Overview
* The EM Algorithm
Introductory Overview
The methods described here are similar to the k-Means algorithm described above, and you may want to review that section for a general overview of these techniques and their applications. The general purpose of these techniques is to detect clusters in observations (or variables) and to assign those observations to the clusters. A typical example application for this type of analysis is a marketing research study in which a number of consumer behavior related variables are measured for a large sample of respondents. The purpose of the study is to detect "market segments," i.e., groups of respondents that are somehow more similar to each other (to all other members of the same cluster) when compared to respondents that "belong to" other clusters. In addition to identifying such clusters, it is usually equally of interest to determine how the clusters are different, i.e., determine the specific variables or dimensions that vary and how they vary in regard to members in different clusters.
k-means clustering. To reiterate, the classic k-Means algorithm was popularized and refined by Hartigan (1975; see also Hartigan and Wong, 1978). The basic operation of that algorithm is relatively simple: Given a fixed number of (desired or hypothesized) k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible.
Extensions and generalizations. The EM (expectation maximization) algorithm extends this basic approach to clustering in two important ways:
1. Instead of assigning cases or observations to clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.
2. Unlike the classic implementation of k-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic k-means algorithm can also be modified to accommodate categorical variables).
The EM Algorithm
The EM algorithm for clustering is described in detail in Witten and Frank (2001). The basic approach and logic of this clustering method is as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each sample, the distribution of values for the continuous variable follows the normal distribution. The resulting distribution of values (in the population) may look like this:
Mixtures of distributions. The illustration shows two normal distributions with different means and different standard deviations, and the sum of the two distributions. Only the mixture (sum) of the two normal distributions (with different means and standard deviations) would be observed. The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data (distribution). Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters.
With the implementation of the EM algorithm in some computer programs, you may be able to select (for continuous variables) different distributions such as the normal, log-normal, and Poisson distributions. You can select different distributions for different variables and, thus, derive clusters for mixtures of different types of distributions.
Categorical variables. The EM algorithm can also accommodate categorical variables. The method will at first randomly assign different probabilities (weights, to be precise) to each class or category, for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the likelihood of the data given the specified number of clusters.
Classification probabilities instead of classifications. The results of EM clustering are different from those computed by k-means clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. Of course, as a final result you can usually review an actual assignment of observations to clusters, based on the (largest) classification probability.
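A compact sketch of the EM iteration for the single-variable, two-cluster Gaussian case described above follows (Python/NumPy, simulated data). It returns classification probabilities (responsibilities) rather than hard assignments, as the text explains; the starting values and iteration count are arbitrary choices for the example.

```python
# Minimal sketch: EM for a two-component Gaussian mixture on one continuous variable.
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(x, n_iter=200):
    # Crude starting values: split the sample around its quartiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = np.array([x.std(), x.std()])
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: classification probabilities (responsibilities) for each observation.
        dens = weight * np.column_stack([normal_pdf(x, mu[j], sigma[j]) for j in (0, 1)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update means, standard deviations and mixing weights.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        weight = nk / len(x)
    return mu, sigma, weight, resp

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 2, 200)])  # mixture of two normals
    mu, sigma, weight, resp = em_two_gaussians(x)
    print("means:", np.round(mu, 2), "std devs:", np.round(sigma, 2), "weights:", np.round(weight, 2))
    print("cluster probabilities for the first observation:", np.round(resp[0], 3))
```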
Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation
An important question that needs to be answered before applying the k-means or EM clustering algorithms is how many clusters there are in the data. This is not known a priori and, in fact, there might be no definite or unique answer as to what value k should take. In other words, k is a nuisance parameter of the clustering model. Luckily, an estimate of k can be obtained from the data using the method of cross-validation. Remember that the k-means and EM methods will determine cluster solutions for a particular user-defined number of clusters. The k-means and EM clustering techniques (described above) can be optimized and enhanced for typical applications in data mining. The general metaphor of data mining implies the situation in which an analyst searches for useful structures and "nuggets" in the data, usually without any strong a priori expectations of what the analyst might find (in contrast to the hypothesis-testing approach of scientific research). In practice, the analyst usually does not know ahead of time how many clusters there might be in the sample. For that reason, some programs include an implementation of a v-fold cross-validation algorithm for automatically determining the number of clusters in the data.
This unique algorithm is immensely useful in all general "pattern-recognition" tasks - to determine the number of market segments in a marketing research study, the number of distinct spending patterns in studies of consumer behavior, the number of clusters of different medical symptoms, the number of different types (clusters) of documents in text mining, the number of weather patterns in meteorological research, the number of defect patterns on silicon wafers, and so on.
The v-fold cross-validation algorithm applied to clustering. The v-fold cross-validation algorithm is described in some detail in Classification Trees and General Classification and Regression Trees (GC&RT). The general idea of this method is to divide the overall sample into a number of v folds. The same type of analysis is then successively applied to the observations belonging to the v-1 folds (training sample), and the results of the analyses are applied to sample v (the sample or fold that was not used to estimate the parameters, build the tree, determine the clusters, etc.; this is the testing sample) to compute some index of predictive validity. The results for the v replications are aggregated (averaged) to yield a single measure of the stability of the respective model, i.e., the validity of the model for predicting new observations.
Cluster analysis is an unsupervised learning technique, and we cannot observe the (real) number of clusters in the data. However, it is reasonable to replace the usual notion (applicable to supervised learning) of "accuracy" with that of "distance." In general, we can apply the v-fold cross-validation method to a range of numbers of clusters in k-means or EM clustering, and observe the resulting average distance of the observations (in the cross-validation or testing samples) from their cluster centers (for k-means clustering); for EM clustering, an appropriate equivalent measure would be the average negative (log-) likelihood computed for the observations in the testing samples.
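A rough sketch of this v-fold procedure for k-means follows: for each candidate number of clusters, fit on v−1 folds and score the held-out fold by the average distance of its observations to the nearest fitted centre. The data are simulated, the compact fit_kmeans helper is an assumption of the example, and no particular software implementation is implied.

```python
# Minimal sketch: v-fold cross-validation for choosing the number of clusters in k-means.
# Simulated data; the index is the average distance of held-out cases to their nearest centre.
import numpy as np

def fit_kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return centers

def cv_score(X, k, v=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(X)) % v                      # assign each case to one of v folds
    scores = []
    for fold in range(v):
        train, test = X[folds != fold], X[folds == fold]
        centers = fit_kmeans(train, k)
        dists = np.linalg.norm(test[:, None] - centers[None], axis=2).min(axis=1)
        scores.append(dists.mean())                          # average distance to nearest centre
    return np.mean(scores)

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(m, 0.4, size=(50, 2)) for m in (0.0, 3.0, 6.0)])   # three true clusters
    for k in range(1, 7):
        print(f"k={k}: cross-validated average distance = {cv_score(X, k):.3f}")
```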
Reviewing the results of v-fold cross-validation. The results of v-fold cross-validation are best reviewed in a simple line graph.
Shown here is the result of analyzing a data set widely known to contain three clusters of observations (specifically, the well-known Iris data file reported by Fisher, 1936, and widely referenced in the literature on discriminant function analysis). Also shown (in the graph to the right) are the results for analyzing simple normal random numbers. The "real" data (shown to the left) exhibit the characteristic scree-plot pattern (see also Factor Analysis), where the cost function (in this case, -2 times the log-likelihood of the cross-validation data, given the estimated parameters) quickly decreases as the number of clusters increases, but then (past 3 clusters) levels off, and even increases as the data are overfitted. In contrast, the random numbers show no such pattern; in fact, there is basically no decrease in the cost function at all, and it quickly begins to increase as the number of clusters increases and overfitting occurs.
It is easy to see from this simple illustration how useful the v-fold cross-validation technique, applied to k-means and EM clustering, can be for determining the "right" number of clusters in the data.
Cluster Analysis
* General Purpose
* Statistical Significance Testing
* Area of Application
* Joining (Tree Clustering)
o Hierarchical Tree
o Distance Measures
o Amalgamation or Linkage Rules
* Two-way Joining
o Introductory Overview
o Two-way Joining
* k-Means Clustering
o Example
o Computations
o Interpretation of results
* EM (Expectation Maximization) Clustering
o Introductory Overview
o The EM Algorithm
* Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation
General Purpose
The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation/interpretation. In other words, cluster analysis simply discovers structures in data without explaining why they exist.
We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores items of similar nature, such as different types of meat or vegetables are displayed in the same or nearby locations. There is a countless number of examples in which clustering playes an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation the less similar are the members in the respective class. Man has more in common with all other primates (e.g., apes) than it does with the more "distant" members of the mammals (e.g., dogs), etc. For a review of the general categories of cluster analysis methods, see Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering. In short, whatever the nature of your business is, sooner or later you will run into a clustering problem of one form or another.
Statistical Significance Testing
Note that the above discussions refer to clustering algorithms and do not mention anything about statistical significance testing. In fact, cluster analysis is not as much a typical statistical test as it is a "collection" of different algorithms that "put objects into clusters according to well defined similarity rules." The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here, even in cases when p-levels are reported (as in k-means clustering).
Area of Application
Clustering techniques have been applied to a wide variety of research problems. Hartigan (1975) provides an excellent summary of the many published studies reporting the results of cluster analyses. For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever one needs to classify a "mountain" of information into manageable meaningful piles, cluster analysis is of great utility.
To index
Joining (Tree Clustering)
* Hierarchical Tree
* Distance Measures
* Amalgamation or Linkage Rules
General Logic
The example in the General Purpose Introduction illustrates the goal of the joining or tree clustering algorithm. The purpose of this algorithm is to join together objects (e.g., animals) into successively larger clusters, using some measure of similarity or distance. A typical result of this type of clustering is the hierarchical tree.
Hierarchical Tree
Consider a Horizontal Hierarchical Tree Plot (see graph below), on the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster.
As a result we link more and more objects together and aggregate (amalgamate) larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together. In these plots, the horizontal axis denotes the linkage distance (in Vertical Icicle Plots, the vertical axis denotes the linkage distance). Thus, for each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. When the data contain a clear "structure" in terms of clusters of objects that are similar to each other, then this structure will often be reflected in the hierarchical tree as distinct branches. As the result of a successful analysis with the joining method, one is able to detect clusters (branches) and interpret those branches.
Distance Measures
The joining or tree clustering method uses the dissimilarities (similarities) or distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items. In the previous example the rule for grouping a number of diners was whether they shared the same table or not. These distances (similarities) can be based on a single dimension or multiple dimensions, with each dimension representing a rule or condition for grouping objects. For example, if we were to cluster fast foods, we could take into account the number of calories they contain, their price, subjective ratings of taste, etc. The most straightforward way of computing distances between objects in a multi-dimensional space is to compute Euclidean distances. If we had a two- or three-dimensional space this measure is the actual geometric distance between objects in the space (i.e., as if measured with a ruler). However, the joining algorithm does not "care" whether the distances that are "fed" to it are actual real distances, or some other derived measure of distance that is more meaningful to the researcher; it is up to the researcher to select the right method for his/her specific application.
Euclidean distance. This is probably the most commonly chosen type of distance. It simply is the geometric distance in the multidimensional space. It is computed as:
distance(x,y) = { Σi (xi − yi)² }^(1/2)
Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected (i.e., biased by those dimensions which have a larger scale), and consequently, the results of cluster analyses may be very different. Generally, it is good practice to transform the dimensions so they have similar scales.
Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as (see also the note in the previous paragraph):
distance(x,y) = Σi (xi − yi)²
City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:
distance(x,y) = Σi |xi − yi|
Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as:
distance(x,y) = maxi |xi − yi|
Power distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance. The power distance is computed as:
distance(x,y) = ( Σi |xi − yi|^p )^(1/r)
where r and p are user-defined parameters. A few example calculations may demonstrate how this measure "behaves." Parameter p controls the progressive weight that is placed on differences on individual dimensions, parameter r controls the progressive weight that is placed on larger differences between objects. If r and p are equal to 2, then this distance is equal to the Euclidean distance.
Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as:
distance(x,y) = (number of xi ≠ yi) / i, where i is the total number of dimensions
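To make the formulas above concrete, here is a minimal sketch of how each distance measure could be computed, assuming NumPy is available; the vectors x and y are made up for illustration and are not data from the text.

import numpy as np

x = np.array([1.0, 4.0, 7.0])
y = np.array([2.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))        # { sum_i (x_i - y_i)^2 }^(1/2)
squared_euclidean = np.sum((x - y) ** 2)         # sum_i (x_i - y_i)^2
city_block = np.sum(np.abs(x - y))               # sum_i |x_i - y_i|
chebychev = np.max(np.abs(x - y))                # max_i |x_i - y_i|

p, r = 3, 2                                      # user-defined power-distance parameters (illustrative values)
power = np.sum(np.abs(x - y) ** p) ** (1.0 / r)  # ( sum_i |x_i - y_i|^p )^(1/r)

# Percent disagreement is intended for categorical data; here it simply
# illustrates the formula: the share of dimensions on which x_i differs from y_i.
percent_disagreement = np.mean(x != y)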
Amalgamation or Linkage Rules
At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters? In other words, we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar to be linked together. There are various possibilities: for example, we could link two clusters together when any two objects in the two clusters are closer together than the respective linkage distance. Put another way, we use the "nearest neighbors" across clusters to determine the distances between clusters; this method is called single linkage. This rule produces "stringy" types of clusters, that is, clusters "chained together" by only single objects that happen to be close together. Alternatively, we may use the neighbors across clusters that are furthest away from each other; this method is called complete linkage. There are numerous other linkage rules such as these that have been proposed.
Single linkage (nearest neighbor). As described above, in this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."
Complete linkage (furthest neighbor). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.
Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages.
Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations, the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous method) should be used when the cluster sizes are suspected to be greatly uneven. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.
Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In a sense, it is the center of gravity for the respective cluster. In this method, the distance between two clusters is determined as the distance between their centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.
Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.
Ward's method. This method is distinct from all other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step. Refer to Ward (1963) for details concerning this method. In general, this method is regarded as very efficient; however, it tends to create clusters of small size.
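As a hedged illustration of these linkage rules, the sketch below uses SciPy's hierarchical clustering routines (assumed to be installed) on made-up data; SciPy's "single", "complete", "average", "weighted", "centroid", "median" and "ward" methods correspond roughly to single linkage, complete linkage, UPGMA, WPGMA, UPGMC, WPGMC and Ward's method as described above.

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])  # two synthetic "clumps"

for method in ["single", "complete", "average", "weighted", "centroid", "median", "ward"]:
    Z = linkage(data, method=method)  # each row of Z records one amalgamation and its linkage distance
    print(method, "final merge distance:", round(float(Z[-1, 2]), 2))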
For an overview of the other two methods of clustering, see Two-way Joining and k-Means Clustering.
Two-way Joining
* Introductory Overview
* Two-way Joining
Introductory Overview
Previously, we have discussed this method in terms of "objects" that are to be clustered (see Joining (Tree Clustering)). In all other types of analyses the research question of interest is usually expressed in terms of cases (observations) or variables. It turns out that the clustering of both may yield useful results. For example, imagine a study where a medical researcher has gathered data on different measures of physical fitness (variables) for a sample of heart patients (cases). The researcher may want to cluster cases (patients) to detect clusters of patients with similar syndromes. At the same time, the researcher may want to cluster variables (fitness measures) to detect clusters of measures that appear to tap similar physical abilities.
Two-way Joining
Given the discussion in the paragraph above concerning whether to cluster cases or variables, one may wonder why not cluster both simultaneously? Two-way joining is useful in (the relatively rare) circumstances when one expects that both cases and variables will simultaneously contribute to the uncovering of meaningful patterns of clusters.
For example, returning to the example above, the medical researcher may want to identify clusters of patients that are similar with regard to particular clusters of similar measures of physical fitness. The difficulty with interpreting these results may arise from the fact that the similarities between different clusters may pertain to (or be caused by) somewhat different subsets of variables. Thus, the resulting structure (clusters) is by nature not homogeneous. This may seem a bit confusing at first, and, indeed, compared to the other clustering methods described (see Joining (Tree Clustering) and k-Means Clustering), two-way joining is probably the one least commonly used. However, some researchers believe that this method offers a powerful exploratory data analysis tool (for more information you may want to refer to the detailed description of this method in Hartigan, 1975).
k-Means Clustering
* Example
* Computations
* Interpretation of results
General logic
This method of clustering is very different from the Joining (Tree Clustering) and Two-way Joining methods. Suppose that you already have hypotheses concerning the number of clusters in your cases or variables. You may want to "tell" the computer to form exactly 3 clusters that are to be as distinct as possible. This is the type of research question that can be addressed by the k-means clustering algorithm. In general, the k-means method will produce exactly k different clusters of greatest possible distinction. It should be mentioned that the best number of clusters k leading to the greatest separation (distance) is not known a priori and must be computed from the data (see Finding the Right Number of Clusters).
Example
In the physical fitness example (see Two-way Joining), the medical researcher may have a "hunch" from clinical experience that her heart patients fall basically into three different categories with regard to physical fitness. She might wonder whether this intuition can be quantified, that is, whether a k-means cluster analysis of the physical fitness measures would indeed produce the three clusters of patients as expected. If so, the means on the different measures of physical fitness for each cluster would represent a quantitative way of expressing the researcher's hypothesis or intuition (i.e., patients in cluster 1 are high on measure 1, low on measure 2, etc.).
Computations
Computationally, you may think of this method as analysis of variance (ANOVA) "in reverse." The program will start with k random clusters, and then move objects between those clusters with the goal to 1) minimize variability within clusters and 2) maximize variability between clusters. In other words, the similarity rules will apply maximally to the members of one cluster and minimally to members belonging to the rest of the clusters. This is analogous to "ANOVA in reverse" in the sense that the significance test in ANOVA evaluates the between group variability against the within-group variability when computing the significance test for the hypothesis that the means in the groups are different from each other. In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.
Interpretation of results
Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all, dimensions used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.
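The following is a minimal sketch of this kind of analysis with scikit-learn's KMeans (assumed to be installed); the "fitness" data are synthetic, and the per-dimension F values are computed only as descriptive indicators of how well each dimension separates the clusters, not as significance tests (see the note on statistical significance testing above).

import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
fitness = np.vstack([rng.normal(m, 1.0, (30, 4)) for m in (0.0, 3.0, 6.0)])  # three made-up groups of patients

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(fitness)
print(km.cluster_centers_)              # means for each cluster on each dimension

for d in range(fitness.shape[1]):
    groups = [fitness[km.labels_ == c, d] for c in range(3)]
    F, _ = f_oneway(*groups)            # larger F => the dimension discriminates better between clusters
    print("dimension", d, "F =", round(float(F), 1))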
EM (Expectation Maximization) Clustering
* Introductory Overview
* The EM Algorithm
Introductory Overview
The methods described here are similar to the k-Means algorithm described above, and you may want to review that section for a general overview of these techniques and their applications. The general purpose of these techniques is to detect clusters in observations (or variables) and to assign those observations to the clusters. A typical example application for this type of analysis is a marketing research study in which a number of consumer behavior related variables are measured for a large sample of respondents. The purpose of the study is to detect "market segments," i.e., groups of respondents that are somehow more similar to each other (to all other members of the same cluster) when compared to respondents that "belong to" other clusters. In addition to identifying such clusters, it is usually of equal interest to determine how the clusters are different, i.e., to determine the specific variables or dimensions that vary, and how they vary, in regard to members in different clusters.
k-means clustering. To reiterate, the classic k-Means algorithm was popularized and refined by Hartigan (1975; see also Hartigan and Wong, 1978). The basic operation of that algorithm is relatively simple: Given a fixed number of (desired or hypothesized) k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible.
Extensions and generalizations. The EM (expectation maximization) algorithm extends this basic approach to clustering in two important ways:
1. Instead of assigning cases or observations to clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.
2. Unlike the classic implementation of k-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic k-means algorithm can also be modified to accommodate categorical variables).
The EM Algorithm
The EM algorithm for clustering is described in detail in Witten and Frank (2001). The basic approach and logic of this clustering method is as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each sample, the distribution of values for the continuous variable follows the normal distribution. The resulting distribution of values (in the population) may look like this:
Mixtures of distributions. The illustration shows two normal distributions with different means and different standard deviations, and the sum of the two distributions. Only the mixture (sum) of the two normal distributions (with different means and standard deviations) would be observed. The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data (distribution). Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters.
With the implementation of the EM algorithm in some computer programs, you may be able to select (for continuous variables) different distributions such as the normal, log-normal, and Poisson distributions. You can select different distributions for different variables and, thus, derive clusters for mixtures of different types of distributions.
Categorical variables. The EM algorithm can also accommodate categorical variables. The method will at first randomly assign different probabilities (weights, to be precise) to each class or category, for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the likelihood of the data given the specified number of clusters.
Classification probabilities instead of classifications. The results of EM clustering are different from those computed by k-means clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. Of course, as a final result you can usually review an actual assignment of observations to clusters, based on the (largest) classification probability.
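As a hedged sketch of these ideas for a single continuous variable, the code below fits a two-component Gaussian mixture with scikit-learn's GaussianMixture (an EM implementation, assumed to be installed) to synthetic data and reports classification probabilities rather than hard assignments.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 2, 200)]).reshape(-1, 1)  # mixture of two normals

gm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gm.means_.ravel(), np.sqrt(gm.covariances_).ravel())  # estimated cluster means and standard deviations

probs = gm.predict_proba(x[:5])   # each observation belongs to each cluster with a certain probability
labels = gm.predict(x[:5])        # hard assignment = cluster with the largest classification probability
print(probs, labels)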
Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation
An important question that needs to be answered before applying the k-means or EM clustering algorithms is how many clusters there are in the data. This is not known a priori and, in fact, there might be no definite or unique answer as to what value k should take. In other words, k is a nuisance parameter of the clustering model. Luckily, an estimate of k can be obtained from the data using the method of cross-validation. Remember that the k-means and EM methods determine cluster solutions for a particular user-defined number of clusters. The k-means and EM clustering techniques (described above) can be optimized and enhanced for typical applications in data mining. The general metaphor of data mining implies a situation in which an analyst searches for useful structures and "nuggets" in the data, usually without any strong a priori expectations of what might be found (in contrast to the hypothesis-testing approach of scientific research). In practice, the analyst usually does not know ahead of time how many clusters there might be in the sample. For that reason, some programs include an implementation of a v-fold cross-validation algorithm for automatically determining the number of clusters in the data.
This unique algorithm is immensely useful in all general "pattern-recognition" tasks - to determine the number of market segments in a marketing research study, the number of distinct spending patterns in studies of consumer behavior, the number of clusters of different medical symptoms, the number of different types (clusters) of documents in text mining, the number of weather patterns in meteorological research, the number of defect patterns on silicon wafers, and so on.
The v-fold cross-validation algorithm applied to clustering. The v-fold cross-validation algorithm is described in some detail in Classification Trees and General Classification and Regression Trees (GC&RT). The general idea of this method is to divide the overall sample into v folds. The same type of analysis is then successively applied to the observations belonging to the v − 1 folds (training sample), and the results of the analyses are applied to sample v (the sample or fold that was not used to estimate the parameters, build the tree, determine the clusters, etc.; this is the testing sample) to compute some index of predictive validity. The results for the v replications are aggregated (averaged) to yield a single measure of the stability of the respective model, i.e., the validity of the model for predicting new observations.
Cluster analysis is an unsupervised learning technique, and we cannot observe the (real) number of clusters in the data. However, it is reasonable to replace the usual notion (applicable to supervised learning) of "accuracy" with that of "distance." In general, we can apply the v-fold cross-validation method to a range of numbers of clusters in k-means or EM clustering, and observe the resulting average distance of the observations (in the cross-validation or testing samples) from their cluster centers (for k-means clustering); for EM clustering, an appropriate equivalent measure would be the average negative (log-) likelihood computed for the observations in the testing samples.
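A minimal sketch of this procedure for EM clustering is given below, assuming scikit-learn is available; for each candidate number of clusters, the model is fitted on v − 1 folds and scored on the held-out fold by its average negative log-likelihood, and the averaged costs can then be plotted against the number of clusters.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(m, 1.0, (100, 2)) for m in (0.0, 4.0, 8.0)])  # three synthetic clusters

v = 10
for k in range(1, 7):
    costs = []
    for train_idx, test_idx in KFold(n_splits=v, shuffle=True, random_state=0).split(data):
        gm = GaussianMixture(n_components=k, random_state=0).fit(data[train_idx])
        costs.append(-gm.score(data[test_idx]))   # average negative log-likelihood of the testing fold
    print(k, round(float(np.mean(costs)), 3))     # plot these values against k and look for the leveling-off point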
Reviewing the results of v-fold cross-validation. The results of v-fold cross-validation are best reviewed in a simple line graph.
Shown here is the result of analyzing a data set widely known to contain three clusters of observations (specifically, the well-known Iris data file reported by Fisher, 1936, and widely referenced in the literature on discriminant function analysis). Also shown (in the graph to the right) are the results of analyzing simple normal random numbers. The "real" data (shown on the left) exhibit the characteristic scree-plot pattern (see also Factor Analysis), where the cost function (in this case, 2 times the log-likelihood of the cross-validation data, given the estimated parameters) quickly decreases as the number of clusters increases, but then (past 3 clusters) levels off and even increases as the data are overfitted. In contrast, the random numbers show no such pattern; in fact, there is basically no decrease in the cost function at all, and it quickly begins to increase as the number of clusters increases and overfitting occurs.
It is easy to see from this simple illustration how useful the v-fold cross-validation technique, applied to k-means and EM clustering, can be for determining the "right" number of clusters in the data.
Cluster Analysis (by Novian Hrbowo, 15407025)
Clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. Often similarity is assessed according to a distance measure. Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.
Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis.
Types of clustering
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.
Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.
Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.
Distance measure
An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2, √2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.
Common distance functions:
• The Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
• The Manhattan distance (also called taxicab norm or 1-norm)
• The maximum norm
• The Mahalanobis distance corrects data for different scales and correlations in the variables
• The angle between two vectors can be used as a distance measure when clustering high dimensional data. See Inner product space.
• The Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
Hierarchical clustering
Creating clusters
Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. The traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements at one end and a single cluster containing every element at the other. Agglomerative algorithms begin at the leaves of the tree, whereas divisive algorithms begin at the root. (In the figure, the arrows indicate an agglomerative clustering.)
Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
Agglomerative hierarchical clustering
For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric.
[Figure: the raw data to be clustered]
The hierarchical clustering dendrogram would be as such:
[Figure: the resulting dendrogram (traditional representation)]
This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements b and c; we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters A and B is one of the following:
• The maximum distance between elements of each cluster (also called complete linkage clustering): max { d(a, b) : a ∈ A, b ∈ B }
• The minimum distance between elements of each cluster (also called single-linkage clustering): min { d(a, b) : a ∈ A, b ∈ B }
• The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): (1 / (|A|·|B|)) Σ(a ∈ A) Σ(b ∈ B) d(a, b)
• The sum of all intra-cluster variance.
• The increase in variance for the cluster being merged (Ward's criterion).
• The probability that candidate clusters spawn from the same distribution function (V-linkage).
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
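Both stopping rules can be illustrated with SciPy's hierarchical clustering utilities (assumed to be installed); the sketch below cuts the same complete-linkage tree once with a distance criterion and once with a number criterion, on made-up data.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
points = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(5, 0.5, (15, 2))])

Z = linkage(points, method="complete")                    # complete-linkage agglomeration

by_distance = fcluster(Z, t=3.0, criterion="distance")    # stop when clusters are further apart than t
by_count = fcluster(Z, t=2, criterion="maxclust")         # stop when only t clusters remain
print(np.unique(by_distance), np.unique(by_count))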
Concept clustering
Another variation of the agglomerative clustering approach is conceptual clustering.
Partitional clustering
K-means and derivatives
K-means clustering
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.
The algorithm steps are [1]:
• Choose the number of clusters, k.
• Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
• Assign each point to the nearest cluster center.
• Recompute the new cluster centers.
• Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and its speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance.
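For concreteness, here is a minimal NumPy sketch of the steps listed above; the data and k are made up, and no care is taken over corner cases such as empty clusters.

import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
k = 2

centers = X[rng.choice(len(X), size=k, replace=False)]  # k random points as initial cluster centers
while True:
    # assign each point to the nearest cluster center
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
    # recompute the new cluster centers as the mean of the assigned points
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):                # convergence: the centers stop changing
        break
    centers = new_centers

print(centers)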
Fuzzy c-means clustering
In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster. For each point x we have a coefficient giving the degree of being in the kth cluster, uk(x). Usually, the sum of those coefficients is defined to be 1:
Σk uk(x) = 1.
With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster (with the fuzzifier m introduced below):
centerk = Σx uk(x)^m x / Σx uk(x)^m.
The degree of belonging is related to the inverse of the distance to the cluster center:
uk(x) = 1 / d(centerk, x);
the coefficients are then normalized and fuzzified with a real parameter m > 1 so that their sum is 1:
uk(x) = 1 / Σj ( d(centerk, x) / d(centerj, x) )^(2/(m−1)).
For m equal to 2, this is equivalent to normalising the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
The fuzzy c-means algorithm is very similar to the k-means algorithm:
• Choose a number of clusters.
• Assign randomly to each point coefficients for being in the clusters.
• Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
o Compute the centroid for each cluster, using the formula above.
o For each point, compute its coefficients of being in the clusters, using the formula above.
The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights. The expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas (partial membership in classes). It has better convergence properties and is in general preferred to fuzzy c-means.
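A hedged NumPy sketch of these update rules is given below (illustrative data, c clusters, fuzzifier m > 1, sensitivity threshold eps); it follows the formulas above rather than any particular library implementation.

import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(5, 1, (40, 2))])
c, m, eps = 2, 2.0, 1e-5                       # number of clusters, fuzzifier, sensitivity threshold

U = rng.random((len(X), c))
U /= U.sum(axis=1, keepdims=True)              # random coefficients, summing to 1 for each point

for _ in range(200):
    centers = (U ** m).T @ X / (U ** m).sum(axis=0)[:, None]          # centroids weighted by degree of belonging
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    ratio = d[:, :, None] / d[:, None, :]                             # d(x, center_k) / d(x, center_j)
    U_new = 1.0 / (ratio ** (2.0 / (m - 1))).sum(axis=2)              # normalized, fuzzified coefficients
    if np.abs(U_new - U).max() < eps:                                 # converged: change no more than eps
        U = U_new
        break
    U = U_new

print(centers)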
QT clustering algorithm
QT (quality threshold) clustering (Heyer, Kruglyak, Yooseph, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.
The algorithm is:
• The user chooses a maximum diameter for clusters.
• Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold.
• Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. (If more than one candidate cluster has the maximum number of points, a tie-breaking rule is needed.)
• Recurse with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters).
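The sketch below is one possible, deliberately simple reading of the QT steps above (candidate clusters grown point by point, with the diameter check done by complete linkage); the maximum diameter and the data are made up.

import numpy as np

rng = np.random.default_rng(7)
points = list(np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))]))
max_diameter = 1.5
clusters = []

while points:
    best = []
    for seed in range(len(points)):
        candidate = [seed]
        # add the closest remaining point, then the next closest, and so on
        order = sorted(range(len(points)), key=lambda j: np.linalg.norm(points[j] - points[seed]))
        for j in order[1:]:
            # complete linkage: maximum distance from the new point to any member of the candidate
            if max(np.linalg.norm(points[j] - points[i]) for i in candidate) > max_diameter:
                break
            candidate.append(j)
        if len(candidate) > len(best):
            best = candidate
    best_set = set(best)
    clusters.append([points[i] for i in best])                       # save the largest candidate cluster
    points = [p for i, p in enumerate(points) if i not in best_set]  # recurse with the reduced set of points

print([len(c) for c in clusters])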
Locality-sensitive hashing
Locality-sensitive hashing can be used for clustering. Feature space vectors are sets, and the metric used is the Jaccard distance. The feature space can be considered high-dimensional. The min-wise independent permutations LSH scheme (sometimes MinHash) is then used to put similar items into buckets. With just one set of hashing methods, there are only clusters of very similar elements. By seeding the hash functions several times (e.g., 20), it is possible to get bigger clusters. [2]
Graph-theoretic methods
Formal concept analysis is a technique for generating clusters of objects and attributes, given a bipartite graph representing the relations between the objects and attributes. Other methods for generating overlapping clusters (a cover rather than a partition) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970).
Determining the number of clusters
Many clustering algorithms require that you specify up front the number of clusters to find. If that number is not apparent from prior knowledge, it should be chosen in some way. Several methods for this have been suggested in the statistical literature,[3]:365 where one rule of thumb sets the number to k ≈ √(n/2), with n as the number of objects (data points).
[Figure: explained variance as a function of the number of clusters; the "elbow", indicated by the red circle, suggests choosing 4 clusters.]
Another rule of thumb looks at the percentage of variance explained as a function of the number of clusters: you should choose a number of clusters so that adding another cluster does not give much better modeling of the data. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified.[4] Percentage of variance explained is the ratio of the between-group variance to the total variance. A slight variation of this method plots the curvature of the within-group variance.[5] The method can be traced to Robert L. Thorndike in 1953.[6]
Other ways to determine the number of clusters use Akaike information criterion (AIC) or Bayesian information criterion (BIC) — if it is possible to make a likelihood function for the clustering model. For example: The k-means model is "almost" a Gaussian mixture model and one can construct a likelihood for the Gaussian mixture model and thus also determine AIC and BIC values.[7]
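Two of these approaches, the square-root rule of thumb and BIC for a Gaussian mixture model, can be sketched as follows (scikit-learn assumed available; the data are synthetic).

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
data = np.vstack([rng.normal(m, 1.0, (80, 2)) for m in (0.0, 5.0, 10.0, 15.0)])  # four synthetic clusters

rule_of_thumb_k = int(np.sqrt(len(data) / 2))          # k roughly sqrt(n/2)

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(data).bic(data) for k in range(1, 9)}
print(rule_of_thumb_k, min(bics, key=bics.get))        # the k with the lowest BIC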
Spectral clustering
Given a set of data points A, the similarity matrix may be defined as a matrix S where Sij represents a measure of the similarity between points i and j. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions.
One such technique is the Shi-Malik algorithm, commonly used for image segmentation. It partitions points into two sets (S1,S2) based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix
L = I − D^(−1/2) S D^(−1/2)
of S, where D is the diagonal matrix with entries Dii = Σj Sij.
This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S1, and the rest in S2. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.
A related algorithm is the Meila-Shi algorithm, which takes the eigenvectors corresponding to the k largest eigenvalues of the matrix P = S D^(−1) for some k, and then invokes another clustering algorithm (e.g., k-means) to cluster points by their respective k components in these eigenvectors.
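A minimal NumPy sketch of the Shi-Malik bipartition described above is given below; the data and the Gaussian similarity function are assumptions made for illustration, not part of the algorithm's definition.

import numpy as np

rng = np.random.default_rng(9)
A = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
S = np.exp(-sq_dists / 2.0)                              # similarity matrix (Gaussian kernel, an assumed choice)

D_inv_sqrt = np.diag(1.0 / np.sqrt(S.sum(axis=1)))       # D_ii = sum_j S_ij
L = np.eye(len(A)) - D_inv_sqrt @ S @ D_inv_sqrt         # L = I - D^(-1/2) S D^(-1/2)

eigvals, eigvecs = np.linalg.eigh(L)                     # eigenvalues returned in ascending order
v = eigvecs[:, 1]                                        # eigenvector of the second-smallest eigenvalue
partition = v > np.median(v)                             # S1: components above the median; S2: the rest
print(partition.astype(int))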
Applications
Biology
In biology, clustering has many applications:
• In imaging, data clustering may take different forms based on the data dimensionality. For example, the SOCR EM Mixture model segmentation activity and applet shows how to obtain point, region or volume classification using the online SOCR computational libraries.
• In the fields of plant and animal ecology, clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.
• In computational biology and bioinformatics:
o In transcriptomics, clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.
o In sequence analysis, clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.
o In high-throughput genotyping platforms clustering algorithms are used to automatically assign genotypes.
• In QSAR and molecular modeling studies, as well as in chemoinformatics
Medicine
In medical imaging, such as PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.
Market research
Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers.
• Segmenting the market and determining target markets
• Product positioning
• New product development
• Selecting test markets (see : experimental techniques)
Other applications
Social network analysis
In the study of social networks, clustering may be used to recognize communities within large groups of people.
Image segmentation
Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
Data mining
Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres.
Search result grouping
In the intelligent grouping of files and websites, clustering may be used to create a more relevant set of search results than normal search engines such as Google provide. There are currently a number of web-based clustering tools, such as Clusty.
Slippy map optimization
Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This both speeds up map display and reduces visual clutter.
IMRT segmentation
Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
Grouping of shopping items
Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay does not have the concept of a SKU.)
Mathematical chemistry
Clustering can be used to find structural similarity between compounds; for example, 3000 chemical compounds were clustered in the space of 90 topological indices.[8]
Petroleum geology
Cluster analysis is used to reconstruct missing bottom-hole core data or missing log curves in order to evaluate reservoir properties.
Comparisons between data clusterings
There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. Many of these measures are derived from the matching matrix (aka confusion matrix), e.g., the Rand measure and the Fowlkes-Mallows Bk measures.[9]
Several different clustering systems based on mutual information have been proposed. One is Marina Meila's 'Variation of Information' metric (see ref below); another provides hierarchical clustering[10].
Algorithms
In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998). Among the most popular are CLARANS (Ng and Han, 1994), DBSCAN (Ester et al., 1996) and BIRCH (Zhang et al., 1996).
Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis.
•
editTypes of clustering
Data clustering algorithms can be hierarchical. Hierarchical algorithms find successive clusters using previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.
Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.
Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.
Distance measure
An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2, or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.
Common distance functions:
• The Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
• The Manhattan distance (also called taxicab norm or 1-norm)
• The maximum norm
• The Mahalanobis distance corrects data for different scales and correlations in the variables
• The angle between two vectors can be used as a distance measure when clustering high dimensional data. See Inner product space.
• The Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
[edit] Hierarchical clustering
[edit] Creating clusters
Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. The traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements at one end and a single cluster containing every element at the other. Agglomerative algorithms begin at the leaves of the tree, whereas divisive algorithms begin at the root. (In the figure, the arrows indicate an agglomerative clustering.)
Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
[edit] Agglomerative hierarchical clustering
For example, suppose this data is to be clustered, and the euclidean distance is the distance metric.
Raw data
The hierarchical clustering dendrogram would be as such:
Traditional representation
This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single-linkage clustering page; it can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters and is one of the following:
• The maximum distance between elements of each cluster (also called complete linkage clustering):
• The minimum distance between elements of each cluster (also called single-linkage clustering):
• The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA):
• The sum of all intra-cluster variance.
• The increase in variance for the cluster being merged (Ward's criterion).
• The probability that candidate clusters spawn from the same distribution function (V-linkage).
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
[edit] Concept clustering
Another variation of the agglomerative clustering approach is conceptual clustering.
[edit] Partitional clustering
[edit] K-means and derivatives
[edit] K-means clustering
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.
The algorithm steps are [1]:
• Choose the number of clusters, k.
• Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
• Assign each point to the nearest cluster center.
• Recompute the new cluster centers.
• Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance.
[edit] Fuzzy c-means clustering
In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster, may be in the cluster to a lesser degree than points in the center of cluster. For each point x we have a coefficient giving the degree of being in the kth cluster uk(x). Usually, the sum of those coefficients is defined to be 1:
With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:
The degree of belonging is related to the inverse of the distance to the cluster center:
then the coefficients are normalized and fuzzyfied with a real parameter m > 1 so that their sum is 1. So
For m equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. When m is close to 1, then cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
The fuzzy c-means algorithm is very similar to the k-means algorithm:
• Choose a number of clusters.
• Assign randomly to each point coefficients for being in the clusters.
• Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) :
o Compute the centroid for each cluster, using the formula above.
o For each point, compute its coefficients of being in the clusters, using the formula above.
The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is a local minimum, and the results depend on the initial choice of weights. The Expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes. It has better convergence properties and is in general preferred to fuzzy-c-means.
[edit] QT clustering algorithm
QT (quality threshold) clustering (Heyer, Kruglyak, Yooseph, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.
The algorithm is:
• The user chooses a maximum diameter for clusters.
• Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold.
• Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. Must clarify what happens if more than 1 cluster has the maximum number of points ?
• Recurse with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters).
[edit] Locality-sensitive hashing
Locality-sensitive hashing can be used for clustering. Feature space vectors are sets, and the metric used is the Jaccard distance. The feature space can be considered high-dimensional. The min-wise independent permutations LSH scheme (sometimes MinHash) is then used to put similar items into buckets. With just one set of hashing methods, there are only clusters of very similar elements. By seeding the hash functions several times (eg 20), it is possible to get bigger clusters. [2]
[edit] Graph-theoretic methods
Formal concept analysis is a technique for generating clusters of objects and attributes, given a bipartite graph representing the relations between the objects and attributes. Other methods for generating overlapping clusters (a cover rather than a partition) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970).
[edit] Determining the number of clusters
Many clustering algorithms require that you specify up front the number of clusters to find. If that number is not apparent from prior knowledge, it should be chosen in some way. Several methods for this have been suggested within the statistical literature.[3]:365 where one rule of thumb sets the number to
with n as the number of objects (data points).
Explained Variance. The "elbow" is indicated by the red circle. The number of clusters chosen should therefore be 4.
Another rule of thumb looks at the percentage of variance explained as a function of the number of clusters: You should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters are chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified.[4] Percentage of variance explained is the ratio of the between-group variance to the total variance. A slight variation of this method plots the curvature of the within group variance.[5] The method can be traced to Robert L. Thorndike in 1953.[6]
Other ways to determine the number of clusters use Akaike information criterion (AIC) or Bayesian information criterion (BIC) — if it is possible to make a likelihood function for the clustering model. For example: The k-means model is "almost" a Gaussian mixture model and one can construct a likelihood for the Gaussian mixture model and thus also determine AIC and BIC values.[7]
[edit] Spectral clustering
Given a set of data points A, the similarity matrix may be defined as a matrix S where Sij represents a measure of the similarity between points . Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions.
One such technique is the Shi-Malik algorithm, commonly used for image segmentation. It partitions points into two sets (S1,S2) based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix
L = I - D^{-1/2} S D^{-1/2}
of S, where D is the diagonal matrix with entries D_ii = Σ_j S_ij.
This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S1, and the rest in S2. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.
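The following NumPy sketch illustrates one two-way split in the spirit of the Shi-Malik algorithm described above. The toy data and the Gaussian-kernel similarity matrix are assumptions made for the example; only the normalized Laplacian, the eigenvector of the second-smallest eigenvalue and the median split come from the description above.

import numpy as np

# Toy data: two well-separated blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

# Gaussian-kernel similarity matrix S and normalized Laplacian
# L = I - D^{-1/2} S D^{-1/2}, with D the diagonal matrix of row sums of S.
S = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
D_inv_sqrt = np.diag(1.0 / np.sqrt(S.sum(axis=1)))
L = np.eye(len(X)) - D_inv_sqrt @ S @ D_inv_sqrt

# Eigenvector of the second-smallest eigenvalue, then a median split.
eigvals, eigvecs = np.linalg.eigh(L)     # eigh returns eigenvalues in ascending order
v = eigvecs[:, 1]
labels = (v > np.median(v)).astype(int)
print(labels)                            # the two blobs end up in different sets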
A related algorithm is the Meila-Shi algorithm, which takes the eigenvectors corresponding to the k largest eigenvalues of the matrix P = S D^{-1} for some k, and then invokes another algorithm (e.g. k-means) to cluster the points by their respective components in these k eigenvectors.
Applications
Biology
In biology, clustering has many applications:
• In imaging, data clustering may take different forms depending on the data dimensionality. For example, the SOCR EM Mixture model segmentation activity and applet shows how to obtain point, region or volume classification using the online SOCR computational libraries.
• In the fields of plant and animal ecology, clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.
• In computational biology and bioinformatics:
o In transcriptomics, clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.
o In sequence analysis, clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.
o In high-throughput genotyping platforms clustering algorithms are used to automatically assign genotypes.
• In QSAR and molecular modeling studies, as well as in chemoinformatics.
Medicine
In medical imaging, such as PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.
Market research
Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. Typical applications include:
• Segmenting the market and determining target markets
• Product positioning
• New product development
• Selecting test markets (see: experimental techniques)
Other applications
Social network analysis
In the study of social networks, clustering may be used to recognize communities within large groups of people.
Image segmentation
Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
Data mining
Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres.
Search result grouping
In the intelligent grouping of files and websites, clustering may be used to create a more relevant set of search results than normal search engines like Google provide. There are currently a number of web-based clustering tools such as Clusty.
Slippy map optimization
Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This both speeds up the map and reduces visual clutter.
IMRT segmentation
Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
Grouping of Shopping Items
Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay doesn't have the concept of a SKU)
Mathematical chemistry
To find structural similarity and related properties; for example, 3,000 chemical compounds were clustered in the space of 90 topological indices.[8]
Petroleum Geology
Cluster Analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties.
Comparisons between data clusterings
There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. Many of these measures are derived from the matching matrix (aka confusion matrix), e.g., the Rand measure and the Fowlkes-Mallows Bk measures.[9]
Several different clustering systems based on mutual information have been proposed. One is Marina Meila's 'Variation of Information' metric (see ref below); another provides hierarchical clustering[10].
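As an illustration, the Rand measure mentioned above can be computed directly as the fraction of point pairs on which two clusterings agree (both clusterings place the pair together, or both place it apart). The sketch below is a straightforward, unoptimized Python version; the example labelings are made up.

from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of pairs (i, j) treated the same way by both clusterings:
    # either grouped together in both, or separated in both.
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))   # 0.8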
Algorithms
In recent years, considerable effort has been put into improving algorithm performance (Z. Huang, 1998). Among the most popular are CLARANS (Ng and Han, 1994), DBSCAN (Ester et al., 1996) and BIRCH (Zhang et al., 1996).
Factor Analysis - Destri Ayu Pratiwi 15406070
Source: http://www.statsoft.com/textbook/stfacan.html
Principal Components and Factor Analysis
* General Purpose
* Basic Idea of Factor Analysis as a Data Reduction Method
* Factor Analysis as a Classification Method
* Miscellaneous Other Issues and Statistics
General Purpose
The main applications of factor analytic techniques are: (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is to classify variables. Therefore, factor analysis is applied as a data reduction or structure detection method (the term factor analysis was first introduced by Thurstone, 1931). The topics listed below will describe the principles of factor analysis, and how it can be applied towards these two purposes. We will assume that you are familiar with the basic logic of statistical reasoning as described in Elementary Concepts. Moreover, we will also assume that you are familiar with the concepts of variance and correlation; if not, we advise that you read the Basic Statistics chapter at this point.
There are many excellent books on factor analysis. For example, a hands-on how-to approach can be found in Stevens (1986); more detailed technical descriptions are provided in Cooley and Lohnes (1971); Harman (1976); Kim and Mueller, (1978a, 1978b); Lawley and Maxwell (1971); Lindeman, Merenda, and Gold (1980); Morrison (1967); or Mulaik (1972). The interpretation of secondary factors in hierarchical factor analysis, as an alternative to traditional oblique rotational strategies, is explained in detail by Wherry (1984).
Confirmatory factor analysis. Structural Equation Modeling (SEPATH) allows you to test specific hypotheses about the factor structure for a set of variables, in one or several samples (e.g., you can compare factor structures across samples).
Correspondence analysis. Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by factor analysis techniques, and they allow one to explore the structure of categorical variables included in the table. For more information regarding these methods, refer to Correspondence Analysis.
Basic Idea of Factor Analysis as a Data Reduction Method
Suppose we conducted a (rather "silly") study in which we measure 100 people's height in inches and centimeters. Thus, we would have two variables that measure height. If in future studies, we want to research, for example, the effect of different nutritional food supplements on height, would we continue to use both measures? Probably not; height is one characteristic of a person, regardless of how it is measured.
Let us now extrapolate from this "silly" study to something that one might actually do as a researcher. Suppose we want to measure people's satisfaction with their lives. We design a satisfaction questionnaire with various items; among other things we ask our subjects how satisfied they are with their hobbies (item 1) and how intensely they are pursuing a hobby (item 2). Most likely, the responses to the two items are highly correlated with each other. (If you are not familiar with the correlation coefficient, we recommend that you read the description in Basic Statistics - Correlations) Given a high correlation between the two items, we can conclude that they are quite redundant.
Combining Two Variables into a Single Factor. One can summarize the correlation between two variables in a scatterplot. A regression line can then be fitted that represents the "best" summary of the linear relationship between the variables. If we could define a variable that would approximate the regression line in such a plot, then that variable would capture most of the "essence" of the two items. Subjects' single scores on that new factor, represented by the regression line, could then be used in future data analyses to represent that essence of the two items. In a sense we have reduced the two variables to one factor. Note that the new factor is actually a linear combination of the two variables.
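A small simulated sketch may help make this concrete: two highly correlated items are standardized and replaced by their first principal component, which serves as the single factor score. The variable names and the simulated data are illustrative assumptions only.

import numpy as np

# Simulate two correlated satisfaction items (names are illustrative only).
rng = np.random.default_rng(3)
hobby_satisfaction = rng.normal(size=200)
hobby_intensity = 0.8 * hobby_satisfaction + 0.6 * rng.normal(size=200)

# Standardize both items and take their first principal component as the factor.
Z = np.column_stack([hobby_satisfaction, hobby_intensity])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
factor_scores = Z @ eigvecs[:, -1]        # one score per subject on the new factor
print(round(eigvals[-1] / 2, 3))          # proportion of total variance the factor captures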
Principal Components Analysis. The example described above, combining two correlated variables into one factor, illustrates the basic idea of factor analysis, or of principal components analysis to be precise (we will return to this later). If we extend the two-variable example to multiple variables, then the computations become more involved, but the basic principle of expressing two or more variables by a single factor remains the same.
Extracting Principal Components. We do not want to go into the details about the computational aspects of principal components analysis here, which can be found elsewhere (references were provided at the beginning of this section). However, basically, the extraction of principal components amounts to a variance maximizing (varimax) rotation of the original variable space. For example, in a scatterplot we can think of the new factor as the original X axis rotated so that it approximates the regression line. This type of rotation is called variance maximizing because the criterion for (goal of) the rotation is to maximize the variance (variability) of the "new" variable (factor), while minimizing the variance around the new variable (see Rotational Strategies).
Generalizing to the Case of Multiple Variables. When there are more than two variables, we can think of them as defining a "space," just as two variables defined a plane. Thus, when we have three variables, we could plot a three-dimensional scatterplot, and, again, we could fit a plane through the data. With more than three variables it becomes impossible to illustrate the points in a scatterplot; however, the logic of rotating the axes so as to maximize the variance of the new factor remains the same.
Multiple orthogonal factors. After we have found the line on which the variance is maximal, there remains some variability around this line. In principal components analysis, after the first factor has been extracted, that is, after the first line has been drawn through the data, we continue and define another line that maximizes the remaining variability, and so on. In this manner, consecutive factors are extracted. Because each consecutive factor is defined to maximize the variability that is not captured by the preceding factor, consecutive factors are independent of each other. Put another way, consecutive factors are uncorrelated or orthogonal to each other.
How many Factors to Extract? Remember that, so far, we are considering principal components analysis as a data reduction method, that is, as a method for reducing the number of variables. The question then is, how many factors do we want to extract? Note that as we extract consecutive factors, they account for less and less variability. The decision of when to stop extracting factors basically depends on when there is only very little "random" variability left. The nature of this decision is arbitrary; however, various guidelines have been developed, and they are reviewed in Reviewing the Results of a Principal Components Analysis under Eigenvalues and the Number-of-Factors Problem.
Reviewing the Results of a Principal Components Analysis. Without further ado, let us now look at some of the standard results from a principal components analysis. To reiterate, we are extracting factors that account for less and less variance. To simplify matters, one usually starts with the correlation matrix, where the variances of all variables are equal to 1.0. Therefore, the total variance in that matrix is equal to the number of variables. For example, if we have 10 variables each with a variance of 1 then the total variability that can potentially be extracted is equal to 10 times 1. Suppose that in the satisfaction study introduced earlier we included 10 items to measure different aspects of satisfaction at home and at work. The variance accounted for by successive factors would be summarized as follows:
STATISTICA FACTOR ANALYSIS
Eigenvalues (factor.sta)
Extraction: Principal components

Value    Eigenval    % total variance    Cumul. Eigenval    Cumul. %
1        6.118369        61.18369             6.11837        61.1837
2        1.800682        18.00682             7.91905        79.1905
3         .472888         4.72888             8.39194        83.9194
4         .407996         4.07996             8.79993        87.9993
5         .317222         3.17222             9.11716        91.1716
6         .293300         2.93300             9.41046        94.1046
7         .195808         1.95808             9.60626        96.0626
8         .170431         1.70431             9.77670        97.7670
9         .137970         1.37970             9.91467        99.1467
10        .085334          .85334            10.00000       100.0000
In the second column (Eigenval) above, we find the variance on the new factors that were successively extracted. In the third column, these values are expressed as a percentage of the total variance (in this example, 10). As we can see, factor 1 accounts for 61 percent of the variance, factor 2 for 18 percent, and so on. As expected, the sum of the eigenvalues is equal to the number of variables. The last two columns contain the cumulative eigenvalues and the cumulative percentage of variance extracted. The variances extracted by the factors are called the eigenvalues; this name derives from the computational issues involved.
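For readers who want to reproduce this kind of table, the NumPy sketch below extracts the eigenvalues of a correlation matrix and prints the same four columns. Because the correlation matrix here is a random stand-in (we do not have the raw satisfaction data), the numbers will not match the table above.

import numpy as np

# Stand-in for the 10 x 10 correlation matrix of the satisfaction items.
rng = np.random.default_rng(4)
R = np.corrcoef(rng.normal(size=(100, 10)), rowvar=False)

eigvals = np.linalg.eigvalsh(R)[::-1]          # eigenvalues, largest first
pct = 100 * eigvals / eigvals.sum()            # percent of total variance (= 10)
print("Value  Eigenval  %total  Cumul.Eigenval  Cumul.%")
for i, (ev, p, ce, cp) in enumerate(zip(eigvals, pct,
                                        np.cumsum(eigvals), np.cumsum(pct)), 1):
    print(f"{i:5d} {ev:9.4f} {p:7.3f} {ce:15.4f} {cp:8.3f}")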
Eigenvalues and the Number-of-Factors Problem
Now that we have a measure of how much variance each successive factor extracts, we can return to the question of how many factors to retain. As mentioned earlier, by its nature this is an arbitrary decision. However, there are some guidelines that are commonly used, and that, in practice, seem to yield the best results.
The Kaiser criterion. First, we can retain only factors with eigenvalues greater than 1. In essence this is like saying that, unless a factor extracts at least as much as the equivalent of one original variable, we drop it. This criterion was proposed by Kaiser (1960), and is probably the one most widely used. In our example above, using this criterion, we would retain 2 factors (principal components).
The scree test. A graphical method is the scree test first proposed by Cattell (1966). We can plot the eigenvalues shown above in a simple line plot.
Cattell suggests finding the place where the smooth decrease of eigenvalues appears to level off to the right of the plot. To the right of this point, presumably, one finds only "factorial scree" -- "scree" is the geological term referring to the debris which collects on the lower part of a rocky slope. According to this criterion, we would probably retain 2 or 3 factors in our example.
Which criterion to use. Both criteria have been studied in detail (Browne, 1968; Cattell & Jaspers, 1967; Hakstian, Rogers, & Cattell, 1982; Linn, 1968; Tucker, Koopman & Linn, 1969). Theoretically, one can evaluate those criteria by generating random data based on a particular number of factors. One can then see whether the number of factors is accurately detected by those criteria. Using this general technique, the first method (Kaiser criterion) sometimes retains too many factors, while the second technique (scree test) sometimes retains too few; however, both do quite well under normal conditions, that is, when there are relatively few factors and many cases. In practice, an additional important aspect is the extent to which a solution is interpretable. Therefore, one usually examines several solutions with more or fewer factors, and chooses the one that makes the best "sense." We will discuss this issue in the context of factor rotations below.
Principal Factors Analysis
Before we continue to examine the different aspects of the typical output from a principal components analysis, let us now introduce principal factors analysis. Let us return to our satisfaction questionnaire example to conceive of another "mental model" for factor analysis. We can think of subjects' responses as being dependent on two components. First, there are some underlying common factors, such as the "satisfaction-with-hobbies" factor we looked at before. Each item measures some part of this common aspect of satisfaction. Second, each item also captures a unique aspect of satisfaction that is not addressed by any other item.
Communalities. If this model is correct, then we should not expect that the factors will extract all variance from our items; rather, only that proportion that is due to the common factors and shared by several items. In the language of factor analysis, the proportion of variance of a particular item that is due to common factors (shared with other items) is called communality. Therefore, an additional task facing us when applying this model is to estimate the communalities for each variable, that is, the proportion of variance that each item has in common with other items. The proportion of variance that is unique to each item is then the respective item's total variance minus the communality. A common starting point is to use the squared multiple correlation of an item with all other items as an estimate of the communality (refer to Multiple Regression for details about multiple regression). Some authors have suggested various iterative "post-solution improvements" to the initial multiple regression communality estimate; for example, the so-called MINRES method (minimum residual factor method; Harman & Jones, 1966) will try various modifications to the factor loadings with the goal to minimize the residual (unexplained) sums of squares.
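As a minimal sketch of the squared-multiple-correlation starting estimate mentioned above: the SMC of each item with all other items can be obtained from the inverse of the correlation matrix as 1 - 1/diag(R⁻¹). The small matrix below reuses the work-item correlations from the later table purely for illustration.

import numpy as np

def smc_communalities(R):
    # Squared multiple correlation of each item with all the others,
    # computed from the diagonal of the inverse correlation matrix.
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

R_work = np.array([[1.00, 0.65, 0.65],
                   [0.65, 1.00, 0.73],
                   [0.65, 0.73, 1.00]])
print(np.round(smc_communalities(R_work), 3))   # starting communality estimates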
Principal factors vs. principal components. The defining characteristic then that distinguishes between the two factor analytic models is that in principal components analysis we assume that all variability in an item should be used in the analysis, while in principal factors analysis we only use the variability in an item that it has in common with the other items. A detailed discussion of the pros and cons of each approach is beyond the scope of this introduction (refer to the general references provided in Principal components and Factor Analysis - Introductory Overview). In most cases, these two methods usually yield very similar results. However, principal components analysis is often preferred as a method for data reduction, while principal factors analysis is often preferred when the goal of the analysis is to detect structure (see Factor Analysis as a Classification Method).
Factor Analysis as a Classification Method
Let us now return to the interpretation of the standard results from a factor analysis. We will henceforth use the term factor analysis generically to encompass both principal components and principal factors analysis. Let us assume that we are at the point in our analysis where we basically know how many factors to extract. We may now want to know the meaning of the factors, that is, whether and how we can interpret them in a meaningful manner. To illustrate how this can be accomplished, let us work "backwards," that is, begin with a meaningful structure and then see how it is reflected in the results of a factor analysis. Let us return to our satisfaction example; shown below is the correlation matrix for items pertaining to satisfaction at work and items pertaining to satisfaction at home.
STATISTICA FACTOR ANALYSIS
Correlations (factor.sta)
Casewise deletion of MD, n=100

Variable    WORK_1    WORK_2    WORK_3    HOME_1    HOME_2    HOME_3
WORK_1        1.00       .65       .65       .14       .15       .14
WORK_2         .65      1.00       .73       .14       .18       .24
WORK_3         .65       .73      1.00       .16       .24       .25
HOME_1         .14       .14       .16      1.00       .66       .59
HOME_2         .15       .18       .24       .66      1.00       .73
HOME_3         .14       .24       .25       .59       .73      1.00
The work satisfaction items are highly correlated amongst themselves, and the home satisfaction items are highly intercorrelated amongst themselves. The correlations across these two types of items (work satisfaction items with home satisfaction items) are comparatively small. It thus seems that there are two relatively independent factors reflected in the correlation matrix, one related to satisfaction at work, the other related to satisfaction at home.
Factor Loadings. Let us now perform a principal components analysis and look at the two-factor solution. Specifically, let us look at the correlations between the variables and the two factors (or "new" variables), as they are extracted by default; these correlations are also called factor loadings.
STATISTICA FACTOR ANALYSIS
Factor Loadings (Unrotated)
Principal components

Variable    Factor 1    Factor 2
WORK_1       .654384     .564143
WORK_2       .715256     .541444
WORK_3       .741688     .508212
HOME_1       .634120    -.563123
HOME_2       .706267    -.572658
HOME_3       .707446    -.525602
Expl.Var    2.891313    1.791000
Prp.Totl     .481885     .298500
Apparently, the first factor is generally more highly correlated with the variables than the second factor. This is to be expected because, as previously described, these factors are extracted successively and will account for less and less variance overall.
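For illustration, unrotated principal-component loadings such as these can be reproduced as the eigenvectors of the correlation matrix scaled by the square roots of their eigenvalues; the sketch below uses the 6 x 6 correlation matrix shown earlier. The signs of a column may come out flipped relative to the printed table, which does not affect the interpretation.

import numpy as np

# Work/home correlation matrix from the table above.
R = np.array([[1.00, 0.65, 0.65, 0.14, 0.15, 0.14],
              [0.65, 1.00, 0.73, 0.14, 0.18, 0.24],
              [0.65, 0.73, 1.00, 0.16, 0.24, 0.25],
              [0.14, 0.14, 0.16, 1.00, 0.66, 0.59],
              [0.15, 0.18, 0.24, 0.66, 1.00, 0.73],
              [0.14, 0.24, 0.25, 0.59, 0.73, 1.00]])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]                    # keep the two largest components
loadings = eigvecs[:, order] * np.sqrt(eigvals[order])   # item-factor correlations
print(np.round(loadings, 3))
print(np.round((loadings ** 2).sum(axis=0), 3))          # explained variance per factor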
Rotating the Factor Structure. We could plot the factor loadings shown above in a scatterplot. In that plot, each variable is represented as a point. In this plot we could rotate the axes in any direction without changing the relative locations of the points to each other; however, the actual coordinates of the points, that is, the factor loadings would of course change. In this example, if you produce the plot it will be evident that if we were to rotate the axes by about 45 degrees we might attain a clear pattern of loadings identifying the work satisfaction items and the home satisfaction items.
Rotational strategies. There are various rotational strategies that have been proposed. The goal of all of these strategies is to obtain a clear pattern of loadings, that is, factors that are somehow clearly marked by high loadings for some variables and low loadings for others. This general pattern is also sometimes referred to as simple structure (a more formalized definition can be found in most standard textbooks). Typical rotational strategies are varimax, quartimax, and equamax.
We have described the idea of the varimax rotation before (see Extracting Principal Components), and it can be applied to this problem as well. As before, we want to find a rotation that maximizes the variance on the new axes; put another way, we want to obtain a pattern of loadings on each factor that is as diverse as possible, lending itself to easier interpretation. Below is the table of rotated factor loadings.
STATISTICA FACTOR ANALYSIS
Factor Loadings (Varimax normalized)
Extraction: Principal components

Variable    Factor 1    Factor 2
WORK_1       .862443     .051643
WORK_2       .890267     .110351
WORK_3       .886055     .152603
HOME_1       .062145     .845786
HOME_2       .107230     .902913
HOME_3       .140876     .869995
Expl.Var    2.356684    2.325629
Prp.Totl     .392781     .387605
Interpreting the Factor Structure. Now the pattern is much clearer. As expected, the first factor is marked by high loadings on the work satisfaction items, while the second factor is marked by high loadings on the home satisfaction items. We would thus conclude that satisfaction, as measured by our questionnaire, is composed of those two aspects; hence we have arrived at a classification of the variables.
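The rotation itself can be sketched in a few lines of NumPy. The function below implements the plain varimax criterion without the Kaiser (row) normalization implied by "Varimax normalized" above, so its output will differ slightly from the printed table; the unrotated loadings are taken, rounded, from the earlier table.

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Iterative varimax rotation: repeatedly find the orthogonal rotation that
    # increases the variance of the squared loadings within each factor.
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag((Lr ** 2).sum(axis=0))))
        R = u @ vt
        if s.sum() < var_old * (1 + tol):
            break
        var_old = s.sum()
    return L @ R

unrotated = np.array([[.654, .564], [.715, .541], [.742, .508],
                      [.634, -.563], [.706, -.573], [.707, -.526]])
print(np.round(varimax(unrotated), 3))   # close to the rotated loadings shown above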
Consider another example, this time with four additional Hobby/Misc variables added to our earlier example.
In the plot of factor loadings above, 10 variables were reduced to three specific factors, a work factor, a home factor and a hobby/misc. factor. Note that factor loadings for each factor are spread out over the values of the other two factors but are high for its own values. For example, the factor loadings for the hobby/misc variables (in green) have both high and low "work" and "home" values, but all four of these variables have high factor loadings on the "hobby/misc" factor.
Oblique Factors. Some authors (e.g., Cattell & Khanna; Harman, 1976; Jennrich & Sampson, 1966; Clarkson & Jennrich, 1988) have discussed in some detail the concept of oblique (non-orthogonal) factors, in order to achieve more interpretable simple structure. Specifically, computational strategies have been developed to rotate factors so as to best represent "clusters" of variables, without the constraint of orthogonality of factors. However, the oblique factors produced by such rotations are often not easily interpreted. To return to the example discussed above, suppose we had included in the satisfaction questionnaire four items that measured other, "miscellaneous" types of satisfaction. Let us assume that people's responses to those items were affected about equally by their satisfaction at work (Factor 1) and at home (Factor 2). An oblique rotation will likely produce two correlated factors with less-than-obvious meaning, that is, with many cross-loadings.
Hierarchical Factor Analysis. Instead of computing loadings for often difficult-to-interpret oblique factors, you can use a strategy first proposed by Thompson (1951) and Schmid and Leiman (1957), which has been elaborated and popularized in the detailed discussions by Wherry (1959, 1975, 1984). In this strategy, you first identify clusters of items and rotate axes through those clusters; next, the correlations between those (oblique) factors are computed, and that correlation matrix of oblique factors is further factor-analyzed to yield a set of orthogonal factors that divide the variability in the items into that due to shared or common variance (secondary factors) and unique variance due to the clusters of similar variables (items) in the analysis (primary factors). To return to the example above, such a hierarchical analysis might yield the following factor loadings:
STATISTICA FACTOR ANALYSIS
Secondary & Primary Factor Loadings

Factor      Second. 1    Primary 1    Primary 2
WORK_1        .483178      .649499      .187074
WORK_2        .570953      .687056      .140627
WORK_3        .565624      .656790      .115461
HOME_1        .535812      .117278      .630076
HOME_2        .615403      .079910      .668880
HOME_3        .586405      .065512      .626730
MISCEL_1      .780488      .466823      .280141
MISCEL_2      .734854      .464779      .238512
MISCEL_3      .776013      .439010      .303672
MISCEL_4      .714183      .455157      .228351
Careful examination of these loadings would lead to the following conclusions:
1. There is a general (secondary) satisfaction factor that likely affects all types of satisfaction measured by the 10 items;
2. There appear to be two primary unique areas of satisfaction that can best be described as satisfaction with work and satisfaction with home life.
Wherry (1984) discusses in great detail examples of such hierarchical analyses, and how meaningful and interpretable secondary factors can be derived.
Confirmatory Factor Analysis. Over the past 15 years, so-called confirmatory methods have become increasingly popular (e.g., see Jöreskog and Sörbom, 1979). In general, one can specify a priori, a pattern of factor loadings for a particular number of orthogonal or oblique factors, and then test whether the observed correlation matrix can be reproduced given these specifications. Confirmatory factor analyses can be performed via Structural Equation Modeling (SEPATH).
Miscellaneous Other Issues and Statistics
Factor Scores. We can estimate the actual values of individual cases (observations) for the factors. These factor scores are particularly useful when one wants to perform further analyses involving the factors that one has identified in the factor analysis.
Reproduced and Residual Correlations. An additional check for the appropriateness of the respective number of factors that were extracted is to compute the correlation matrix that would result if those were indeed the only factors. That matrix is called the reproduced correlation matrix. To see how this matrix deviates from the observed correlation matrix, one can compute the difference between the two; that matrix is called the matrix of residual correlations. The residual matrix may point to "misfits," that is, to particular correlation coefficients that cannot be reproduced appropriately by the current number of factors.
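A minimal sketch of this check, assuming R is the observed correlation matrix and loadings holds the retained (rotated) factor loadings: the reproduced matrix is simply the product of the loadings with their transpose, and the residual matrix is the difference. The loading values below are the rounded figures from the varimax table above.

import numpy as np

# Observed correlations and (rounded) rotated loadings from the tables above.
R = np.array([[1.00, 0.65, 0.65, 0.14, 0.15, 0.14],
              [0.65, 1.00, 0.73, 0.14, 0.18, 0.24],
              [0.65, 0.73, 1.00, 0.16, 0.24, 0.25],
              [0.14, 0.14, 0.16, 1.00, 0.66, 0.59],
              [0.15, 0.18, 0.24, 0.66, 1.00, 0.73],
              [0.14, 0.24, 0.25, 0.59, 0.73, 1.00]])
loadings = np.array([[.862, .052], [.890, .110], [.886, .153],
                     [.062, .846], [.107, .903], [.141, .870]])

reproduced = loadings @ loadings.T      # correlations implied by the two factors
residual = R - reproduced               # what the two factors fail to reproduce
print(np.round(residual, 2))            # large off-diagonal entries flag "misfits"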
Matrix Ill-conditioning. If, in the correlation matrix there are variables that are 100% redundant, then the inverse of the matrix cannot be computed. For example, if a variable is the sum of two other variables selected for the analysis, then the correlation matrix of those variables cannot be inverted, and the factor analysis can basically not be performed. In practice this happens when you are attempting to factor analyze a set of highly intercorrelated variables, as it, for example, sometimes occurs in correlational research with questionnaires. Then you can artificially lower all correlations in the correlation matrix by adding a small constant to the diagonal of the matrix, and then restandardizing it. This procedure will usually yield a matrix that now can be inverted and thus factor-analyzed; moreover, the factor patterns should not be affected by this procedure. However, note that the resulting estimates are not exact.
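A hedged sketch of this work-around: add a small constant to the diagonal of the (nearly singular) correlation matrix and then rescale it so that the diagonal is 1 again. The constant 0.01 and the example matrix are arbitrary choices for illustration.

import numpy as np

def ridge_correlation(R, constant=0.01):
    # Add a small constant to the diagonal, then rescale so the diagonal is 1.
    R_adj = np.asarray(R, dtype=float) + constant * np.eye(len(R))
    d = np.sqrt(np.diag(R_adj))
    return R_adj / np.outer(d, d)

# A nearly singular correlation matrix (highly intercorrelated items).
R_bad = np.array([[1.0, 0.5, 0.9],
                  [0.5, 1.0, 0.8],
                  [0.9, 0.8, 1.0]])
print(np.linalg.cond(R_bad), np.linalg.cond(ridge_correlation(R_bad)))
# The adjusted matrix is better conditioned and can be inverted more safely.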