Friday, April 26, 2024

Final Project Visual Analytics

    For this project, I will be utilizing statistical visualizations derived from the "USMacroB" dataset. Spanning from 1959 to 1995, this dataset offers a view of macroeconomic variables in the United States. Our objective is to glean insights into pivotal economic indicators such as gross national product (GNP), monetary base, and treasury bill rates.

Exploring Economic Indicators:

Gross National Product (GNP): Our exploration commences with an examination of GNP's trajectory over time. Employing a time series plot, we chart the quarterly fluctuations in GNP from 1959 to 1995. This visualization provides a lucid depiction of economic growth trends throughout this period.

Monetary Base and Treasury Bill Rates: Transitioning to monetary indicators, we will be using the average of the seasonally adjusted monetary base and the average of the 3-month treasury bill rates. Through histograms and scatter plots, we delve into the distribution and relationships of these variables.

Analyzing Trends and Relationships:

Correlation Analysis: Venturing beyond individual variables, we delve into the intricate relationships among economic indicators. Employing scatter plots and correlation coefficients, we unearth correlations between GNP, monetary base, and treasury bill rates. This analysis yields valuable insights into the interconnectivity of economic factors.

Time Series Analysis: Leveraging time series plots, we scrutinize the trends and seasonality of economic variables across time. By decomposing the data into trend, seasonal, and residual components, we attain a deeper comprehension of enduring patterns and cyclical fluctuations within the economy.

Interactive Visualization

Animated Scatter Plot: To enrich our exploration, we deploy an animated scatter plot to visualize the dynamic interplay between GNP, monetary base, and treasury bill rates. This dynamic visualization enables us to observe the evolution of these variables over time and explore how their trends unfold across the period.



The Data

The data set comes from a source provided earlier in the course-- the GitHub data sets. After perusing the available data sets, I decided to use an economic set that contained 146 observations and 3 variables: GNP (gross national product), mbase (average of the seasonally adjusted monetary base), and tbill, which reflected the 3-month (quarterly) average of treasury bill rates. I went ahead and looked at the summary stats and toyed with different functions within R to get a feel for the data and its preliminary characteristics.


    This image provides a preview of the data; it shows a column designated 'rownames' too, which is something I had to alter when cleaning the data. Because each row reflected the passage of a single quarter, I knew I would be able to use it for illustrating quarterly changes and getting a fine-grained view of the data as the years went on. However, this posed another issue--it is tedious to look back at the row you are viewing and determine what year it is based on how many rows have passed (4 per year, 146 rows, 1959-1995), and thus what point in the fiscal year a data point is from. My solution was to create a new data frame that repeated each year four times to coincide with the quarters and bind it to the existing data frame.
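A minimal sketch of that binding step, with placeholder values standing in for the real USMacroB data and my own names for the objects (e.g. 'year_df'):

```r
# USMacroB has 146 quarterly rows starting in 1959. Repeat each year
# four times (once per quarter), trim to the row count, and bind it on.
usmacro <- data.frame(
  gnp   = runif(146, 500, 7000),   # placeholders for the real series
  mbase = runif(146, 40, 450),
  tbill = runif(146, 1, 15)
)

year_df <- rep(1959:1995, each = 4)[1:nrow(usmacro)]
usmacro <- cbind(usmacro, year = year_df)

head(usmacro$year)   # 1959 1959 1959 1959 1960 1960
```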




With the new column, quarterly and yearly changes and continuities become easier to understand visually, and data manipulation becomes simpler.

Visualizations 
    I wanted to have a diverse set of visualizations when creating models and plots, so I opted to use different strategies when creating them. For this segment, plotly, ggplot2, and animated graphics will be employed to produce an array of visualizations to compare variables.
    I decided plotly provides the baseline for R visualizations and began using it first:
The first plot created was a general plot and produced this result:
This plot does not provide evidence of a causal relationship; however, we can see that the variables appear to have a close connection and show visual correlation. The second plot I created was meant to compare GNP over time. This is the most commonly recognized plot when observing economic changes, and as we will see, the trend is apparent.
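A sketch of what such a first plotly pass might look like; the data here are placeholders, not the project's exact call:

```r
library(plotly)

# Plot GNP and the monetary base against the quarter index to eyeball
# how closely the two series track each other.
usmacro <- data.frame(
  quarter = 1:146,
  gnp   = cumsum(runif(146, 10, 60)),   # placeholder series
  mbase = cumsum(runif(146, 1, 5))
)

p <- plot_ly(usmacro, x = ~quarter) %>%
  add_markers(y = ~gnp, name = "GNP") %>%
  add_markers(y = ~mbase, name = "Monetary base")
p
```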
One concern I have about using the cleaned data set and year is the duplication of data points and odd layering. While users understand these as Q1-Q4, R cannot distinguish them in plotly, and assumptions are made about which quarter each point along the axes belongs to. As you will see, I flip-flop between using 'rownames' and 'year_df' to communicate understanding of the data. Rownames (quarters) frequently produce cleaner and more consistent-looking visualizations; the year column can do the same, but its clusters of data points do not provide further clarity and are therefore omitted from this presentation.
The next visualization compares the frequency and change of our mbase (monetary base) through a histogram. In addition, an 'abline' drawn at the mean provides us with perspective on the mbase over time.
    As the monetary base rises over time, the frequency of related values within the data diminishes. This trend highlights the slow increase near the beginning of our time series and the faster progression as the decades continue. This could suggest eras of greater profit and value afforded to the monetary base as the density shifts.
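Since 'abline' is a base-R graphics call, the histogram described above was presumably drawn with hist(); a sketch under that assumption, with placeholder values:

```r
# Placeholder stand-in for the monetary-base series.
mbase <- cumsum(runif(146, 1, 5))

h <- hist(mbase,
          breaks = 20,
          col    = "steelblue",
          main   = "Distribution of the monetary base",
          xlab   = "Monetary base")
abline(v = mean(mbase), col = "red", lwd = 2)   # mean line
```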

GGPLOT
ggplot2 provides users with a tremendous visual advantage over plotly. Plotly can be useful for one-time or infrequent use; however, ggplot2 reigns supreme when visualizations are consistently depended on.

I decided to use quite a few different plots to best articulate the data. The ability to compare and distinguish data points and features is a strong suit of the ggplot2 package.
Here the differences between using years and quarters are evident. The more precise measurement offers a better grasp of the behavior of the data and adheres to the abline more accurately than 'year_df'. Regardless, both illustrate a positive correlation between years and GNP, as well as depict the ebb and flow of GNP in certain years.
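A hedged reconstruction of the quarter-level GNP scatter, with a fitted regression line playing the role of the abline (data are placeholders):

```r
library(ggplot2)

df <- data.frame(quarter = 1:146,
                 gnp = cumsum(runif(146, 10, 60)))  # placeholder values

p <- ggplot(df, aes(x = quarter, y = gnp)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # fitted line
  labs(x = "Quarter (1959-1995)", y = "GNP")
p
```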
The growth of the monetary base was limited, according to this plot, until the 1980s, when it reached new, distinct peaks and rose faster than at any time before. As such, the abline has to accommodate this sudden strength at the right of the plot; where growth was once slowly curving, the mbase rapidly increases in the latter quadrant of the plot.

This plot is the most valuable for its inclusion of multiple variables. While the other plots represented a comparative view of two variables, this plot shows all three. Using rownames, the treasury bill rate appears to have a distinct peak with a fall on either side. Additionally, the monetary base is clearly increasing over time. Something we can draw from this plot is that the treasury bills do not appear to be positively correlated with time, but instead react almost neutrally to the other variables. Indeed, it seems as though the quarters between the 75th (around 1977) and the 125th (around 1990) were exceptionally busy with treasury bills and economic focus. These years coincide with the election of a certain president; however, that can only be an assumed correlation, since this analysis does not compare presidencies or other factors.
The final ggplot I decided to create was a time-series that utilizes a line plot instead. 

This is the most organized and defining visualization of the relationship between bill rates and quarters. The use of 'geom_smooth' encapsulates the overarching behavior of the data and offers a usable visual of the bill rate over time.
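A sketch of that line plot with 'geom_smooth', again over placeholder values rather than the real series:

```r
library(ggplot2)

df <- data.frame(quarter = 1:146,
                 tbill = runif(146, 1, 14))  # placeholder bill rates

p <- ggplot(df, aes(x = quarter, y = tbill)) +
  geom_line(color = "grey40") +
  geom_smooth() +          # smoothed trend laid over the raw line
  labs(x = "Quarter", y = "3-month treasury bill rate")
p
```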

Animated Scatter Plot 

The final visualization tool I will be using is the 'gganimate' package. This will allow me to produce a GIF that presents the data as a moving, living statistical instrument.
This is the code I attempted to use. There are quite a few iterations that I tried in order to troubleshoot issues I was having or to clean up the visual. The product looks as such:
Unfortunately, the animated scatter plot cannot be run through Blogger, so a still is the closest I can get to visualizing it in this medium.
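Since the GIF itself can't render here, a hedged sketch of what a gganimate version can look like; the names and data are illustrative, not the exact project code:

```r
library(ggplot2)
library(gganimate)

df <- data.frame(quarter = 1:146,
                 gnp   = cumsum(runif(146, 10, 60)),  # placeholders
                 mbase = cumsum(runif(146, 1, 5)))

anim <- ggplot(df, aes(x = mbase, y = gnp)) +
  geom_point(color = "steelblue") +
  transition_reveal(quarter) +                 # reveal quarter by quarter
  labs(title = "Quarter: {frame_along}")

# animate(anim, renderer = gifski_renderer("macro.gif"))  # render the GIF
```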
Conclusions
In conclusion, our journey through the realm of economic trends utilizing statistical visualizations has provided invaluable insights into the intricacies of the U.S. economy from 1959 to 1995. By harnessing the power of visual representations, we enhance our understanding of economic phenomena and inform strategic decision-making in a complex and interconnected world.


Reflection
I could have made different decisions at numerous steps in this process. Cleaning the data so I would have a year data frame ended up being more time-consuming than expected and, in the end, unnecessary. I had countless issues along the way figuring out which data set best communicated what I was trying to say, problems rendering ggplots, and many visuals that were tedious to fix. Additionally, I could have added more to the plots and made more inferences and statistical decisions to strengthen the impact of my visuals. In the end, I think I produced a usable and quite attractive presentation on my topic that could be used to make assumptions and inferences about a specific period of history.
 



Thursday, April 25, 2024

Visual Analytics Final Project

 

Introduction:

High school and college students are the primary demographic that credit card companies appeal to, in the United States and across the globe alike. Because this generation faces a new concern for credit scores and their necessity for buying power, I consider it incredibly important to understand how people use credit cards and the qualities they tend to display when doing so. As such, I have decided to do an exploratory analysis of the Credit_Card data set from the https://vincentarelbundock.github.io/Rdatasets/datasets.html website, which was provided earlier in the course. This resource is very helpful for finding unique and viable data sets for exploring relationships in raw data. The primary purpose of using this data is to uncover trends and relationships between user data and spending habits. In an age where there are abundant places to spend money and people willing to use credit, developing a better understanding of these relationships is an important statistical and practical use of data.

Setting up the data:

The first step for completing this analysis required cleaning the data—getting rid of NA values and ensuring the information was properly communicated meant that I could articulate the data efficiently and properly when creating the visualizations. 


These steps in R allowed me to eliminate NA values and scale the figures to a ratio that saved me time and energy making sense of them later. In addition, they afforded me an opportunity to explore the data and make observations about the variables and their qualities (max/min, mean, etc.).
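A minimal sketch of those cleaning steps; the column names and the $10,000s scale follow the Rdatasets version of the data but should be treated as assumptions:

```r
# Toy stand-in for the credit card data with some NA values.
cc <- data.frame(income      = c(3.4, NA, 2.1, 5.0),     # in $10,000s
                 expenditure = c(124.9, 55.0, NA, 300.2))

cc_clean <- na.omit(cc)                          # drop rows with NA values
cc_clean$income_usd <- cc_clean$income * 10000   # rescale to dollars

summary(cc_clean)   # max/min, mean, quartiles per variable
```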

Visualizations:

One of my first steps when starting analysis is always to understand distributions and recognize patterns between variables. Determining which variables offer the most insight as well as getting a grasp on where relationships could pose interesting theoretical questions were among my chief concerns.

I decided to view a distribution of age and income first, as they most likely contribute to the most important aspects of the data. 





In addition, I wanted to better understand the expenditures listed in the data set:  

These visualizations offered insight into the characteristics of the different variables. For example, the Pareto-like curve of the expenditure histogram shows how card holders are concentrated at the lower end of the expenditure range.



For my other visualizations, I opted to use comparative scatter and box plots to illustrate correlations between different variables. 

This visualization provides important perspective on the data; by using ggplot2's 'fill' aesthetic, we are able to identify another important factor in the expenditure of card holders. Some data points depict users of the credit card who are not the owners, e.g., parents and relatives who set up a card in their name for third-party use. In addition, the use of scatter plots provided valuable visual comparisons for understanding the grouping and density dynamics present in the data.
The visualization here presents an interesting trend toward the lower-left quadrant of the plot. Many users lean toward the lower end of the income spectrum and, similarly, spend on average less than those with more disposable income. Further, this plot suggests that ownership does not visibly affect the spending habits of the user, as many of the points are condensed around the same area.
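A sketch of that comparison with synthetic data; note that for points, the 'color' aesthetic does the work that 'fill' does for bars, and the 'owner' flag here is a placeholder name:

```r
library(ggplot2)

set.seed(42)
cc <- data.frame(income      = runif(200, 1, 10),
                 expenditure = runif(200, 0, 1500),
                 owner       = sample(c("yes", "no"), 200, replace = TRUE))

p <- ggplot(cc, aes(x = income, y = expenditure, color = owner)) +
  geom_point(alpha = 0.6) +
  labs(x = "Income ($10,000s)", y = "Expenditure")
p
```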

Another statistical process available within R is the use of support vector machines (SVMs) and other machine learning algorithms to produce comparable data and determine whether an underlying relationship exists between income and expenditure.


The table produces a result suggesting a clear correlation between income and expenditure. The algorithm correctly predicted the outcome often enough to provide worthwhile evidence of the relationship.
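A hedged sketch of such a check using the e1071 package; the data are synthetic stand-ins for the credit card variables:

```r
library(e1071)

set.seed(1)
income       <- runif(300, 1, 10)
expenditure  <- 100 * income + rnorm(300, sd = 150)
high_spender <- factor(expenditure > median(expenditure))

fit  <- svm(high_spender ~ income)                  # classify from income alone
pred <- predict(fit, data.frame(income = income))

table(predicted = pred, actual = high_spender)      # confusion matrix
mean(pred == high_spender)                          # share predicted correctly
```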
To conclude, the visualizations and support vector machine are useful tools for ascertaining the validity and patterns of relationships within data sets. We can deduce from our findings that income, ownership, age, and more play a role in the expenditure of card holders. Further, we can identify income as a key factor in users' decisions to spend or not.

Sunday, April 14, 2024

Module #13

 This assignment was aimed at using R to animate. When I was watching the module videos, I was very interested to know exactly how R would be able to produce legitimate 'animations' in the more standard sense. I had seen how ggplot2 and other packages could be used to create renderings of 3D models without moving pieces, so I already believed R could be used for this as well. Besides the posted websites, I found some other ways to animate besides the 'animation' package by Yihui Xie. One of these was a rendering of a scatter plot that presented the x, y, and z axes and rotated around the edges of the illustration. I found another that produced a 2D scatter plot showing movement of the data points over time, with the year changing as the GIF continued. ScatterPlot Gif Origin

In addition, I wanted to test out the animation that Bryan posted on his blog through the link provided in the assignment. R Bloggers



This is the code for the standard deviation animation provided by Bryan.
This next code segment is for the animated scatter plot:
This really helps to show the different ways in which R is conducive to creating unique visual graphics and helping users be innovative. There are countless packages and tools available that make animating an engaging and even artistic aspect of data visualization!


Sunday, April 7, 2024

Module #12

 For this assignment, social network analysis skills were the highlight. I was somewhat confused by the inclusion of Python in addition to NodeXL for this module. I was able to run the provided code and attempted some of my own articulations with different online data, but it behaved in a way that seemed incompatible with the NodeXL software, and I was getting errors. I took to Google to see if I could instead use R itself and found a helpful resource that relied on an API key, which I had already set up, but I was unable to get the API key to allow the raw Twitter data to load into R and become accessible. I'm not sure what I am doing wrong, but regardless, I have a handle on the conceptual side of social network analysis. It is a very useful tool with a wide range of practical applications in numerous fields of study and occupations. It is definitely beneficial to people needing insights that rely on data mining and data visualization techniques, and I am excited to explore them in the future.

Sunday, March 31, 2024

Module 11 Assignment

 This week's assignment was focused on generating visualizations using a specific kind of R tool. Tufte-style visualization is a new concept to me, and I found the types of visualizations offered by the authors and contributors of these packages incredibly interesting. I decided to use a built-in data set for this assignment: "economics" provides a thorough report of economic information from 1967 to 2015. I decided to plot pce against psavert and applied the Tufte formula to the other characteristics of ggplot2.

I also installed and initialized the other packages prior to this. 

The output of my code is this:

The output is a marginal-histogram scatter plot: the two variables are compared in a central scatter plot, with a histogram of each variable running along its corresponding axis.
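A sketch of how such a marginal-histogram scatter plot can be produced; I'm assuming the ggExtra package's ggMarginal() here, paired with ggplot2's built-in economics data:

```r
library(ggplot2)
library(ggExtra)   # assumed: ggMarginal() builds the marginal histograms

p <- ggplot(economics, aes(x = psavert, y = pce)) +
  geom_point(size = 0.8) +
  theme_minimal()   # a spare, Tufte-leaning theme

ggMarginal(p, type = "histogram")
```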

Sunday, March 24, 2024

Module #11 Debugging and Defensive Programming

 


The error I noticed pretty quickly about this code was on the last line; the "return" line was improperly placed. Once I ran the code and got this error,
I was convinced the error could be attributed to that same line. The (i in 1:nrow) part was correct because it properly selected the rows from 1 to the last; the outlier part of the function was properly written, although I hadn't seen the "all" function before. After dropping the 'return' line down to the next line, the syntax error was solved and the function could be used to print. I'm not sure exactly what the function does, but I'm assuming the input is considered against the data set and each row is deemed an outlier or not.
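A guess at what the repaired function could look like, based on the description above; the body is illustrative (the names and the 2-SD rule are my assumptions), and the point is only that return() now sits inside the function body:

```r
# Flag rows whose values all sit more than 2 SDs from their column means.
find_outliers <- function(df) {
  m  <- as.matrix(df)
  mu <- colMeans(m)
  s  <- apply(m, 2, sd)
  outliers <- c()
  for (i in 1:nrow(m)) {
    if (all(abs(m[i, ] - mu) > 2 * s)) {
      outliers <- c(outliers, i)
    }
  }
  return(outliers)   # the misplaced line, now before the closing brace
}

find_outliers(rbind(matrix(0, 10, 2), c(100, 100)))   # row 11 flagged
```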






Module #10 Assignment

 This module and lecture were focused on visualizing data through time series; the changes and continuities of statistical behaviors within data sets can be shown over any metric of time (seconds, days, years, etc.). Macroscopic perspectives typically use years to understand financial periods, celestial bodies (i.e., rotations and revolutions), or historical trends, to name a few examples. I decided to use the pre-loaded data set within R, "EuStockMarkets", which describes stock market data for four European indices: Germany's DAX (Ibis), Switzerland's SMI, France's CAC, and the UK's FTSE, through the years 1991-1998. First, I loaded the data and my packages; in particular, I loaded ggplot2 to visualize the data, and later incorporated the tidyverse package to manage and manipulate the data.

I changed the data set into a data frame to prepare it for visualization and make it easier to use. Using tidyverse, I told R to use the price and index attributes; the mutate function allowed me to compose the annual cycles (repetitions) of each country. Before the mutation, years were represented as decimals, breaking them into an unusable collection of quantities (i.e., 1991.xxx). Next, I used plotly to visualize the time series.
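A sketch of that preparation step; the object names are my own, but mutate() with floor() is one way to collapse the fractional time index into whole years:

```r
library(tidyverse)

# EuStockMarkets is a multivariate time series (mts); flatten it to a
# long data frame of (time, index, price) before plotting.
eu <- as.data.frame(EuStockMarkets)
eu$time <- as.numeric(time(EuStockMarkets))

eu_long <- eu %>%
  pivot_longer(-time, names_to = "index", values_to = "price") %>%
  mutate(year = floor(time))   # 1991.496 -> 1991, etc.

head(eu_long)
```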


The plot is designed to compare years (x) and price (y), with colors differentiated by the country the stats belong to. The resulting plot is as such:
This visualization depicts a growing rate of stock market prices over the period of interest. There are clear moments of oscillation, but the bigger picture presents a clear trend toward higher prices no matter which country is depicted.
