First Draft of Chapter 4, Section 1 - The Ontology of Social Facts and Data in Quantitative Social Science Research

Citation: HUANG, W. (2026, January 20). [Book] Foundations of Quantitative Social Science Research. https://doi.org/10.17605/OSF.IO/EVR46

2.1 The Definition of Data

Data are conceptualized differently across disciplines. In statistics, data are observations that can be analyzed to reveal patterns (Fisher, 1925). In computer science, data are discrete, machine-readable representations stored and processed algorithmically (Date, 2003). In the social sciences, data are empirical evidence collected through systematic methods to test hypotheses (King et al., 1994). Critical data studies scholars emphasize that data are not simply given but actively constructed through processes of selection, categorization, and measurement (Gitelman, 2013; Bowker & Star, 1999).

We adopt a phenomenological perspective: we live in a world of phenomena, the lived immediacy of experience. Data constitute an operational representation of these phenomena. This datafication process transforms the continuous, qualitative flux of experience into discrete, formalizable elements that can be manipulated, analyzed, and communicated. As Husserl argued, formal operations require idealization, the transformation of intuitive experience into exact, repeatable objects of thought (Husserl, 1970).

The value of this definition lies in its recognition that phenomena in their immediacy cannot be operated on directly in systematic inquiry. Data provide a formal operational space, a symbolic domain where phenomena are re-presented in ways that enable knowledge production. This is not merely technical translation but ontological transformation: we move from the lifeworld (Lebenswelt) to a constructed space of measurable entities. What becomes data, and what resists datafication, is never neutral but reflects epistemic choices, power relations, and the limits of formalization itself (van Dijck, 2014).

2.2 Social Facts and Their Deconstruction

Social facts are commonly regarded as the main object of social science research. Following Li (2019), social facts can be understood as entities together with five concepts associated with those entities: (1) states, (2) processes, (3) properties, (4) relations, and (5) events.

For quantitative social science, data serve as operational and formal representations of such concepts.

The transformation from abstract social facts to measurable data raises fundamental methodological questions: (1) How do we operationalize abstract concepts into concrete variables? (2) What methods ensure systematic and reliable data collection? (3) Which dimensions of social reality resist datafication? (4) Do our measurements validly capture the phenomena they claim to represent?

Question (1) is addressed in the research design phase. In the data collection phase, we focus mainly on questions (2) through (4).

2.3 Scale and Hierarchy in Spatiotemporal Structures of Social Facts

Similar to dynamical systems in physics, social facts in quantitative research exhibit spatiotemporal structures that are multi-scale and hierarchical. Phenomena unfold across different temporal scales (from moments to epochs) and spatial scales (from individuals to global systems). Moreover, these scales are hierarchically nested: micro-level interactions aggregate into meso-level patterns, which constitute macro-level structures.

In datafication, we must attend to the scale and hierarchical level at which we observe entities and their associated concepts, i.e., states, processes, properties, relations, and events. A phenomenon characterized at one scale may not be meaningfully represented at another.

2.4 Representations

Having established what we datafy (i.e., entities and their associated concepts) and their spatiotemporal characteristics, we now address how these are represented in quantitative research. We examine each component separately, attending to both its representational forms and its scale and hierarchy considerations.

2.4.1 Entities

Entities are the fundamental units of observation in social research, the objects that possess states, undergo processes, exhibit properties, enter into relations, and participate in events. In Li’s framework (Li, 2019), entities constitute the ontological foundation upon which the five concepts operate.

In quantitative data, entities are typically represented through identification and categorical classification. Entities can be identified through unique identifiers or through combinations of properties that distinguish one entity from another. Beyond identification, entities are classified into types through categorical variables distinguishing individual versus collective actors, public versus private organizations, or state versus non-state actors. Some research designs represent entities through their attributes in multidimensional feature spaces, where each entity becomes a vector of characteristics.
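The three representational forms described above (identification, categorical classification, and feature vectors) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed schema: the class name `Entity`, the field names, the identifier scheme, and all values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical attribute names that fix the order of the feature vector.
FEATURE_NAMES = ("staff_count", "budget_musd")


@dataclass(frozen=True)
class Entity:
    entity_id: str     # unique identifier (assumed scheme, illustration only)
    actor_level: str   # "individual" or "collective"
    sector: str        # "public" or "private"
    features: tuple    # numeric attributes, ordered as in FEATURE_NAMES


# Toy records invented for illustration.
entities = [
    Entity("org-001", "collective", "public", (1200.0, 85.0)),
    Entity("org-002", "collective", "private", (300.0, 40.0)),
]

# Identification: retrieve an entity through its unique identifier.
by_id = {e.entity_id: e for e in entities}

# Classification: select entities by a categorical type.
public_ids = [e.entity_id for e in entities if e.sector == "public"]

# Feature space: each entity becomes a vector (one row per entity).
feature_matrix = [list(e.features) for e in entities]
```

Keeping the identifier separate from the descriptive attributes means an entity can be tracked even when its classification or features are later revised.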

Category theory provides a formal framework for reasoning about entity equality and similarity (Awodey, 2010). Two entities are treated as equal, or more precisely as equivalent, if they are isomorphic, meaning there exists an invertible structure-preserving mapping between them. More commonly in empirical research, we deal with similarity rather than equality: in a looser sense, entities are similar if there exist morphisms, or structure-preserving mappings, between them. This framework is practically important for data collection: it guides decisions about which entities in the real world should be treated as instances of the same type and thus included in our dataset. For example, when studying organizations, we must determine whether non-profit entities, governmental agencies, and for-profit corporations are structurally similar enough to be compared, or whether their structural differences require separate treatment.

Entities exist at multiple scales and are often hierarchically nested. Individual persons nest within households, which nest within communities, which nest within nations. Organizations have departments, divisions, and subsidiaries. This nesting creates representational challenges: the same phenomenon may involve entities at different levels, such as individual voting behavior, party strategy, and national electoral systems simultaneously. Researchers must specify the primary unit of analysis while acknowledging cross-level interactions. Aggregating from lower to higher levels, such as from individual opinions to public opinion, involves assumptions about emergence and composition. Disaggregating from higher to lower levels risks ecological fallacy.
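The nesting and aggregation issues above can be made concrete with a toy example. The sketch below assumes a flat table in which each person carries the identifiers of the household and community that contain them. The records, the opinion scores, and the helper `aggregate` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (person_id, household_id, community_id, opinion score in [0, 1]); toy values.
persons = [
    ("p1", "h1", "c1", 0.9),
    ("p2", "h1", "c1", 0.1),
    ("p3", "h2", "c1", 0.5),
    ("p4", "h3", "c2", 0.5),
    ("p5", "h3", "c2", 0.5),
]


def aggregate(rows, level_index):
    """Average the opinion score within each unit at the chosen level."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[level_index]].append(row[3])
    return {unit: mean(scores) for unit, scores in groups.items()}


household_opinion = aggregate(persons, 1)  # group by household
community_opinion = aggregate(persons, 2)  # group by community
```

In this toy table both communities have essentially the same mean opinion, although c1 contains strongly divergent individuals and c2 does not. Reading individual opinions off the community mean would therefore be exactly the disaggregation error that the ecological fallacy warns against.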

2.4.2 States

States refer to the conditions or situations that entities occupy at particular points in time. A state is a snapshot characterization: the configuration of an entity at a given moment. States can be simple or complex, capturing single attributes or multidimensional configurations.

In quantitative data, states are typically represented through variables measured at specific time points. For categorical states, nominal or ordinal variables capture discrete conditions such as employed versus unemployed, democratic versus authoritarian, or conflict versus peace. For continuous states, numerical variables measure magnitudes such as GDP, temperature, or public approval ratings. Complex states may be represented through vectors or matrices capturing multiple simultaneous attributes, such as a nation’s economic, political, and social conditions at time t. In some frameworks, states are represented as probability distributions over possible configurations rather than point estimates, acknowledging uncertainty or heterogeneity.
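As a rough illustration of these forms, one observation record can hold a categorical state, a continuous state, and a probability distribution over possible configurations, all indexed by entity and time. The names `Regime`, `StateObservation`, and `most_likely`, as well as every value, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(Enum):
    DEMOCRATIC = "democratic"
    AUTHORITARIAN = "authoritarian"


@dataclass(frozen=True)
class StateObservation:
    entity_id: str
    time: int            # time point t, here a calendar year
    regime: Regime       # categorical state (point classification)
    gdp_busd: float      # continuous state (invented magnitude)
    regime_probs: dict   # distribution over possible regime states


# Toy observation invented for illustration.
obs = StateObservation(
    entity_id="country-A",
    time=2020,
    regime=Regime.DEMOCRATIC,
    gdp_busd=512.3,
    regime_probs={Regime.DEMOCRATIC: 0.8, Regime.AUTHORITARIAN: 0.2},
)


def most_likely(probs):
    """Collapse a distribution over states into a single point classification."""
    return max(probs, key=probs.get)
```

Storing the distribution alongside the point classification keeps the uncertainty visible instead of discarding it at the moment of coding.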

States manifest at multiple scales and levels. Micro-states describe individual-level conditions, such as a person’s employment status. Meso-states characterize group or organizational conditions, such as a firm’s financial health. Macro-states describe system-level configurations, such as a nation’s regime type. The relationship between states at different levels is not straightforward: macro-states are not simple aggregations of micro-states but may exhibit emergent properties. For example, a society can be in a state of high inequality even when most individuals are in similar economic states. Temporal granularity also matters: a state defined over microseconds differs fundamentally from one defined over decades. The choice of temporal resolution affects what patterns become visible and what variations are smoothed away.

States are inherently reductive, capturing entities in frozen moments. This temporal slicing obscures continuous flux and transitional phases. Many social phenomena resist clear state classification: is a society undergoing revolution in a stable state or between states? States also privilege measurable attributes while marginalizing qualitative characteristics that resist quantification. The decision of which attributes constitute the relevant state description embeds theoretical assumptions about what matters.

2.4.3 Processes

Processes refer to the temporal dynamics through which entities transition between states and develop properties. Unlike states, which are snapshots, processes capture how entities evolve over time. A process is inherently temporal and directional, describing trajectories rather than positions. This conception is analogous to dynamical systems in physics, where processes correspond to trajectories through phase space (states) and parameter space (properties).

We distinguish processes from event emission and process generation, which we consider aspects of states rather than processes themselves. At any given time t, an entity’s state may include its propensity to emit events or spawn sub-processes, but these emissions are state characteristics, not the process. The process is the evolution of these states and properties over time.

In quantitative data, processes are typically represented through time series, longitudinal measurements, or transition models. Time series capture repeated measurements of variables over successive time points, revealing trends, cycles, or fluctuations. Transition matrices represent discrete state changes, showing probabilities of moving from one state to another. Differential equations or difference equations model continuous processes mathematically, describing rates of change as trajectories through state space. Growth curves, decay functions, and trajectory models characterize specific process patterns. Phase space diagrams show how multiple state variables co-evolve, with processes appearing as curves or flows in this space.
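To illustrate one of these representations, the sketch below estimates a transition matrix from a single observed sequence of discrete states by counting consecutive pairs and normalizing each row. It assumes equally spaced observations and a first-order (one-step) dependence. The function name `transition_matrix` and the toy sequence are hypothetical.

```python
from collections import Counter


def transition_matrix(sequence, states):
    """Estimate P[a][b], the share of observed moves out of a that go to b."""
    counts = Counter(zip(sequence, sequence[1:]))  # consecutive state pairs
    matrix = {}
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        matrix[a] = {
            # Rows with no observed departures are left at zero.
            b: (counts[(a, b)] / row_total if row_total else 0.0)
            for b in states
        }
    return matrix


states = ["employed", "unemployed"]
# Toy employment history for one person, one observation per period.
sequence = ["employed", "employed", "unemployed", "employed", "employed"]

P = transition_matrix(sequence, states)
```

Here three transitions leave "employed" (two stay, one moves to "unemployed") and one leaves "unemployed", so the estimated rows are (2/3, 1/3) and (1, 0). With so few observations the estimates are only illustrative.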

Processes resist complete datafication because continuous temporal flow must be discretized into measurement intervals. The choice of sampling frequency determines what process characteristics are captured and what are aliased or missed entirely. Many social processes are non-stationary, meaning their dynamics change over time, violating assumptions of many analytical methods. Processes involving feedback loops, emergence, and non-linearity challenge simple representational schemes. Qualitative transformations, tipping points, and regime shifts may not be adequately captured by gradual quantitative change measures.
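The effect of sampling frequency can be shown with a deliberately artificial series. The sketch below assumes a process that alternates between two values at every time step and compares what two coarser sampling schemes would record. The helper `sample_every` is hypothetical.

```python
# A toy process that alternates 0, 1, 0, 1, ... at every time step.
series = [t % 2 for t in range(12)]


def sample_every(values, step, offset=0):
    """Keep one observation every `step` time steps, starting at `offset`."""
    return values[offset::step]


every_2 = sample_every(series, 2)  # observes steps 0, 2, 4, ...
every_3 = sample_every(series, 3)  # observes steps 0, 3, 6, 9
```

Sampling every two steps records a constant series, so the oscillation is missed entirely. Sampling every three steps records an alternation that appears to repeat every six time steps, although the underlying process repeats every two: a simple case of aliasing.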

2.4.4 Properties

Properties are characteristics or attributes that entities possess. Unlike states which describe conditions at particular times, properties are relatively stable features that characterize entities across contexts. Properties can be intrinsic, belonging to the entity itself, or relational, defined through comparison with other entities.

In quantitative data, properties are represented through variables that characterize entities. Categorical properties use nominal variables to denote types, such as gender, nationality, or organizational form. Ordinal properties capture ranked characteristics, such as education level or firm size categories. Continuous properties use numerical variables to measure magnitudes, such as age, wealth, or geographic area. Composite properties combine multiple indicators into indices or scales, such as socioeconomic status or state capacity indices. Latent properties not directly observable are inferred through multiple manifest indicators using techniques like factor analysis or item response theory. In multidimensional representations, properties define coordinate systems in feature spaces where entities are positioned.
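A composite property can be sketched as an equally weighted average of standardized indicators. This is only one of many possible constructions and is not a latent-variable method such as factor analysis. The indicator names, the values, and the equal weighting are assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one indicator; assumes it varies (non-zero spread)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Toy indicators for three persons (invented values).
education_years = [8, 12, 16]
income_k = [20, 40, 60]
occupation_score = [30, 50, 70]

indicators = [education_years, income_k, occupation_score]
standardized = [zscores(column) for column in indicators]

# Equal-weight composite: average the standardized indicators per person.
ses_index = [mean(person) for person in zip(*standardized)]
```

Standardizing first keeps the indicator with the largest raw scale from dominating the index. The weighting scheme itself remains a substantive choice that should be justified theoretically.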

Properties can be defined at multiple levels and may not aggregate straightforwardly across scales. Individual-level properties such as education are distinct from group-level properties such as average education, which is distinct from distributional properties such as educational inequality. Some properties are emergent, existing only at higher levels of organization: a network has centralization properties that individual nodes do not possess. Other properties are contextual, defined relative to the surrounding environment: a person’s relative income depends on the income distribution of their reference group. The ecological fallacy warns against inferring individual properties from aggregate properties, while the atomistic fallacy warns against inferring collective properties from individual attributes. Properties may also be scale-dependent: organizational complexity measured at the department level differs from complexity measured at the enterprise level.
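The distinction between a group-level average and a distributional property can be illustrated with two toy groups that share the same mean but differ sharply in inequality. The sketch uses the mean-absolute-difference form of the Gini coefficient. The group values are invented and are expressed in arbitrary units.

```python
from statistics import mean


def gini(values):
    """Gini coefficient: mean absolute difference over twice the mean."""
    n = len(values)
    mu = mean(values)
    total_abs_diff = sum(abs(a - b) for a in values for b in values)
    return total_abs_diff / (2 * n * n * mu)


# Two toy groups with identical averages (invented values).
group_a = [10, 10, 10, 10]
group_b = [1, 1, 1, 37]

mean_a, mean_b = mean(group_a), mean(group_b)
gini_a, gini_b = gini(group_a), gini(group_b)
```

Both groups have a mean of 10, yet the first has a Gini coefficient of 0 and the second of 0.675. The average alone cannot distinguish them, and neither figure describes any single member of the second group.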

Many theoretically important properties resist quantification. Concepts like legitimacy, identity, or cultural meaning are difficult to reduce to numerical indicators without significant loss. The operationalization of properties through specific indicators always involves construct validity concerns: does the measure actually capture the theoretical concept? Properties that are fluid, contested, or context-dependent challenge stable measurement. The reification of properties through measurement can obscure their socially constructed nature, treating as natural what is actually historical and contingent.

2.4.5 Relations

Relations describe connections, associations, or interactions between entities. Unlike properties which characterize individual entities, relations are inherently multi-entity concepts. Relations can be directed or undirected, symmetric or asymmetric, binary or multi-way.

In quantitative data, relations are represented through various relational structures. Dyadic relations between pairs of entities are captured in adjacency matrices, edge lists, or relational databases. Network representations use graphs where nodes represent entities and edges represent relations, with edge weights indicating relation strength. Multi-way relations involving more than two entities can be represented through hypergraphs or tensor structures. Relational attributes capture characteristics of the relations themselves, such as tie strength, duration, or type. Temporal networks track how relations change over time. Bipartite or multi-mode networks represent relations between different types of entities. Hierarchical or nested relations are represented through tree structures or multilevel network models.
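Two of the dyadic representations mentioned above, the edge list and the adjacency matrix, can be converted into one another directly. The sketch below assumes directed, weighted ties. The node labels, the weights, and the helper `density` are hypothetical.

```python
# Directed, weighted edge list: (source, target, weight); toy values.
edges = [("A", "B", 3.0), ("B", "C", 1.0), ("A", "C", 2.0)]

# Fix a node ordering so rows and columns of the matrix are well defined.
nodes = sorted({node for s, t, _ in edges for node in (s, t)})
index = {node: i for i, node in enumerate(nodes)}

# Adjacency matrix: entry [i][j] holds the weight of the tie from i to j.
adjacency = [[0.0] * len(nodes) for _ in nodes]
for source, target, weight in edges:
    adjacency[index[source]][index[target]] = weight


def density(adj):
    """Share of possible directed ties (excluding self-ties) that are present."""
    n = len(adj)
    ties = sum(
        1 for i in range(n) for j in range(n) if i != j and adj[i][j] != 0
    )
    return ties / (n * (n - 1))


network_density = density(adjacency)
```

The edge list is compact for sparse networks, while the matrix form makes structural measures such as density straightforward to compute. Three of the six possible directed ties are present here, giving a density of 0.5.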

Relations exist at multiple scales and form hierarchical structures. Micro-relations connect individuals through friendships, conversations, or transactions. Meso-relations link organizations through partnerships, supply chains, or alliances. Macro-relations connect nations through trade, treaties, or conflicts. These levels are interconnected: individual diplomatic interactions constitute interstate relations; organizational partnerships create industry structures. Relations themselves can be nested: individuals embedded in groups which are embedded in organizations which are embedded in broader institutional fields. The structure of relations at one level constrains and enables relations at other levels. Aggregating from micro-relations to macro-relations involves questions about how individual ties constitute systemic structures. Network properties like density, centralization, or clustering may differ fundamentally across scales.

Relational data faces unique challenges. Boundary specification is critical: defining which entities and which types of relations to include fundamentally shapes the resulting analysis. Many important relations are latent, informal, or difficult to observe directly. Relations may be multiplexed, with multiple types of ties between the same entities, complicating representation. Temporal dynamics of relation formation and dissolution are often inadequately captured in cross-sectional network data. The meaning and significance of relations can be context-dependent and not fully captured by structural position alone. Power relations, symbolic relations, and relations of meaning often resist reduction to measurable ties.

2.4.6 Events

Events are discrete occurrences that happen at specific points or intervals in time. Unlike processes which describe continuous change, events are bounded happenings: transitions, occurrences, or incidents that mark temporal discontinuities.

In quantitative data, events are represented through temporal markers and occurrence indicators. Binary variables indicate whether an event occurred during a given period. Timestamps record exact timing of events. Event counts aggregate how many times an event type occurred. Duration variables measure how long events lasted. Event sequences track ordered series of occurrences. Survival or hazard models represent time-to-event data, analyzing when events occur and what factors affect their timing. Point process models treat events as points in time with associated intensities. Event attributes capture characteristics of specific occurrences, such as magnitude, location, or participants. Complex events may be decomposed into event structures showing sub-events and their relationships.
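Several of these event representations can be derived from the same timestamped records. The sketch below assumes each event is stored with a type, a start time, and an end time, and derives counts, durations, and a binary occurrence indicator from them. The event types, the timestamps, and the helper `occurred` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# (event_type, start, end); toy records invented for illustration.
events = [
    ("protest", datetime(2020, 1, 5, 14, 0), datetime(2020, 1, 5, 18, 0)),
    ("protest", datetime(2020, 1, 20, 10, 0), datetime(2020, 1, 20, 11, 30)),
    ("strike", datetime(2020, 2, 3, 8, 0), datetime(2020, 2, 4, 8, 0)),
]

# Event counts: how many times each event type occurred.
counts = Counter(kind for kind, _, _ in events)

# Duration variables: length of each event in hours.
durations_h = [(end - start).total_seconds() / 3600 for _, start, end in events]


def occurred(kind, year, month):
    """Binary indicator: did an event of this type start in the given month?"""
    return any(
        k == kind and start.year == year and start.month == month
        for k, start, _ in events
    )
```

For example, `occurred("protest", 2020, 1)` is true and `occurred("strike", 2020, 1)` is false. Keeping the raw timestamps allows coarser indicators to be rebuilt later at a different temporal resolution.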

Events occur at multiple temporal and organizational scales. Micro-events are brief and localized, such as a single speech act or transaction. Macro-events span extended periods and broad scope, such as wars, revolutions, or economic crises. Events at different scales may be hierarchically related: micro-events can constitute or trigger macro-events, while macro-events provide contexts that shape micro-events. A revolution consists of countless individual acts of resistance, protests, and confrontations, yet the revolution as macro-event has properties and consequences not reducible to its component micro-events. The temporal granularity of event measurement affects what is visible: daily event data capture fluctuations that monthly aggregates smooth away. Events can also cascade across scales: a local bank failure may trigger regional financial instability which precipitates a national crisis.
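The point about temporal granularity can be illustrated by counting the same toy event dates at two resolutions. The dates below are invented so that two months have identical totals but very different daily patterns.

```python
from collections import Counter
from datetime import date

# Toy event dates: a one-day burst in March, a steady run in April.
event_dates = [
    date(2021, 3, 1), date(2021, 3, 1), date(2021, 3, 1), date(2021, 3, 17),
    date(2021, 4, 9), date(2021, 4, 10), date(2021, 4, 11), date(2021, 4, 12),
]

# Daily resolution: one count per calendar day.
daily = Counter(event_dates)

# Monthly resolution: one count per (year, month).
monthly = Counter((d.year, d.month) for d in event_dates)

peak_day, peak_count = daily.most_common(1)[0]
```

At monthly resolution March and April look identical, with four events each. At daily resolution March shows a burst of three events on a single day, while April shows one event on each of four consecutive days.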

Event identification and boundaries are often ambiguous. When exactly does an event begin and end? What counts as a distinct event versus a continuation or recurrence? Many significant events leave minimal empirical traces amenable to datafication. The timing resolution of event data affects analysis: recording only the date versus the exact second of occurrence provides different analytical possibilities. Rare or unprecedented events challenge statistical approaches developed for recurring patterns. The interpretation of events depends on theoretical framing: the same occurrence may be categorized as different event types depending on analytical perspective. Events may be socially constructed, with their recognition and classification reflecting power and interpretive struggles rather than objective happenings.

References

Awodey, S. (2010). Category Theory (2nd ed.). Oxford University Press.

Bowker, G. C., & Star, S. L. (1999). Sorting Things Out: Classification and Its Consequences. MIT Press.

Date, C. J. (2003). An Introduction to Database Systems (8th ed.). Addison-Wesley.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd.

Gitelman, L. (Ed.). (2013). "Raw Data" Is an Oxymoron. MIT Press.

Husserl, E. (1970). The Crisis of European Sciences and Transcendental Phenomenology. Northwestern University Press.

King, G., Keohane, R. O., & Verba, S. (1994). Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton University Press.

Li, S. (李少军). (2019). 国际政治学概论 [Introduction to International Politics] (5th ed.). 上海人民出版社 [Shanghai People’s Publishing House].

van Dijck, J. (2014). Datafication, dataism and dataveillance: Big data between scientific paradigm and ideology. Surveillance & Society, 12(2), 197-208.