Introduction
censaurus isn’t just a data science tool. It’s a Census-specific data science tool. Leveraging the intricacies and design of the Census — down to how the Census names its variables — censaurus empowers users to explore, retrieve, and analyze Census data like never before. Key features of censaurus include:
Extensive dataset support:
censaurus natively supports 17 popular Census datasets:
  Decennial Census (general): Decennial
  Decennial Census Redistricting Data: DecennialPL
  Decennial Census Summary File 1: DecennialSF1
  Decennial Census Summary File 2: DecennialSF2
  American Community Survey Census (general): ACS
  American Community Survey 1-Year Data: ACS1
  American Community Survey 1-Year Supplemental Data: ACSSupplemental
  American Community Survey 3-Year Data: ACS3
  American Community Survey 5-Year Data: ACS5
  American Community Survey Migration Flows: ACSFlows
  American Community Survey Language Statistics: ACSLanguage
  Public Use Microdata Sample (general): PUMS
  Current Population Survey (general): CPS
  Economic Census (general): Economic
  Economic Census Key Statistics: EconomicKeyStatistics
  Population Estimates (general): Estimates
  Population Projections (general): Projections
In addition to these built-in datasets, censaurus also supports all other Census API datasets through the generic Dataset class. To filter and find available datasets that meet your requirements, you can use the DatasetExplorer class, which comes with tools like DatasetExplorer.filter_by_term() and DatasetExplorer.filter_by_year().
Powerful variable filtering:
Utilizing the relationships between variables, censaurus grants users full control over the variable selection process. Users can easily select the exact set of variables they want, eliminating the need for the tedious and error-prone task of selecting and listing variables individually. Here’s an example that uses the parent/child relationship between variables:
    >>> from censaurus.dataset import ACS5
    >>> acs = ACS5()
    >>> # B01001_002E is the [sex by age -> total -> male] variable
    >>> acs.variables.children_of(variable='B01001_002E')
    VariableCollection of 23 variables:
    B01001_003E
      group: B01001
      concept: sex by age
      path: [sex by age -> estimate -> total -> male -> under 5 years]
    B01001_004E
      group: B01001
      concept: sex by age
      path: [sex by age -> estimate -> total -> male -> 5 to 9 years]
    ...
    B01001_024E
      group: B01001
      concept: sex by age
      path: [sex by age -> estimate -> total -> male -> 80 to 84 years]
    B01001_025E
      group: B01001
      concept: sex by age
      path: [sex by age -> estimate -> total -> male -> 85 years and over]
Note
There is also support for finding the parent, siblings, cousins, descendants, and ancestors of a particular variable.
This behavior is available thanks to the rich way censaurus processes variables. In particular, censaurus keeps track of each variable as a Variable object, and stores the set of variables available to a particular dataset in a VariableCollection.
Of course, like other Census tools, censaurus also has support for more basic variable filtering. Here is an example where variables are filtered based on specific search terms:
    >>> from censaurus.dataset import ACS5
    >>> acs = ACS5()
    >>> acs.variables.filter_by_term(term=['age', 'sex'], by='concept')
    VariableCollection of 3417 variables:
    B01001A_001E
      group: B01001A
      concept: sex by age (white alone)
      path: [sex by age (white alone) -> estimate -> total]
    B01001A_002E
      group: B01001A
      concept: sex by age (white alone)
      path: [sex by age (white alone) -> estimate -> total -> male]
    ...
    C27009_020E
      group: C27009
      concept: va health care by sex by age
      path: [va health care by sex by age -> estimate -> total -> female -> 65 years and over -> with va health care]
    C27009_021E
      group: C27009
      concept: va health care by sex by age
      path: [va health care by sex by age -> estimate -> total -> female -> 65 years and over -> no va health care]
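To make the parent/child idea concrete, here is a minimal sketch of how such relationships could be derived from variable paths alone. It is not the censaurus implementation: the Var class and the make_var, children_of, and parent_of helpers are hypothetical names written for this illustration, and the path for B01001_002E is assumed to follow the same form as the paths of its children shown above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Var:
    """Hypothetical stand-in for a Census variable: a name plus its label path."""
    name: str
    path: Tuple[str, ...]


def make_var(name: str, path: str) -> Var:
    # Split "a -> b -> c" into ("a", "b", "c").
    return Var(name, tuple(part.strip() for part in path.split("->")))


def children_of(variables: List[Var], name: str) -> List[Var]:
    """Variables whose path is the parent's path plus exactly one more label."""
    parent = next(v for v in variables if v.name == name)
    depth = len(parent.path)
    return [
        v for v in variables
        if len(v.path) == depth + 1 and v.path[:depth] == parent.path
    ]


def parent_of(variables: List[Var], name: str) -> Optional[Var]:
    """The variable whose path is this one's path minus its last label, if any."""
    child = next(v for v in variables if v.name == name)
    for v in variables:
        if v.path == child.path[:-1]:
            return v
    return None


# A small subset of group B01001, with paths in the form printed above.
variables = [
    make_var("B01001_002E", "sex by age -> estimate -> total -> male"),
    make_var("B01001_003E", "sex by age -> estimate -> total -> male -> under 5 years"),
    make_var("B01001_004E", "sex by age -> estimate -> total -> male -> 5 to 9 years"),
    make_var("B01001_025E", "sex by age -> estimate -> total -> male -> 85 years and over"),
]

print([v.name for v in children_of(variables, "B01001_002E")])
print(parent_of(variables, "B01001_004E").name)
```

Siblings, descendants, and ancestors can be expressed the same way, by comparing path prefixes of different lengths.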
Flexible geographic support:
censaurus lets users circumvent the rigid system of geographic specification typically enforced by the Census. As an example, within the American Community Survey, data at the block group level is only available within a specified state, county, or Census tract. censaurus, on the other hand, lets users request block group-level data within any possible geographic area, such as block group-level data within a Census place:
    >>> from censaurus.dataset import ACS5
    >>> acs = ACS5()
    >>> acs.block_groups(within=acs.areas.place(place='Houston, TX'))
    successfully matched 'Houston, TX' to 'Houston city, Texas' (GEOID = 4835000) in layer 'Incorporated Places'
                                                       NAME                 GEO_ID state county   tract block group
    0     Block Group 1, Census Tract 6701.01, Fort Bend...  1500000US481576701011    48    157  670101           1
    1     Block Group 2, Census Tract 6701.01, Fort Bend...  1500000US481576701012    48    157  670101           2
    2     Block Group 3, Census Tract 6701.01, Fort Bend...  1500000US481576701013    48    157  670101           3
    3     Block Group 4, Census Tract 6701.01, Fort Bend...  1500000US481576701014    48    157  670101           4
    4     Block Group 1, Census Tract 6701.02, Fort Bend...  1500000US481576701021    48    157  670102           1
    ...                                                 ...                    ...   ...    ...     ...         ...
    2352  Block Group 2, Census Tract 6923.03, Montgomer...  1500000US483396923032    48    339  692303           2
    2353  Block Group 3, Census Tract 6923.03, Montgomer...  1500000US483396923033    48    339  692303           3
    2354  Block Group 1, Census Tract 6924.02, Montgomer...  1500000US483396924021    48    339  692402           1
    2355  Block Group 4, Census Tract 6924.02, Montgomer...  1500000US483396924024    48    339  692402           4
    2356  Block Group 5, Census Tract 6924.02, Montgomer...  1500000US483396924025    48    339  692402           5

    [2357 rows x 6 columns]
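One way to picture how a request like this can still be answered by the Census API: since block group data must be requested within a state and county, a request for a place could be split into one query per county that the place touches. The sketch below only builds query parameters and does not call any API. It is not how censaurus is implemented, build_block_group_queries is a hypothetical helper, and the form of the "in" predicate is an assumption that may vary by dataset and year. The state and county codes are taken from the output above.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_block_group_queries(state: str, counties: List[str],
                              variables: List[str]) -> List[Dict[str, str]]:
    """Return one set of query parameters per county (hypothetical helper)."""
    return [
        {
            "get": ",".join(variables),
            "for": "block group:*",
            # Assumed predicate form; check the geography docs for your dataset.
            "in": f"state:{state} county:{county}",
        }
        for county in counties
    ]


# State 48 with counties 157 and 339, as seen in the output above.
queries = build_block_group_queries("48", ["157", "339"], ["NAME"])
for params in queries:
    # Each line is a URL-encoded query string for one county.
    print(urlencode(params))
```

The results of the per-county queries would then need to be combined and filtered down to the block groups that actually fall inside the requested area.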
Note
You can even geographically subset your data with a list of geographic areas. This is particularly useful for comparisons. For example, you could request state level data within two particular Census divisions:
>>> acs.states(within=[acs.areas.division(division='New England'), acs.areas.division(division='Mountain')])
This feature greatly expands access to more flexible and specific geographic queries. You could even request block-level data for the entire United States, if you wanted to!
Internally, censaurus interfaces with the TIGERWeb API (another product of the U.S. Census Bureau) to make this behavior possible. When a user requests data inconsistent with the default geographic hierarchy, censaurus dynamically and efficiently converts that request into one (or, potentially, more than one) that can be properly parsed by the Census API.
Census-focused data cleaning tools:
censaurus comes with powerful and convenient tools to help you clean and prepare your Census data for analysis. For example, you can easily rename your output columns to be more legible:
    >>> from censaurus.dataset import ACS5
    >>> # go from this
    >>> acs = ACS5()
    >>> counties = acs.counties(variables=acs.variables.filter_by_group('B01001'))
    >>> counties
          B01001_001E  B01001_002E  B01001_003E  B01001_004E  ...          GEO_ID                              NAME state county
    0           58239        28206         1783         1959  ...  0500000US01001           Autauga County, Alabama    01    001
    1          227131       110781         6121         5709  ...  0500000US01003           Baldwin County, Alabama    01    003
    2           25259        13361          647          743  ...  0500000US01005           Barbour County, Alabama    01    005
    3           22412        12300          603          646  ...  0500000US01007              Bibb County, Alabama    01    007
    4           58884        29530         1818         1906  ...  0500000US01009            Blount County, Alabama    01    009
    ...           ...          ...          ...          ...  ...             ...                               ...   ...    ...
    3216        54544        26057         1123         1277  ...  0500000US72145  Vega Baja Municipio, Puerto Rico    72    145
    3217         8317         4239          255          165  ...  0500000US72147    Vieques Municipio, Puerto Rico    72    147
    3218        22341        10796          509          664  ...  0500000US72149   Villalba Municipio, Puerto Rico    72    149
    3219        31047        15000          559          698  ...  0500000US72151    Yabucoa Municipio, Puerto Rico    72    151
    3220        34704        16548          611          699  ...  0500000US72153      Yauco Municipio, Puerto Rico    72    153

    [3221 rows x 53 columns]
    >>> from censaurus.renamer import SIMPLE_RENAMER
    >>> # to this
    >>> counties = SIMPLE_RENAMER.rename(data=counties)
    >>> counties
          sex by age|total  sex by age|total|male  sex by age|total|male|0-5  ...                              NAME state county
    0                58239                  28206                       1783  ...           Autauga County, Alabama    01    001
    1               227131                 110781                       6121  ...           Baldwin County, Alabama    01    003
    2                25259                  13361                        647  ...           Barbour County, Alabama    01    005
    3                22412                  12300                        603  ...              Bibb County, Alabama    01    007
    4                58884                  29530                       1818  ...            Blount County, Alabama    01    009
    ...                ...                    ...                        ...  ...                               ...   ...    ...
    3216             54544                  26057                       1123  ...  Vega Baja Municipio, Puerto Rico    72    145
    3217              8317                   4239                        255  ...    Vieques Municipio, Puerto Rico    72    147
    3218             22341                  10796                        509  ...   Villalba Municipio, Puerto Rico    72    149
    3219             31047                  15000                        559  ...    Yabucoa Municipio, Puerto Rico    72    151
    3220             34704                  16548                        611  ...      Yauco Municipio, Puerto Rico    72    153

    [3221 rows x 53 columns]
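The renamed columns above look like the variable paths with the "estimate" level dropped, age labels shortened, and the remaining levels joined by "|". The following is a rough sketch of that kind of transformation under those assumptions. It is not the SIMPLE_RENAMER implementation, and simplify_age and rename_path are hypothetical helpers; the "under 5 years" to "0-5" mapping simply mirrors the output shown above.

```python
import re


def simplify_age(label: str) -> str:
    """Shorten a Census age label, e.g. '5 to 9 years' -> '5-9'."""
    m = re.fullmatch(r"under (\d+) years?", label)
    if m:
        return f"0-{m.group(1)}"
    m = re.fullmatch(r"(\d+) (?:to|and) (\d+) years", label)
    if m:
        return f"{m.group(1)}-{m.group(2)}"
    m = re.fullmatch(r"(\d+) years and over", label)
    if m:
        return f"{m.group(1)}+"
    m = re.fullmatch(r"(\d+) years?", label)
    if m:
        return m.group(1)
    # Anything that is not an age label is left untouched.
    return label


def rename_path(path: str, sep: str = "|") -> str:
    """Turn 'a -> estimate -> b -> c' into 'a|b|c' with shortened age labels."""
    parts = [part.strip() for part in path.split("->")]
    return sep.join(simplify_age(p) for p in parts if p != "estimate")


print(rename_path("sex by age -> estimate -> total -> male -> under 5 years"))
print(rename_path("sex by age -> estimate -> total -> male -> 85 years and over"))
```

Passing a different sep corresponds to the kind of separator customization described in the note that follows.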
Note
The SIMPLE_RENAMER can be customized to meet your needs: you can change the separator, add prefixes for specific groups, or add custom renaming functions. You can also create your own renamer from scratch using the Renamer class.
Next, if the Census data you requested is too granular for your needs, you can use the regrouping tools built into censaurus to automatically aggregate your data into new, custom buckets. Here is an example where the fine-grained Census age groups are collapsed into more general ones:
    >>> from censaurus.dataset import ACS5
    >>> from censaurus.renamer import SIMPLE_RENAMER
    >>> # go from this
    >>> acs = ACS5()
    >>> counties = acs.counties(variables=acs.variables.filter_by_group('B01001'))
    >>> counties = SIMPLE_RENAMER.rename(data=counties)
    >>> counties.columns
    Index(['sex by age|total', 'sex by age|total|male',
           'sex by age|total|male|0-5', 'sex by age|total|male|5-9',
           'sex by age|total|male|10-14', 'sex by age|total|male|15-17',
           'sex by age|total|male|18-19', 'sex by age|total|male|20',
           'sex by age|total|male|21', 'sex by age|total|male|22-24',
           'sex by age|total|male|25-29', 'sex by age|total|male|30-34',
           'sex by age|total|male|35-39', 'sex by age|total|male|40-44',
           'sex by age|total|male|45-49', 'sex by age|total|male|50-54',
           'sex by age|total|male|55-59', 'sex by age|total|male|60-61',
           'sex by age|total|male|62-64', 'sex by age|total|male|65-66',
           'sex by age|total|male|67-69', 'sex by age|total|male|70-74',
           'sex by age|total|male|75-79', 'sex by age|total|male|80-84',
           'sex by age|total|male|85+', 'sex by age|total|female',
           'sex by age|total|female|0-5', 'sex by age|total|female|5-9',
           'sex by age|total|female|10-14', 'sex by age|total|female|15-17',
           'sex by age|total|female|18-19', 'sex by age|total|female|20',
           'sex by age|total|female|21', 'sex by age|total|female|22-24',
           'sex by age|total|female|25-29', 'sex by age|total|female|30-34',
           'sex by age|total|female|35-39', 'sex by age|total|female|40-44',
           'sex by age|total|female|45-49', 'sex by age|total|female|50-54',
           'sex by age|total|female|55-59', 'sex by age|total|female|60-61',
           'sex by age|total|female|62-64', 'sex by age|total|female|65-66',
           'sex by age|total|female|67-69', 'sex by age|total|female|70-74',
           'sex by age|total|female|75-79', 'sex by age|total|female|80-84',
           'sex by age|total|female|85+', 'GEO_ID', 'NAME', 'state',
           'county'],
          dtype='object')
    >>> from censaurus.regroup import AgeRegrouper
    >>> # to this
    >>> regrouper = AgeRegrouper(brackets=["0-17", "18-29", "30-49", "50-64", "65+"])
    >>> counties = regrouper.regroup(data=counties)
    >>> counties = SIMPLE_RENAMER.rename(data=counties)
    >>> counties.columns
    Index(['sex by age|total', 'sex by age|total|male',
           'sex by age|total|male|0-17', 'sex by age|total|male|18-29',
           'sex by age|total|male|30-49', 'sex by age|total|male|50-64',
           'sex by age|total|male|65+', 'sex by age|total|female',
           'sex by age|total|female|0-17', 'sex by age|total|female|18-29',
           'sex by age|total|female|30-49', 'sex by age|total|female|50-64',
           'sex by age|total|female|65+', 'GEO_ID', 'NAME', 'state',
           'county'],
          dtype='object')
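Conceptually, this kind of regrouping amounts to summing each fine-grained age column into the broader bracket that contains it. Here is a minimal sketch of that idea on plain dictionaries. It is not the AgeRegrouper implementation: parse_range and regroup_ages are hypothetical helpers, and the counts are made-up illustrative numbers rather than Census data.

```python
import math
from typing import Dict, List, Tuple


def parse_range(label: str) -> Tuple[float, float]:
    """Parse '5-9' -> (5, 9), '20' -> (20, 20), '85+' -> (85, inf)."""
    if label.endswith("+"):
        return int(label[:-1]), math.inf
    low, _, high = label.partition("-")
    return int(low), int(high or low)


def regroup_ages(counts: Dict[str, int], brackets: List[str]) -> Dict[str, int]:
    """Sum each fine-grained age count into the bracket that fully contains it."""
    bounds = {bracket: parse_range(bracket) for bracket in brackets}
    totals = {bracket: 0 for bracket in brackets}
    for label, count in counts.items():
        low, high = parse_range(label)
        for bracket, (b_low, b_high) in bounds.items():
            if b_low <= low and high <= b_high:
                totals[bracket] += count
                break
        else:
            # A fine-grained group that straddles two brackets cannot be split.
            raise ValueError(f"{label!r} does not fit inside any bracket")
    return totals


# Illustrative counts only, keyed by the shortened age labels used above.
counts = {"0-5": 10, "5-9": 12, "10-14": 11, "15-17": 7,
          "18-19": 5, "20": 2, "21": 3, "22-24": 8, "85+": 4}
print(regroup_ages(counts, ["0-17", "18-29", "30-49", "50-64", "65+"]))
```

Raising an error when a group straddles two brackets is a design choice in this sketch: counts can be summed safely, but they cannot be divided without extra assumptions.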
Note
For regrouping variables based on attributes other than age, you can use the generic Regrouper class.
Finally, the Recoder class allows users to recode state names and identifiers to and from various formats (FIPS, GNIS, etc.).
censaurus adds custom pandas.DataFrame and pandas.Series accessors to make this renaming, regrouping, and recoding possible.
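As a rough illustration of what recoding between identifier formats involves, the sketch below converts between state names, FIPS codes, and USPS abbreviations using a small lookup table. It is not the Recoder API: the STATES table and the recode function are hypothetical, only three entries are included, and other formats mentioned above (such as GNIS) are left out.

```python
from typing import Dict, Iterable, List

# A tiny illustrative lookup table; a real recoder would cover every state
# and more identifier formats.
STATES: List[Dict[str, str]] = [
    {"name": "Alabama", "fips": "01", "usps": "AL"},
    {"name": "Texas", "fips": "48", "usps": "TX"},
    {"name": "Puerto Rico", "fips": "72", "usps": "PR"},
]


def recode(values: Iterable[str], source: str, target: str) -> List[str]:
    """Map each value from the source format to the target format."""
    lookup = {row[source]: row[target] for row in STATES}
    return [lookup[value] for value in values]


print(recode(["01", "72"], source="fips", target="name"))
print(recode(["TX"], source="usps", target="fips"))
```

FIPS codes are kept as strings here so that leading zeros, as in "01", are preserved.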
For an example of a full walkthrough using these tools, check out the tutorial page.