WEBVTT

00:00:00.000 --> 00:00:01.000
Well, we can get started, I guess. Um, yeah, we're very happy to have Mariel here.

00:00:01.000 --> 00:00:02.000
Who tells us about…

00:00:02.000 --> 00:00:09.000
Um, AI-enabled fundamental physics research. Mariel is just starting in Wisconsin?

00:00:09.000 --> 00:00:14.000
Finishing my first year. Finishing your first year. Just started. As a new faculty in Wisconsin.

00:00:14.000 --> 00:00:18.000
And also splitting your time with Flatiron Institute? Yeah, okay.

00:00:18.000 --> 00:00:25.000
And, um, and yep, we're very excited to hear more about this work. Thanks.

00:00:25.000 --> 00:00:32.000
Thank you again for the invitation. I was here maybe 2 years ago, I don't know if I met any of you on that visit.

00:00:32.000 --> 00:00:35.000
So, um…

00:00:35.000 --> 00:00:47.000
Yeah, my talk today is, uh, speculating about, you know, a future, maybe oncoming era of what… how AI will change the way that we do physics analyses.

00:00:47.000 --> 00:00:52.000
Um, with the caveat that I'm not going to talk about agents, that's another, um…

00:00:52.000 --> 00:00:55.000
rabbit hole. Um…

00:00:55.000 --> 00:01:03.000
So, uh, a metaphor that I like to give when I… when I talk about sort of how I think about physics analysis is also thinking about how humans orient themselves in the world.

00:01:03.000 --> 00:01:07.000
So I… I like the symbolism of thinking about

00:01:07.000 --> 00:01:11.000
you know, not just physics data, but how do we orient ourselves around a city?

00:01:11.000 --> 00:01:13.000
Um…

00:01:13.000 --> 00:01:18.000
So, for instance, New York, when I first moved here a couple years ago, um,

00:01:18.000 --> 00:01:25.000
It's a very overwhelming place, and I think any new city is like this, where it's kind of too much to sort of encapsulate

00:01:25.000 --> 00:01:29.000
it all in one image, or one idea.

00:01:29.000 --> 00:01:37.000
So, when you move anywhere new, of course, the first thing you do is just sort of look in your neighborhood and see what's around. You look in your immediate surroundings, and you understand

00:01:37.000 --> 00:01:42.000
Where do you get bagels? Where do you get pizza? Where do you do your laundry? Where do you get groceries?

00:01:42.000 --> 00:01:58.000
But, okay, I maybe mastered this. This is not my block. I maybe mastered this after, like, a month or something of moving to a new city, but this is not to say that I understand how the city that I live in really operates. Clearly, I need, like, some other layers to pick this apart.

00:01:58.000 --> 00:02:04.000
So, um, a critical layer, uh, to understanding New York, of course, is the subway system, and

00:02:04.000 --> 00:02:15.000
walking around your neighborhood, you might not know that this is, like, a critical lens through which you should understand the city. It defines everything that's growing above the subway system as well.

00:02:15.000 --> 00:02:22.000
And, um, we don't have to stop here, so, you know, maybe after, like, a couple months of living in a city, you understand your neighborhood, you understand your subway.

00:02:22.000 --> 00:02:31.000
Maybe you also want to understand the history of the city that you live in, and by sort of placing a different layer onto your map of a historical context of the city,

00:02:31.000 --> 00:02:37.000
like this one, I love, um, that sort of illustrates the wall of Wall Street, um, back when New York was New Amsterdam.

00:02:37.000 --> 00:02:43.000
It gives you this, like, different perspective on the city that you live in that really informs your depth of your knowledge.

00:02:43.000 --> 00:02:52.000
And maybe you don't want to think about humans at all in this, uh, in this city. Maybe you actually just want to think about, like, the geologic framework of your city and how…

00:02:52.000 --> 00:03:00.000
different deposits of different types of rocks informed what kinds of buildings could be built on top of them, and then that shapes what humans came to those areas.

00:03:00.000 --> 00:03:06.000
So, you can go back much further in geologic time to kind of understand the place that you live in.

00:03:06.000 --> 00:03:24.000
Um, and then, of course, you know, maybe there's more critical maps, like the best pizza in the city that you live in, or, you know, things that are very personal, but, you know, no less important to your understanding of the place that you live in. People can disagree about what the… For sure. Um, yeah, I think quite a lot is missing from this list.

00:03:24.000 --> 00:03:42.000
So, anyway, I like this metaphor in thinking about, you know, what are we doing as scientists, thinking of us all as living in this one big city that is the universe, that is, like, way too complex for any single image or map to really capture, and so instead, what we're doing as scientists is

00:03:42.000 --> 00:03:48.000
Taking very particular lenses to render different maps of the universe at different scales.

00:03:48.000 --> 00:03:50.000
Uh, with different limitations.

00:03:50.000 --> 00:03:52.000
And then our job is to, like,

00:03:52.000 --> 00:03:59.000
tell the story of stitching all of these different layers together to understand how they inform one another.

00:03:59.000 --> 00:04:07.000
And so this is kind of what I think about when I am imagining in my ideal scenario, how would computational tools

00:04:07.000 --> 00:04:13.000
bolster this type of analysis or this type of study of, like, mapping in a big picture sense.

00:04:13.000 --> 00:04:22.000
So, um, what I'm going to talk about today is sort of imagining what it would take to build a wide-ranging, multi-layered, and data-driven representation of our universe.

00:04:22.000 --> 00:04:32.000
that is meant to complement, as well as deepen our existing scientific understanding. So, you know, I won't claim that this data-driven…

00:04:32.000 --> 00:04:37.000
perspective is going to be, um, a replacement for our understanding, or that it'll be complete.

00:04:37.000 --> 00:04:42.000
But it doesn't have to be complete to be useful, I think, and I think that that's true of maps also.

00:04:42.000 --> 00:04:47.000
Um, so, uh, this is a little bit vague, maybe, so to be more specific, what I'll talk about is…

00:04:47.000 --> 00:04:57.000
by wide-ranging, I'm thinking about multi-detector analyses, or even multidisciplinary analyses, like putting many datasets in conversation with each other.

00:04:57.000 --> 00:05:03.000
multi-layered, I'm talking about modalities of information, um, which I'll define later on in the talk.

00:05:03.000 --> 00:05:08.000
And then, um, data-driven, I mostly mean training on real data without labels.

00:05:08.000 --> 00:05:14.000
To take advantage of just the huge amount and volume of data that we have.

00:05:14.000 --> 00:05:16.000
Um, and so to train

00:05:16.000 --> 00:05:23.000
AI models without labels, we turn to representation learning or self-supervised learning.

00:05:23.000 --> 00:05:30.000
So the… the result of all of this, um, you could call it a foundation model, and maybe you've heard of this metaphor

00:05:30.000 --> 00:05:34.000
for things like ChatGPT as a blurry JPEG of the web.

00:05:34.000 --> 00:05:39.000
Um, so it's kind of like a blurry, condensed representation of something like the internet.

00:05:39.000 --> 00:05:46.000
And I really like this metaphor for a lot of large language models, and so I think we can draw from that in thinking about

00:05:46.000 --> 00:05:48.000
the potential of foundation models in science.

00:05:48.000 --> 00:06:01.000
That, you know, this blurry JPEG of the web, it is not the internet. It's a distinct thing. But that doesn't mean it's not useful, and so in capturing kind of the essential patterns of the internet,

00:06:01.000 --> 00:06:05.000
We can interact with it differently than we can with the body of information of the internet. And it's…

00:06:05.000 --> 00:06:12.000
something in that dynamic, that relationship between the two, um, where it finds utility.

00:06:12.000 --> 00:06:18.000
Cool, so, um, we have a lot of foundation models. Many of us probably use them

00:06:18.000 --> 00:06:26.000
Frequently. Um, so the question is kind of, like, why would this be a particularly unique conversation to be happening within the sciences?

00:06:26.000 --> 00:06:37.000
If we already have foundation models that are widely available. So, uh, even though my talk is framed around this idea of, like, what would it take to build a foundation model, why would we want it?

00:06:37.000 --> 00:06:43.000
Um, I actually think a lot of my talk is just, like, framed around questions of data, like, what makes physics data special?

00:06:43.000 --> 00:06:49.000
And what may or may not, um, present challenges for the existing framework of how we build

00:06:49.000 --> 00:06:53.000
machine learning foundation models, um, in the world.

00:06:53.000 --> 00:06:57.000
So, we… even though we have…

00:06:57.000 --> 00:06:58.000
Oh, yeah.

00:06:58.000 --> 00:07:08.000
I'm sorry. I have a question. So I just want to make sure I'm hearing this right. You're using the word foundation model. Are you going to define what that means?

00:07:08.000 --> 00:07:13.000
Sure, yeah, kind of intentionally in this talk, I guess I, um…

00:07:13.000 --> 00:07:18.000
I'm not framing this as, like, a pitch only for foundation models, but I guess what I would define it as is…

00:07:18.000 --> 00:07:24.000
a large-scale machine learning model that's been trained on diverse datasets.

00:07:24.000 --> 00:07:30.000
in a self-supervised way, meaning without labels, so it's kind of using the data itself to form its own labels.

00:07:30.000 --> 00:07:36.000
And the goal is not to solve any one particular task, but the goal is to create a useful representation

00:07:36.000 --> 00:07:46.000
that can translate onto future tasks that we did not necessarily define ahead of time. So that it has some even small amount of emergent properties.

00:07:46.000 --> 00:07:50.000
such that it can generalize easily to new tasks.

00:07:50.000 --> 00:07:54.000
Or even new datasets that the model hasn't seen before.

00:07:54.000 --> 00:07:59.000
Um, but I think maybe just as useful would be thinking of, um,

00:07:59.000 --> 00:08:03.000
multi-layered, uh, data representations? Like, how do we put, um,

00:08:03.000 --> 00:08:11.000
sources of data from physics into conversation with each other.

00:08:11.000 --> 00:08:17.000
Hopefully that answers your question, and if not, we can talk more about it later.

00:08:17.000 --> 00:08:18.000
Um…

00:08:18.000 --> 00:08:20.000
Okay, okay. Yes, thank you. Right.

00:08:20.000 --> 00:08:25.000
Great. So, we see a lot of examples of foundation models for text or language.

00:08:25.000 --> 00:08:28.000
For images, for video,

00:08:28.000 --> 00:08:30.000
out in the world right now, and so…

00:08:30.000 --> 00:08:42.000
the core question that I'm interested in is basically to what extent will physics data, um, fit within these pre-existing methods that we have to build foundation models for these other, more common data types?

00:08:42.000 --> 00:08:45.000
Or to what extent will physics data present, kind of, um…

00:08:45.000 --> 00:08:54.000
unique challenges that only physicists will have to, um, step up and present solutions for.

00:08:54.000 --> 00:08:58.000
So, I'll frame my talk.

00:08:58.000 --> 00:09:00.000
With, um…

00:09:00.000 --> 00:09:04.000
three ways in which physics data

00:09:04.000 --> 00:09:11.000
has some qualities that I argue are relatively underexplored in standard machine learning settings.

00:09:11.000 --> 00:09:14.000
So the… the first point that I want to make is

00:09:14.000 --> 00:09:19.000
just zooming in on one type of data that I happen to care about, which is particle physics data.

00:09:19.000 --> 00:09:30.000
And so I'll talk about, um, some recent examples that I've worked on, thinking about, uh, how to design machine learning methods for particular qualities of particle physics data.

00:09:30.000 --> 00:09:34.000
So, um, I think maybe the most obvious, um,

00:09:34.000 --> 00:09:36.000
uh… uh…

00:09:36.000 --> 00:09:42.000
quality that makes particle physics data special, I think, is the… the…

00:09:42.000 --> 00:09:43.000
grounding in the standard model of particle physics.

00:09:43.000 --> 00:09:50.000
It's such a unique resource. It's so, um, it's so powerful and profound.

00:09:50.000 --> 00:09:58.000
And, uh, a core part of the standard model of particle physics is, um, is built on symmetry groups, right?

00:09:58.000 --> 00:10:04.000
So, um, so we know that symmetries are critical in understanding particle physics.

00:10:04.000 --> 00:10:20.000
We also know that machine learning models are actually pretty good at understanding how to represent certain symmetries. So, for instance, image models, by design, have symmetric properties included, so things like translation invariance is, um, kind of naturally encoded into how

00:10:20.000 --> 00:10:23.000
convolutional neural nets, which are designed for images.

00:10:23.000 --> 00:10:29.000
are trained. So we know that it can handle some types of symmetries, it being machine learning.

00:10:29.000 --> 00:10:49.000
Um, we also know that we can do this for even more complicated-looking symmetries that we care about within physics. So, for instance, we could imagine encoding some of the other symmetry groups of the standard model, and we have a framework to build group equivariant convolutional networks, and so you can sort of imagine adapting neural network frameworks for any number of, uh,

00:10:49.000 --> 00:10:51.000
symmetry groups that you might care about.

00:10:51.000 --> 00:11:06.000
At the same time, um, sort of more recent evidence over the last few years, um, to me suggests that the benefits of equivariant networks, so neural networks that explicitly build in symmetric invariance to certain groups that you care about,

00:11:06.000 --> 00:11:10.000
It's really only super beneficial in a data-limited regime.

00:11:10.000 --> 00:11:17.000
And that's what these two plots are kind of showing, um, is that even though a symmetry-invariant network

00:11:17.000 --> 00:11:24.000
Um, or an equivariant network, uh, is, uh, kind of more accurate, even across higher Lorentz boosts.

00:11:24.000 --> 00:11:32.000
Um, most of the accuracy results are reported when you only have a fraction of your original training data. So if you're in a setting where you only have less than half a percent,

00:11:32.000 --> 00:11:37.000
of your original training data, maybe you can really see some strong, uh, benefits.

00:11:37.000 --> 00:11:42.000
But today, we live in a pretty data-rich regime, and so, to me, the benefits are less clear.

00:11:42.000 --> 00:11:56.000
And I think that this is just an interesting example to start out with, because I might naively expect that symmetries should be, like, the most critical piece, um, to consider when we build, uh, models that represent physics data, especially particle physics.

00:11:56.000 --> 00:12:01.000
But already, it's… it's clear to me that maybe my intuition is not, um, totally aligned with

00:12:01.000 --> 00:12:05.000
the computational needs of representing physics data.

00:12:05.000 --> 00:12:07.000
So, I've… yeah, I find this something…

00:12:07.000 --> 00:12:11.000
Sorry, so what is in your, in your graph, what is beta?

00:12:11.000 --> 00:12:13.000
Oh, this…

00:12:13.000 --> 00:12:18.000
How should I read that graph? Is more data going to the left? Is that the idea?

00:12:18.000 --> 00:12:25.000
It's, um, it's the strength of the Lorentz boost, so it's sort of how extreme

00:12:25.000 --> 00:12:26.000
of a shift is applied.

00:12:26.000 --> 00:12:29.000
Oh, I…

00:12:29.000 --> 00:12:43.000
Yeah, so going from left to right, the Lorentz boost is getting larger, and so the accuracy, um, decreases for a fully connected network that does not have any symmetry awareness built in.
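
NOTE
A minimal sketch, not part of the talk, of what the boost strength beta on the plot's horizontal axis does to an input. It assumes natural units (c = 1), a boost along the z axis, and a made-up particle; boost_z and invariant_mass are illustrative helper names. The raw four-momentum components that an unconstrained network would see change with beta, while the invariant mass does not, which is the kind of quantity a Lorentz-equivariant architecture is built around.
```python
import math
def boost_z(p, beta):
    """Boost a four-momentum (E, px, py, pz) along z with velocity beta (in units of c)."""
    E, px, py, pz = p
    gamma = 1.0 / math.sqrt(1.0 - beta * beta)
    return (gamma * (E - beta * pz), px, py, gamma * (pz - beta * E))
def invariant_mass(p):
    """Return sqrt(E^2 - |p|^2), clipped at zero to guard against rounding."""
    E, px, py, pz = p
    return math.sqrt(max(E * E - px * px - py * py - pz * pz, 0.0))
# Hypothetical particle: mass 1 and momentum (0.3, 0.4, 1.2), arbitrary units.
px, py, pz = 0.3, 0.4, 1.2
p = (math.sqrt(1.0 + px * px + py * py + pz * pz), px, py, pz)
# The components shift with beta; the mass column stays fixed.
for beta in (0.0, 0.5, 0.9, 0.99):
    q = boost_z(p, beta)
    print(f"beta={beta:4.2f}  E={q[0]:8.3f}  pz={q[3]:8.3f}  m={invariant_mass(q):.6f}")
```
Plain standard-library Python is used so the arithmetic is easy to check by hand; LorentzNet's actual construction is not reproduced here.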

00:12:43.000 --> 00:12:44.000
But the plot is at which training percentage?

00:12:44.000 --> 00:12:49.000
100%, or…? It's a great question, I don't remember. Um, yeah, I'd have to check.

00:12:49.000 --> 00:13:01.000
And then from the table, we're supposed to take away what? As the training percentage goes up, the gap between the two narrows in some sense, or what? I think, really, I would just note that, you know, this is how…

00:13:01.000 --> 00:13:07.000
these results had to be reported, is… the comparison is really only useful

00:13:07.000 --> 00:13:16.000
for under 5% of the training percentage. Otherwise, the results are compatible, or the equivariant network is outperformed by the unconstrained model. Okay.

00:13:16.000 --> 00:13:24.000
Right. You said that, but I'm trying to understand how that's being illustrated by the graph or the table.

00:13:24.000 --> 00:13:40.000
I mean, intuitively, what you said is very clear: if you have symmetry and only a little bit of data, then the symmetry is going to be powerful. If you have lots and lots of data, then who needs the symmetry? But I'm just trying to understand how that's illustrated in the graph or the table.

00:13:40.000 --> 00:13:46.000
The table, I have artificially cut off. These are the results that are presented in this paper.

00:13:46.000 --> 00:13:50.000
And, uh, with training fractions that are larger, um,

00:13:50.000 --> 00:13:52.000
Uh, they…

00:13:52.000 --> 00:13:58.000
LorentzNet is outperformed by ParticleNet. So here, uh, the accuracy is higher.

00:13:58.000 --> 00:14:02.000
Um, but ParticleNet wins out, ultimately, with full access to the data.

00:14:02.000 --> 00:14:10.000
Um, but you're right that I don't… I don't show the reverse situation here, because I don't extend the… the table further.

00:14:10.000 --> 00:14:15.000
So, within a certain regime, which I argue is, you know, in a smaller data regime, um,

00:14:15.000 --> 00:14:23.000
we do see these benefits, but in a larger data regime, uh, the benefits are less clear.

00:14:23.000 --> 00:14:36.000
I think for the sake of the audience, maybe the ParticleNet is the unconstrained… it doesn't use… That's right, LorentzNet, right, has sort of Lorentz equivariance, um, constructed into its architecture.

00:14:36.000 --> 00:14:37.000
Great. So, uh, I'm just gonna keep…

00:14:37.000 --> 00:14:40.000
Breezing ahead, but I invite questions.

00:14:40.000 --> 00:14:42.000
throughout, and we can talk more at the end, too.

00:14:42.000 --> 00:14:45.000
Um, my point here is basically, um,

00:14:45.000 --> 00:14:51.000
it's not always obvious what the best representation of a given dataset is.

00:14:51.000 --> 00:14:56.000
And so, arguably, you could think of our choices as sort of spanning, uh…

00:14:56.000 --> 00:15:09.000
a, um, a spectrum between maybe fully prescribed industry standard formats that are friendly for image models or sequential models, or point cloud models to ingest.

00:15:09.000 --> 00:15:19.000
Um, maybe in the middle is something like a physics-informed representation, where we do choose to build in specific symmetries that are not present in industry standard datasets, such as Lorentz equivariance.

00:15:19.000 --> 00:15:28.000
And then maybe on the furthest extreme is, like, a fully self-supervised, unconstrained, you know, we let the model determine what representation is most useful.

00:15:28.000 --> 00:15:31.000
in a fully data-driven format.

00:15:31.000 --> 00:15:34.000
So, um, it…

00:15:34.000 --> 00:15:40.000
I don't think there's a single answer right now to which of these representations is most useful.

00:15:40.000 --> 00:15:45.000
It used to be that maybe we had more industry standard formats for our physics data, um,

00:15:45.000 --> 00:15:48.000
a few years ago, maybe 10 years ago,

00:15:48.000 --> 00:15:56.000
Nowadays, I think more researchers are exploring, kind of, more unconstrained and self-supervised representations for physics. On the previous slide…

00:15:56.000 --> 00:16:00.000
Were you comparing the physics-informed one to the left one or the right one?

00:16:00.000 --> 00:16:02.000
On the previous slide, um,

00:16:02.000 --> 00:16:08.000
This, uh, this plot is showing the unconstrained model.

00:16:08.000 --> 00:16:13.000
Yeah, so is that data-driven, or is that industry standard?

00:16:13.000 --> 00:16:24.000
Um, you could say that this is, uh, more of an industry standard representation, and then this would be an equivariant representation.

00:16:24.000 --> 00:16:31.000
Um, yes, right.

00:16:31.000 --> 00:16:38.000
So, I think my point here is just that the question is not resolved yet. We have some evidence that it seems to be most effective

00:16:38.000 --> 00:16:44.000
in a data-limited regime, I'm not counting out the usefulness of building in

00:16:44.000 --> 00:16:48.000
Um, physical symmetric constraints into our representations.

00:16:48.000 --> 00:16:54.000
But it's in tension with this other, uh, clear trend, which is that scaling up our models

00:16:54.000 --> 00:17:00.000
also yields very powerful representations of our data, and so I think it's, um…

00:17:00.000 --> 00:17:07.000
challenging a lot of these intuitions, and I think these two facts will continue to be in tension over the next few years.

00:17:07.000 --> 00:17:13.000
Um, next, I want to quickly also mention another application of machine learning, sort of,

00:17:13.000 --> 00:17:17.000
capturing particularities of physics data.

00:17:17.000 --> 00:17:21.000
And this is related to the idea of what anomalies look like in particle physics.

00:17:21.000 --> 00:17:31.000
Um, and so I, uh, I presented, um, slides like this a couple years ago, but I wanted to mention it because it was from a collaboration, um, that David and I had a few years ago.

00:17:31.000 --> 00:17:35.000
Um, so, thinking about, um,

00:17:35.000 --> 00:17:38.000
When I talk to statisticians about anomalies,

00:17:38.000 --> 00:17:43.000
sometimes I realize that we're actually talking about pretty different things. You know, I'm not talking about a one-off.

00:17:43.000 --> 00:17:48.000
anomaly. Um, I'm thinking instead of statistical anomalies or group anomalies,

00:17:48.000 --> 00:17:58.000
Um, because particle physicists care about, you know, aggregating more and more data, and then showing that our results are incompatible with the background hypothesis.

00:17:58.000 --> 00:18:06.000
So, if we care about kind of looking for bump-like anomalies in particles, in particle physics, our machine learning methods can also be designed to reflect this type of search.

00:18:06.000 --> 00:18:20.000
And, um, we see this, uh, in a number of more recent applications of machine learning and particle physics to do things like rediscover the Upsilon particle in CMS open data.

00:18:20.000 --> 00:18:25.000
using model-agnostic anomaly detection search strategies with machine learning.

00:18:25.000 --> 00:18:33.000
And, uh, also for Beyond the Standard Model searches, um, this is a search for a SUSY signal, um, and, uh,

00:18:33.000 --> 00:18:41.000
We've shown that, you know, you can inject much smaller amounts of signal, and you can still be sensitive to beyond the standard model signals.

00:18:41.000 --> 00:18:46.000
With these types of machine learning anomaly detection strategies.

00:18:46.000 --> 00:18:55.000
And, um, uh, what David and I worked on is, um, thinking also about where else in the… in physics do group anomalies show up.

00:18:55.000 --> 00:19:01.000
And, uh, realizing that in astrophysics data, and particularly in the form of stellar streams,

00:19:01.000 --> 00:19:07.000
Um, you can cast this in a similar light of thinking of group anomalies, um,

00:19:07.000 --> 00:19:10.000
in… in, uh, the Milky Way.

00:19:10.000 --> 00:19:14.000
And so, um, what we worked on is, sort of,

00:19:14.000 --> 00:19:19.000
Uh, thinking about stellar streams, which are these ancient remnants of smaller galaxies or dwarf galaxies,

00:19:19.000 --> 00:19:26.000
Um, that have been stretched out due to gravitational forces as they orbit the Milky Way.

00:19:26.000 --> 00:19:33.000
And, uh, nevertheless, even though they spatially are very, uh, spread out and quite faint and thin,

00:19:33.000 --> 00:19:36.000
Um, they all came from the same astrophysical source.

00:19:36.000 --> 00:19:42.000
And so, as a result, they… the stars that are members of the stellar stream just have very similar velocities.

00:19:42.000 --> 00:19:46.000
And so if you project into proper motion space,

00:19:46.000 --> 00:19:56.000
the stellar stream, um, velocities really stand out as kind of their own bump compared to the background of other stars that don't appear to be moving in the same way.

00:19:56.000 --> 00:19:59.000
And, uh, so we showed that you can sort of cast

00:19:59.000 --> 00:20:07.000
this same search for stellar streams in the same way that you would cast a model-agnostic search for particles against a smoothly falling distribution.

00:20:07.000 --> 00:20:15.000
And, um, as a result, our machine learning model, which was designed to detect these types of bumps, um, automatically,

00:20:15.000 --> 00:20:16.000
can, uh, reproduce, um,

00:20:16.000 --> 00:20:24.000
can get very similar, um, can identify similar stars as in sort of a more, um,

00:20:24.000 --> 00:20:26.000
Uh… uh…

00:20:26.000 --> 00:20:32.000
a more hand-drawn, maybe, or, um, curated list of stars that should be members of the stream.

00:20:32.000 --> 00:20:36.000
And, um, and David has extended these studies much further in recent years, too.

00:20:36.000 --> 00:20:40.000
Um, so, uh…

00:20:40.000 --> 00:20:44.000
So I just wanted to, you know, start out by just sort of listing, like, outside of…

00:20:44.000 --> 00:20:55.000
the idea of combining multiple datasets together, just within a particular analysis or study, already we've seen that physics data has particular qualities that are kind of unusual in other settings.

00:20:55.000 --> 00:21:00.000
And, um, sometimes treating those qualities carefully

00:21:00.000 --> 00:21:02.000
creates, uh, real benefits.

00:21:02.000 --> 00:21:07.000
Sometimes the benefits are not as obvious, um, with a careful treatment.

00:21:07.000 --> 00:21:09.000
So, next I'm going to imagine

00:21:09.000 --> 00:21:18.000
let's say we, uh, were fully convinced that it was actually useful to train a model on multiple datasets, um, from physics.

00:21:18.000 --> 00:21:26.000
Even if we wanted to do it, um, what would some of the potential challenges be in combining datasets from multiple areas or multiple detectors?

00:21:26.000 --> 00:21:33.000
So, the first thing I'll talk about here is just that, of course, physics data comes in all kinds of formats that are

00:21:33.000 --> 00:21:35.000
not obvious, um, to combine.

00:21:35.000 --> 00:21:43.000
So, whether we're talking about particle physics data from collider physics, um, we could talk about fluid dynamics, or PDEs,

00:21:43.000 --> 00:21:55.000
We could talk about LIGO data, um, or other sources of astrophysics data. All of these, you know, came from very specialized detectors, specialized formats, um, vastly different scales.

00:21:55.000 --> 00:21:59.000
Uh, it's not at all clear how we might sort of, um,

00:21:59.000 --> 00:22:04.000
have these datasets talk to one another in the context of a machine learning model.

00:22:04.000 --> 00:22:14.000
So, um, particle physicists have already thought about this a little bit, um, because we are especially interested in using our data and combining it

00:22:14.000 --> 00:22:16.000
or, sorry, comparing it with theoretical predictions.

00:22:16.000 --> 00:22:31.000
So, already we have some methods in particle physics to publish our data with detector distortions removed. So the idea being that we know that our detectors have some biases built into them, so maybe at minimum, if we were to publish our data and have it be, um,

00:22:31.000 --> 00:22:38.000
Combined with other sources, we would want to remove those detector biases before we use the data.

00:22:38.000 --> 00:22:46.000
So in particle physics, the way that this works is through a process called unfolding, which is also, like, deconvolution. Um, it's basically removing the detector distortions from your data.

00:22:46.000 --> 00:22:58.000
And the core idea of it, um, in sort of cartoon terms, is that you start with your measured distribution of some parameter that you care about, but we know that this is not actually the true

00:22:58.000 --> 00:23:04.000
say, jet momentum distribution that we cared about, and our detector modified this distribution in some way.

00:23:04.000 --> 00:23:13.000
Um, as the information traveled through our detector, which has, um, inefficiencies and smearing effects and, uh, these types of distortions.

00:23:13.000 --> 00:23:26.000
So, traditionally, um, the way that we think about, uh, uh, sort of removing these effects is in a kind of re-weighting scenario, where you might start with your measured distribution, and you want to learn some re-weighting,

00:23:26.000 --> 00:23:32.000
So that way you can make your measured distribution look more like a true distribution.

00:23:32.000 --> 00:23:36.000
And this requires some knowledge of how your detector operates, so that way you have, kind of, samples of

00:23:36.000 --> 00:23:43.000
distorted data and, um, true data. So usually it's done with simulation.

00:23:43.000 --> 00:23:56.000
And, um, a nice way to do this type of re-weighting is to learn a likelihood ratio. So you can take the probability distributions of each of these two, the measured and the true, um, observable.

00:23:56.000 --> 00:24:06.000
And, uh, by learning the likelihood ratio, this gives you a nice re-weighting function, so for any point along the spectrum, you can re-weight the distribution to look more like the true measured.

00:24:06.000 --> 00:24:08.000
Uh, data.

00:24:08.000 --> 00:24:15.000
And as a quick aside, um, the way that machine learning is involved in this process, um,

00:24:15.000 --> 00:24:17.000
is exploiting this fact, which is that a

00:24:17.000 --> 00:24:23.000
uh, sort of very vanilla, um, uh, traditional neural network.

00:24:23.000 --> 00:24:29.000
that classifies between two classes can often be trained, say, with binary cross-entropy loss.

00:24:29.000 --> 00:24:31.000
Which looks like this.

00:24:31.000 --> 00:24:35.000
And, uh, the function that ends up minimizing this expression

00:24:35.000 --> 00:24:39.000
will satisfy the following for any small variation.

00:24:39.000 --> 00:24:43.000
Because it's a minimum of this loss functional.

00:24:43.000 --> 00:24:51.000
And from this, we can just, um, kind of extract, uh, this central term and, uh, rearrange, um,

00:24:51.000 --> 00:25:04.000
some, uh, the expression to find, uh, this really useful expression. So what I'm showing here is basically on the left is a rescaling of the classifier output that has learned how to optimally classify between two classes, A and B.

00:25:04.000 --> 00:25:10.000
And so F of X is, like, the function that defines the classification score.

00:25:10.000 --> 00:25:16.000
Um, so you can just kind of take your classifier output and divide it by 1 minus the classifier score.

00:25:16.000 --> 00:25:23.000
And on the right, um, we have our likelihood ratio.

00:25:23.000 --> 00:25:34.000
um… likelihood ratio here on the right, which is the parameter that we… or, sorry, the function that we wanted to learn in order to learn our re-weighting function in order to remove detector effects.
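
NOTE
The slide equations are not captured in this transcript. As a hedged reconstruction of the standard argument, assuming balanced classes A and B, the binary cross-entropy functional, its stationarity condition, and the resulting identity can be written as follows. The symbols f, p_A, and p_B are assumptions chosen to match the "F of X" and "classes A and B" named in the talk.
```latex
\begin{align}
L[f] &= -\int \mathrm{d}x \,\Big[ p_A(x)\,\log f(x) + p_B(x)\,\log\big(1 - f(x)\big) \Big], \\
\frac{\delta L}{\delta f(x)} &= -\frac{p_A(x)}{f(x)} + \frac{p_B(x)}{1 - f(x)} = 0, \\
\frac{f(x)}{1 - f(x)} &= \frac{p_A(x)}{p_B(x)}.
\end{align}
```
The left-hand side of the last line is the rescaled classifier output and the right-hand side is the likelihood ratio.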

00:25:34.000 --> 00:25:40.000
Okay, so, um, so in practice, the way that we use neural networks is to train classifiers, and then we

00:25:40.000 --> 00:25:44.000
sort of reuse the classifiers in order to get these likelihood ratios.
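
NOTE
A minimal sketch of reusing a classifier as a likelihood-ratio estimator, under stated assumptions: two toy 1D Gaussian samples stand in for the two classes, and a two-parameter logistic regression stands in for the neural network. None of the names below come from the talk.
```python
import math
import random
random.seed(0)
# Toy stand-ins (assumptions): class A ~ N(1, 1) is the target, class B ~ N(0, 1) is the source.
n = 2000
xs_a = [random.gauss(1.0, 1.0) for _ in range(n)]
xs_b = [random.gauss(0.0, 1.0) for _ in range(n)]
data = [(x, 1.0) for x in xs_a] + [(x, 0.0) for x in xs_b]
def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))
# Classifier f(x) = sigmoid(w * x + b), trained with binary cross-entropy by full-batch gradient descent.
w, b, lr = 0.0, 0.0, 1.0
for _ in range(300):
    grad_w = grad_b = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y  # derivative of the BCE loss with respect to the logit
        grad_w += err * x
        grad_b += err
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)
def likelihood_ratio(x: float) -> float:
    """Estimate p_A(x) / p_B(x) as f(x) / (1 - f(x))."""
    f = sigmoid(w * x + b)
    return f / (1.0 - f)
# Reweight the class-B sample so that, as a distribution, it resembles class A.
weights = [likelihood_ratio(x) for x in xs_b]
reweighted_mean_b = sum(wt * x for wt, x in zip(weights, xs_b)) / sum(weights)
print(f"w={w:.2f}, b={b:.2f}, reweighted mean of B={reweighted_mean_b:.2f}")
```
For these two Gaussians the exact log-ratio is x - 0.5, so the fit should land near w = 1 and b = -0.5, and the reweighted mean of B should move from about 0 toward 1.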

00:25:44.000 --> 00:25:50.000
So you might be able to do this for, um, say, two, not just one, but maybe two different observables at once.

00:25:50.000 --> 00:25:56.000
And, um, in practice, this is how a lot of LHC measurements are published, um, as…

00:25:56.000 --> 00:26:02.000
differential cross-sections at the particle level, meaning they have been sort of corrected for any detector effects.

00:26:02.000 --> 00:26:11.000
And what I'm showing here is… is really… it's a two-dimensional measurement, even though it looks like six plots, um, just because it's… it's a measurement of one observable.

00:26:11.000 --> 00:26:17.000
in bins of another observable. So you can think of it as just two dimensions, but it's in… it's in bins.

00:26:17.000 --> 00:26:26.000
Um, so this is sort of, uh, in a more standard way, how a lot of standard model measurements are reported, um, to correct for these distortions.

00:26:26.000 --> 00:26:35.000
Um, but by using machine learning in this framework that I just described, um, we've recently been able to do this for much higher dimensional datasets. So, um, this…

00:26:35.000 --> 00:26:43.000
measurement that I'm describing now is a 24-dimensional measurement, whereas previously this was really only able to be done for, like, 2 or 3 observables at a time.

00:26:43.000 --> 00:26:52.000
And, uh, so this is what it looks like in cartoon form. You know, we're sort of estimating these likelihood ratios, and we're applying them onto 24 different observables.

00:26:52.000 --> 00:26:58.000
such that, as a distribution, all 24 of these observables are corrected to look more like true

00:26:58.000 --> 00:27:00.000
true data.

00:27:00.000 --> 00:27:06.000
Um, and the non-cartoon version looks like this, where we can sort of report the corrected, um, measurements.

00:27:06.000 --> 00:27:11.000
And compare them, the pink and the blue here are, um…

00:27:11.000 --> 00:27:16.000
Monte Carlo simulations, and so we can really try to… try to look and see where does our Monte Carlo coverage…

00:27:16.000 --> 00:27:22.000
look good, and where does it, um, fall short of our corrected data?

00:27:22.000 --> 00:27:24.000
And another nice feature of this?

00:27:24.000 --> 00:27:33.000
is that, um, we don't have to be limited to just the 24 observables that we happen to choose for this measurement, so you can take the ratio of two of them,

00:27:33.000 --> 00:27:44.000
And, um, this is done because it's using machine learning, but it's done in an unbinned way, and so it allows us to sort of, um, trivially change the bins here to allow

00:27:44.000 --> 00:27:49.000
For, um, multiple observables to be compatible.

00:27:49.000 --> 00:27:59.000
And further, maybe you want to change the phase space of your measurement. This is something that would not be easy to do with a sort of fixed binned measurement, but because it's unbinned,

00:27:59.000 --> 00:28:05.000
We can apply these types of cuts after the measurement has happened, and then, um, uh,

00:28:05.000 --> 00:28:08.000
compare our Monte Carlo in new regimes.

00:28:08.000 --> 00:28:19.000
And we can do, sort of, fancier combinations. This is a delta R distribution, which again was not part of the original measurement, but it can be constructed post hoc.

00:28:19.000 --> 00:28:22.000
Um, using a combination of around 8 different observables.

00:28:22.000 --> 00:28:27.000
Or we can report the average of one observable in bins of another, and so even though all of these are binned,

00:28:27.000 --> 00:28:34.000
plots. It's an unbinned measurement, and so I'm just choosing my bins, um, and I could trivially change them.
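
NOTE
A minimal sketch of why an unbinned result is flexible, assuming the published product is a list of events with per-event unfolding weights. The event values, field names, and the helper weighted_histogram are hypothetical, not the format of any actual measurement.
```python
# Hypothetical unfolded events: per-event observables plus an unfolding weight.
events = [
    {"pt": 12.0, "mass": 3.0, "weight": 0.5},
    {"pt": 25.0, "mass": 2.5, "weight": 1.0},
    {"pt": 40.0, "mass": 12.0, "weight": 1.5},
    {"pt": 55.0, "mass": 16.5, "weight": 1.0},
]
def weighted_histogram(values, weights, edges):
    """Sum weights in half-open bins [edges[i], edges[i + 1])."""
    counts = [0.0] * (len(edges) - 1)
    for value, weight in zip(values, weights):
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += weight
                break
    return counts
pt = [e["pt"] for e in events]
weights = [e["weight"] for e in events]
# The same events can be histogrammed with any bin edges after the fact.
coarse = weighted_histogram(pt, weights, [0.0, 30.0, 60.0])
fine = weighted_histogram(pt, weights, [0.0, 20.0, 40.0, 60.0])
# A derived observable (a ratio) with a phase-space cut applied post hoc.
selected = [e for e in events if e["pt"] > 20.0]
ratio = [e["mass"] / e["pt"] for e in selected]
ratio_hist = weighted_histogram(ratio, [e["weight"] for e in selected], [0.0, 0.2, 0.4])
print(coarse, fine, ratio_hist)
```
The binning, the cut, and the ratio are all chosen after the weights exist, which is the property described above.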

00:28:34.000 --> 00:28:39.000
Point being, we can choose to publish our data in much more flexible formats.

00:28:39.000 --> 00:28:43.000
with, uh, corrections for detector effects.

00:28:43.000 --> 00:28:51.000
And, uh, this is not something that's only happened at the LHC in recent years, even though the ATLAS measurement was a major one.

00:28:51.000 --> 00:28:58.000
So, more recently, uh, I led this white paper that's, uh, like a practical guide to unbinned unfolding across a series of experiments.

00:28:58.000 --> 00:29:07.000
So, we've had, um, measurements do this from ATLAS, from CMS, from H1, LHCb, STAR, and T2K.

00:29:07.000 --> 00:29:11.000
And so, the point being that now this is a, um…

00:29:11.000 --> 00:29:18.000
Uh, it's becoming a more popular standardized format for how to think about publishing our data in particle physics.

00:29:18.000 --> 00:29:25.000
Such that, um, it's easier for theorists to work with, it's easier for other experimentalists to work with.

00:29:25.000 --> 00:29:32.000
And I argue, if we ever wanted to put these datasets in conversation with one another by training a machine learning model on multiple datasets,

00:29:32.000 --> 00:29:39.000
using this type of format and correcting for detector effects makes that easier.

00:29:39.000 --> 00:29:47.000
Um, and then one other point on this idea of, like, how… how do we have to configure our data in order to

00:29:47.000 --> 00:29:50.000
um, potentially enable combinations of multiple datasets.

00:29:50.000 --> 00:29:58.000
is that, um, not only do we have to worry about detector biases, like detector-specific effects in our data,

00:29:58.000 --> 00:30:04.000
But we also have to think about the representations of our data in a machine learning model.

00:30:04.000 --> 00:30:07.000
These are not obvious, um, and…

00:30:07.000 --> 00:30:16.000
Uh, right now, it's sort of the common paradigm for how to do this is to create custom embeddings for every single dataset that you might want to train a model on.

00:30:16.000 --> 00:30:19.000
So, um, maybe you have…

00:30:19.000 --> 00:30:27.000
A model that takes in jets, um, particle jets. You can train your own tokenizer that sort of creates a representation of particle jets.

00:30:27.000 --> 00:30:35.000
And, um, you have to sort of verify that the tokenizer has captured all the qualities that you care about of your data.

00:30:35.000 --> 00:30:46.000
And then if you want to train a model on, um, galaxy spectra, you have to do the same thing. So you'd have to design your own tokenizer and, like, carefully verify that it captures the properties that you care about.

00:30:46.000 --> 00:31:01.000
So we can do this. It's a little tedious, and it's also a bit inflexible, especially if you imagine publishing a model that's been trained on some kind of data, but then you have a different kind of data. You wouldn't… it wouldn't be obvious how to, um,

00:31:01.000 --> 00:31:08.000
represent your data such that you could use that same model, um, to work on it.

00:31:08.000 --> 00:31:12.000
So, it's also interesting to consider more generic embeddings as, like, a potential alternative to this.

00:31:12.000 --> 00:31:18.000
Um, and, uh, the most generic embedding that I know about would be something like text.

00:31:18.000 --> 00:31:26.000
Where you can imagine sort of rendering all of your data in, like, a JSON or, like, a text-like format, such that it's, um,

00:31:26.000 --> 00:31:32.000
totally, uh, trivial to combine data from many different formats into a single training corpus.

00:31:32.000 --> 00:31:40.000
Which is why large language models were the first foundation models to be published, is because text is the easiest dataset to, um, combine.

00:31:40.000 --> 00:31:48.000
Um, so, uh, there are a lot of challenges with this, like, more speculative, um, alternate to, uh, dedicated embeddings.

00:31:48.000 --> 00:31:55.000
Um, the biggest, uh, obstacle to this is that numbers and text are, like, not, uh,

00:31:55.000 --> 00:31:57.000
natively, uh,

00:31:57.000 --> 00:32:05.000
They don't… they don't operate well, um, with each other, and language… language models especially, and sort of the tokenizers that they use to represent language,

00:32:05.000 --> 00:32:12.000
don't really capture what makes numbers different from other types of language tokens.

00:32:12.000 --> 00:32:16.000
And so, common tokenization methods for numbers,

00:32:16.000 --> 00:32:20.000
um, are pretty rudimentary, in a sense, if you think about it. Like, you can sort of alternately…

00:32:20.000 --> 00:32:27.000
tokenize each individual digit, but then that sort of takes a lot of the contextual information out of your representation.

00:32:27.000 --> 00:32:38.000
Or you can imagine tokenizing every number as its own token, which clearly does not scale if we try to tokenize every number. It's like a nonsense, uh…

00:32:38.000 --> 00:32:43.000
task. So this is clearly different, uh, difficult, and, um,

00:32:43.000 --> 00:32:54.000
One possible way of getting around this is by imagining creating dedicated numerical tokens to represent a continuous numerical spectrum. So, how do we make this continuous?

00:32:54.000 --> 00:33:01.000
is by scaling the magnitude of the representation, so… such that a larger number would sort of be the same, like, number

00:33:01.000 --> 00:33:06.000
object, but would be larger in embedding space, um, in magnitude.
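
NOTE
A minimal sketch of the magnitude-scaling idea, assuming a single shared number token whose embedding vector is multiplied by the numerical value. The vector values and the names NUM_TOKEN_VECTOR and embed_number are hypothetical, and a real model would likely normalize values to a bounded range first.
```python
# Hypothetical learned embedding for one shared [NUM] token (values chosen for illustration).
NUM_TOKEN_VECTOR = [0.5, -0.25, 1.0]
def embed_number(value: float) -> list[float]:
    """Embed a number as the shared [NUM] vector scaled by the number's value."""
    return [value * component for component in NUM_TOKEN_VECTOR]
small = embed_number(2.0)
large = embed_number(4.0)
print(small)
print(large)
```
Both numbers point in the same direction in embedding space, and the larger one is simply longer, so nearby values get nearby embeddings.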

00:33:06.000 --> 00:33:13.000
And, uh, this worked better than we expected, um, especially in out-of-distribution generalization.

00:33:13.000 --> 00:33:20.000
This is an interpolation plot that interpolates, um, so this whole region has not been seen during training.

00:33:20.000 --> 00:33:26.000
And if we combine this method, or if we compare this method with other, sort of, standard numerical tokenizations,

00:33:26.000 --> 00:33:33.000
Only this kind of continuous tokenization gets close to a smooth interpolation out of distribution.

00:33:33.000 --> 00:33:42.000
And I was quite surprised by this, that the models, um, trained using this continuous tokenization were able to even reproduce, um,

00:33:42.000 --> 00:33:47.000
planetary orbits, um, even though they're not trained autoregressively. And you can see that, uh, non…

00:33:47.000 --> 00:33:52.000
continuous tokenizations just completely fail at this type of task.

00:33:52.000 --> 00:33:58.000
So, um, who knows? It could be that, you know, maybe, uh, we…

00:33:58.000 --> 00:34:05.000
will want to have an alternative to specialized tokenization schemes for all of our variety of datasets out there.

00:34:05.000 --> 00:34:12.000
Um, regardless, I think, um, a message I want to send here is that I think language will be a critical

00:34:12.000 --> 00:34:15.000
modality or input, um,

00:34:15.000 --> 00:34:17.000
Whether it's used as a data format or not.

00:34:17.000 --> 00:34:24.000
And the reason for this is basically context is going to be critical, especially if we imagine putting multiple datasets into a single model.

00:34:24.000 --> 00:34:38.000
Like, these are two images, but they represent completely different scales of what they're capturing. Um, this is, of course, a black hole on the left, and then on the right is, like, an image in a calorimeter at ATLAS of a pion.

00:34:38.000 --> 00:34:43.000
So, um, even though these both are sort of, like, pixelated scientific images,

00:34:43.000 --> 00:34:50.000
to a model, um, it's critical that the model understands what it is… what is being captured in each of these settings.

00:34:50.000 --> 00:34:59.000
And, um, we see that this actually has, like, tangible performance outcomes, too. So, um, I like this example from… this is just time series forecasting.

00:34:59.000 --> 00:35:11.000
Um, that these researchers found that context is key in the sense that by providing just a little bit of context to a time series prediction model that's been trained on a very diverse source of different time series,

00:35:11.000 --> 00:35:17.000
Um, having just a little bit of context about, like, what, uh, the time series represents, when it was captured.

00:35:17.000 --> 00:35:24.000
really constrains the overall uncertainty of the, um, rollout predictions of where the time series should go.

00:35:24.000 --> 00:35:34.000
And I think it's likely that this will hold up in, um, in physics contexts, where our model is having to make predictions across a variety of huge scales.

00:35:34.000 --> 00:35:37.000
And, um, we're also seeing that… Oh, yep.

00:35:37.000 --> 00:35:40.000
So, you took a model…

00:35:40.000 --> 00:35:44.000
It was just built to…

00:35:44.000 --> 00:35:46.000
interpret time series.

00:35:46.000 --> 00:35:51.000
And you told it something about historical data, or…

00:35:51.000 --> 00:35:53.000
power plants.

00:35:53.000 --> 00:35:58.000
whatever. Sure, right, yeah. In this case, I think it's power plants. And it was able to take…

00:35:58.000 --> 00:36:04.000
that information and get a more accurate prediction for protection.

00:36:04.000 --> 00:36:11.000
That's right, and these blue bands are sort of representing… they test the model many times to sort of ask it to complete the time series.

00:36:11.000 --> 00:36:22.000
And the variation is huge, um, unless you provide a little bit of context, in which case the variation gets… And that context was just that bit of text that you picked. That's right.

00:36:22.000 --> 00:36:25.000
Wow. Yeah. I don't know, this…

00:36:25.000 --> 00:36:29.000
This is a model trained on our, uh…

00:36:29.000 --> 00:36:38.000
What is it? It's real… it's time series data, but it's across, like, a number of different, um, settings, so it's not just power plants, um… Oh, okay. The additional data is just those sentences.

00:36:38.000 --> 00:36:43.000
I was telling it prior that it's, you know, it's a solar power, so of course it should…

00:36:43.000 --> 00:36:46.000
go down during the nighttime, right?

00:36:46.000 --> 00:36:51.000
definitely, and I think it's not obvious if… We have all sorts of time series, how it is…

00:36:51.000 --> 00:36:56.000
What… was its previous experience like?

00:36:56.000 --> 00:37:02.000
power plants and five other different types of time series, or was it a million different?

00:37:02.000 --> 00:37:12.000
other time series, any kind of time series going up, I mean, what does it do? I think it was trained on the stock market, yeah, like, um, population, like…

00:37:12.000 --> 00:37:18.000
traffic. Yeah, but it really makes a difference whether the…

00:37:18.000 --> 00:37:21.000
The training data,

00:37:21.000 --> 00:37:28.000
fell into a few discrete categories, and that sentence was enough to tell it, oh, it's…

00:37:28.000 --> 00:37:33.000
my Category 5. Right, right. Or whether it was a million different…

00:37:33.000 --> 00:37:38.000
Categories. Yeah. And then it was smart enough to…

00:37:38.000 --> 00:37:50.000
somehow guess which… Right. So I don't know from what you've told us which of those things is true. I believe this one was trained on, um, the Monash time series benchmark dataset, which

00:37:50.000 --> 00:37:56.000
Uh, off the top of my head, I think it's probably, like, 30 broad categories of time series data.

00:37:56.000 --> 00:37:58.000
But, um…

00:37:58.000 --> 00:38:03.000
I think the settings are quite diverse, so I think just… I don't think, like, power plant…

00:38:03.000 --> 00:38:05.000
time series isn't… is…

00:38:05.000 --> 00:38:08.000
Um, one category, so, um…

00:38:08.000 --> 00:38:17.000
And I think you're right that, you know, of course it has a little bit more information, so it… I don't think that this is super surprising. I think it's more…

00:38:17.000 --> 00:38:23.000
It indicates to me that if we train a model in an unconstrained way on very diverse data sources,

00:38:23.000 --> 00:38:29.000
It is going to be helpful to provide context as we prompt the model to make predictions.

00:38:29.000 --> 00:38:35.000
So, yeah, I think that this gets more extreme as you have more diverse training data.

00:38:35.000 --> 00:38:45.000
So, I actually wanted to ask you about your numbers, uh, work. So… Oh, yeah. Like, I don't know if… I'm sure everyone's noticed, but, like, the…

00:38:45.000 --> 00:38:49.000
chatbots have gotten a lot better at math in the last few years.

00:38:49.000 --> 00:38:55.000
So by now, you can almost basically count on them to do arithmetic and bookkeeping correctly.

00:38:55.000 --> 00:39:00.000
That's right. So do you know what they've done under the hood? Have they used your work?

00:39:00.000 --> 00:39:05.000
For the… They use tools. They use tools, yeah. It's secret stuff they can't tell us about. At the moment.

00:39:05.000 --> 00:39:13.000
Basically, they use Python, um, or they'll use a calculator. Um, so, uh, that's what's happening under the hood, is that the core representation

00:39:13.000 --> 00:39:19.000
has not gotten much better at arithmetic, and I actually think I have a backup slide on this, um, which I thought was kind of fun.

00:39:19.000 --> 00:39:23.000
And so that's sort of distinct from what you guys were proposing?

00:39:23.000 --> 00:39:25.000
You're proposing a different representation?

00:39:25.000 --> 00:39:31.000
Right. It's not just tokenizing all the numbers, but actually, you had a list of how you could represent the data, uh,

00:39:31.000 --> 00:39:35.000
And one of your options was just tokenize all the numbers…

00:39:35.000 --> 00:39:39.000
Straight up. Individually, right? Individually, and that's still what they're doing, probably?

00:39:39.000 --> 00:39:50.000
Often what they do is they tokenize numbers, um, like, 3 digits at a time, in reverse order, because the, um, later digits are more important than the earlier ones.
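
NOTE
A minimal sketch of the chunking just described, assuming "3 digits at a time, in reverse order" means grouping digits from the right so that chunks align with place value. The function names are hypothetical and this is not any particular model's tokenizer.
```python
def chunk_right_to_left(number_text: str, size: int = 3) -> list[str]:
    """Split a digit string into groups of `size`, aligned to the last digit."""
    chunks = []
    end = len(number_text)
    while end > 0:
        start = max(0, end - size)
        chunks.append(number_text[start:end])
        end = start
    return chunks[::-1]
def chunk_per_digit(number_text: str) -> list[str]:
    """Baseline for comparison: one token per digit."""
    return list(number_text)
print(chunk_right_to_left("1234567"))
print(chunk_per_digit("1234567"))
```
Grouping from the right keeps the ones, thousands, and millions groups in consistent positions regardless of how long the number is.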

00:39:50.000 --> 00:39:59.000
So there's, like, some of these twists, but really, when you are interacting now with, like, ChatGPT or Claude, it's just calling Python under the hood. So, um…

00:39:59.000 --> 00:40:09.000
We tried to really focus on settings where it's not clear, um, what function you would use to solve the problem. Like, there's no…

00:40:09.000 --> 00:40:15.000
Um, there's no Python script to run to make that prediction, because we wanted it to run directly on data.

00:40:15.000 --> 00:40:16.000
Um, and…

00:40:16.000 --> 00:40:22.000
Yeah, you're right that these models have not been good at arithmetic, and this is sort of the classic, like,

00:40:22.000 --> 00:40:28.000
Back in the day, when you would ask ChatGPT to do four-digit multiplication, it would fail over 90%, 95% of the time.

00:40:28.000 --> 00:40:33.000
And I checked more recently, um, because, you know, DeepSeek and some of these other models have come out, and even this

00:40:33.000 --> 00:40:36.000
700 billion parameter model.

00:40:36.000 --> 00:40:40.000
This is also multiplication tables, um,

00:40:40.000 --> 00:40:42.000
it maybe memorizes up to, like,

00:40:42.000 --> 00:40:47.000
10, or maybe, like, 12 by 12-digit accuracy.

00:40:47.000 --> 00:40:53.000
Um, but then it drops off, so the language models are not understanding what does multiplication…

00:40:53.000 --> 00:40:57.000
Meaning. True. They are equipped with now tool use.

00:40:57.000 --> 00:40:59.000
Yes. Calculator use. Yeah.

00:40:59.000 --> 00:41:10.000
The second one, it's not using a calculator? No. It doesn't understand, like, long multiplication or whatever, doesn't… It doesn't understand those rules, yeah.

00:41:10.000 --> 00:41:14.000
You need to have a use thing, you know, based off the fact that it's bad, or, like…

00:41:14.000 --> 00:41:19.000
I mean, obviously, like, I guess, how would you know whether or not it is using a calculator?

00:41:19.000 --> 00:41:23.000
Hmm. I mean, of course there's, um…

00:41:23.000 --> 00:41:31.000
I guess these are from papers where they're able to use the open source model directly, and so the researchers know

00:41:31.000 --> 00:41:34.000
how the model is being, uh, evaluated. Um, but you're right that in settings like

00:41:34.000 --> 00:41:43.000
working with GPT now, we don't have perfect insight into what is happening under the hood as it's making those calculations.

00:41:43.000 --> 00:41:49.000
They want to make you think that it can just calculate it, right? Right. Secretly, it's like…

00:41:49.000 --> 00:41:59.000
Yeah. Pretty misleading.

00:41:59.000 --> 00:42:03.000
Um…

00:42:03.000 --> 00:42:05.000
Cool, so one more quick comment on language, and then I'll move on to the last part.

00:42:05.000 --> 00:42:09.000
Oh, Professor? Professor Pettee?

00:42:09.000 --> 00:42:10.000
Yeah. Yes.

00:42:10.000 --> 00:42:15.000
Hello, can you hear me? Hi, uh, can you go back… can you share your screen again? Um, because it looks like a…

00:42:15.000 --> 00:42:16.000
Ah, thank you.

00:42:16.000 --> 00:42:18.000
Yeah, sorry.

00:42:18.000 --> 00:42:19.000
Yeah, of course.

00:42:19.000 --> 00:42:21.000
Thank you.

00:42:21.000 --> 00:42:22.000
There we go.

00:42:22.000 --> 00:42:24.000
Perfect, alright, that works. Thank you!

00:42:24.000 --> 00:42:27.000
Of course. Um…

00:42:27.000 --> 00:42:30.000
Great, so last comment on language is that, um,

00:42:30.000 --> 00:42:36.000
you know, maybe it's useful as a data format.

00:42:36.000 --> 00:42:37.000
Maybe it's useful as…

00:42:37.000 --> 00:42:41.000
context for a model as it makes predictions across diverse data.

00:42:41.000 --> 00:42:49.000
Maybe it's also useful as an input itself, like, it's arguably, like, a pretty important

00:42:49.000 --> 00:42:55.000
type of data that we use in physics. It embeds a lot of contextual information.

00:42:55.000 --> 00:42:57.000
And so, um, there's been some interesting work

00:42:57.000 --> 00:43:00.000
recently about, um, using

00:43:00.000 --> 00:43:06.000
papers, um, so it could be… it could be, um, text in the form of, like, um,

00:43:06.000 --> 00:43:12.000
captions on an image, but we don't have a lot of captioned images in physics. They're sort of hard to aggregate, and so, um,

00:43:12.000 --> 00:43:17.000
more useful, maybe, is the sense of scientific papers, and so you can imagine

00:43:17.000 --> 00:43:24.000
pairing scientific papers with actual spectra, or, like, scientific data sources.

00:43:24.000 --> 00:43:32.000
And using a contrastive loss to, um, force these representations to live close together in a model's embedding space.
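
NOTE
A minimal sketch of a contrastive objective, assuming a symmetric InfoNCE loss over matched (paper, spectrum) embedding pairs. The toy 2D embeddings, the temperature, and the function names are hypothetical; the paper mentioned in the talk may use a different loss.
```python
import math
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
def info_nce(text_embs, spec_embs, temperature=0.1):
    """Symmetric InfoNCE: index i in both lists is assumed to be a matched pair."""
    n = len(text_embs)
    logits = [[cosine(t, s) / temperature for s in spec_embs] for t in text_embs]
    loss = 0.0
    for i in range(n):
        row = logits[i]                          # text i against every spectrum
        col = [logits[j][i] for j in range(n)]   # spectrum i against every text
        loss -= row[i] - math.log(sum(math.exp(z) for z in row))
        loss -= col[i] - math.log(sum(math.exp(z) for z in col))
    return loss / (2 * n)
text = [[1.0, 0.0], [0.0, 1.0]]
spectra_aligned = [[0.9, 0.1], [0.1, 0.9]]
spectra_swapped = [[0.1, 0.9], [0.9, 0.1]]
aligned_loss = info_nce(text, spectra_aligned)
swapped_loss = info_nce(text, spectra_swapped)
print(f"aligned={aligned_loss:.4f}, swapped={swapped_loss:.4f}")
```
The loss is small when matched pairs sit close together and large when they do not, which is the pressure that pulls the two representations into one space.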

00:43:32.000 --> 00:43:40.000
And, um, this paper argues that this improves the quality of some of the downstream tasks that you might want to do on spectra alone.

00:43:40.000 --> 00:43:42.000
And so, um…

00:43:42.000 --> 00:43:47.000
Point being that it… I think we're gonna think creatively about how text…

00:43:47.000 --> 00:43:56.000
should be included into physics analyses in the future, and it could be that we'll just consider text to be a critical data source on its own.

00:43:56.000 --> 00:43:58.000
Um,

00:43:58.000 --> 00:43:59.000
Yep.

00:43:59.000 --> 00:44:05.000
Do I have a question about this? What do you? Yeah. What do you do if the papers are wrong?

00:44:05.000 --> 00:44:08.000
Um…

00:44:08.000 --> 00:44:10.000
I…

00:44:10.000 --> 00:44:17.000
I don't know if this paper addresses that explicitly, um…

00:44:17.000 --> 00:44:20.000
I think, actually, they may still be useful.

00:44:20.000 --> 00:44:23.000
simply as, um…

00:44:23.000 --> 00:44:27.000
embedding relationships between objects.

00:44:27.000 --> 00:44:29.000
you know, the idea that, like,

00:44:29.000 --> 00:44:30.000
galaxies have spectra.

00:44:30.000 --> 00:44:34.000
spectra have whatever parameters, um…

00:44:34.000 --> 00:44:43.000
you could think of the model as, like, frantically trying to learn, like, a world model of how the universe operates, and the spectra are giving it one source of insight there.

00:44:43.000 --> 00:44:51.000
And, um, at the same time, language is also providing its own world model for relationships between objects that we study, so…

00:44:51.000 --> 00:44:55.000
I think even if the numerical claims of papers are wrong,

00:44:55.000 --> 00:45:03.000
Um… I still think there might be utility in, um, just encoding those relationships in language form.

00:45:03.000 --> 00:45:04.000
in a representation.

00:45:04.000 --> 00:45:12.000
What if you… what if you… what if you're trying to answer some question about electromagnetism, and you train on papers that all are based on the ether?

00:45:12.000 --> 00:45:17.000
Yeah, clearly there's a limit, yeah. Um, and if you're… if you're…

00:45:17.000 --> 00:45:29.000
Um, claiming that something exists that is not represented in your spectra or in your other data, right? Like, maybe this claims that there's aliens, but we don't have any spectra showing evidence of aliens.

00:45:29.000 --> 00:45:35.000
You'd hope that a good contrastive loss should really only pair, like, what are…

00:45:35.000 --> 00:45:39.000
the shared semantic concepts that are present in both inputs.

00:45:39.000 --> 00:45:47.000
Um, and so you'd hope that the model would largely ignore noise, um, whether that's noise in the spectra or whether that's noise in the papers.

00:45:47.000 --> 00:45:55.000
Um, but it's a… it's a really cool point, and it's something I've been thinking about, is, like, in a basic contrastive loss setting,

00:45:55.000 --> 00:45:59.000
Are there ways that we can sculpt a latent space to have

00:45:59.000 --> 00:46:03.000
shared information and not shared information.

00:46:03.000 --> 00:46:09.000
And maybe the not shared information is noise that you want to ignore. So I think that there's more room for, like, uh…

00:46:09.000 --> 00:46:17.000
more curated ways of representat… representing your information, um, other than just putting everything together.

00:46:17.000 --> 00:46:20.000
And hoping that the model learns what's useful and what's not.

00:46:20.000 --> 00:46:22.000
Well, part of it could be it has the author list, right?

00:46:22.000 --> 00:46:30.000
I learned that, you know, certain authors don't pay attention to their paper. It's true.

00:46:30.000 --> 00:46:32.000
No, I think more likely, it's really just the… the…

00:46:32.000 --> 00:46:38.000
Um, it's weird because we think of papers as, you know, producing…

00:46:38.000 --> 00:46:43.000
overall scientific, um, results, and of course they do.

00:46:43.000 --> 00:46:49.000
But I think to a model like this, it's paying attention more to just sentence structure and, like, what sentences talk about what objects.

00:46:49.000 --> 00:46:56.000
And, um, which authors are associated with the useful papers, maybe.

00:46:56.000 --> 00:47:04.000
Um, just for time, I'll mention some of these last few slides, and then I'd love to get into more conversation.

00:47:04.000 --> 00:47:07.000
Um, so, uh, I've talked about the challenges of

00:47:07.000 --> 00:47:10.000
putting datasets together.

00:47:10.000 --> 00:47:15.000
And now I want to think about, as we scale this to even more extreme versions, um…

00:47:15.000 --> 00:47:17.000
what new problems might emerge.

00:47:17.000 --> 00:47:25.000
So I'm thinking about basically taking information from across an entire detector and using all of those inputs into a single model.

00:47:25.000 --> 00:47:32.000
And even potentially across many detectors, and maybe even different disciplines of physics, say.

00:47:32.000 --> 00:47:35.000
So, um, first just to define, um,

00:47:35.000 --> 00:47:38.000
modality is…

00:47:38.000 --> 00:47:43.000
kind of a fuzzy concept, but it's a specific way of perceiving the world, and it's often, um…

00:47:43.000 --> 00:47:49.000
Compared to the human senses. So you can think of, like, vision as different from touch, which is different from smell.

00:47:49.000 --> 00:47:56.000
Um, in the same way, um, in machine learning, we talk about how to ingest inputs of different formats and different sources.

00:47:56.000 --> 00:48:06.000
And in standard machine learning contexts, this really just refers to, like, a handful of specific data types, and usually they're, like, images, text, speech.

00:48:06.000 --> 00:48:12.000
Um, but in physics, uh, we have way more modalities, and um…

00:48:12.000 --> 00:48:17.000
Why do we even care about modalities and having multimodality in a model?

00:48:17.000 --> 00:48:19.000
Um,

00:48:19.000 --> 00:48:24.000
It's because the benefits have seemed kind of, um, evident in recent years.

00:48:24.000 --> 00:48:30.000
That, um, including multiple views or multiple modalities in a single model can improve performance.

00:48:30.000 --> 00:48:36.000
This is kind of the classic paper that demonstrated this, which was CLIP from OpenAI in 2021.

00:48:36.000 --> 00:48:41.000
And they compare, um, results trained on ImageNet, in this case, the examples they give are of bananas.

00:48:41.000 --> 00:48:49.000
Um, and they compare CLIP, which explicitly pairs images and text into a single representation.

00:48:49.000 --> 00:48:56.000
versus an ImageNet-trained model, which is a large-scale image model trained only on images and only on ImageNet in this case.

00:48:56.000 --> 00:49:07.000
And you can see that they get the same performance when they evaluate in-domain on the same bananas, the same datasets. But as you get sort of further out of domain on, say, like, more abstract

00:49:07.000 --> 00:49:13.000
ways of representing the concept of a banana to, like, things that, to me, don't even look like bananas down here.

00:49:13.000 --> 00:49:21.000
This CLIP representation remains really strong, um, at, at, uh, identifying these objects as bananas.

00:49:21.000 --> 00:49:30.000
Um, whereas the image model alone, uh, really drops off in performance. So the idea is, like, having multiple views with different formats.

00:49:30.000 --> 00:49:35.000
creates a more robust embedding or representation of a concept.

00:49:35.000 --> 00:49:44.000
And, uh, we've also seen this in science, too, so it's not just in industry. So, um, AstroCLIP, um, came out from, uh, my team a few years ago.

00:49:44.000 --> 00:49:49.000
Um, pairing images with spectra, in this case, instead of images and text.

00:49:49.000 --> 00:49:53.000
into a shared embedding space, and, um, this allows for

00:49:53.000 --> 00:50:10.000
querying of nearest neighbors, and so you could sort of search, like, given a galaxy, give me galaxies that look like this galaxy. You can search image to image, or you can search spectrum to spectrum, or you can also cross the modalities, say, give me a spectrum that corresponds to this image of a galaxy.
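
NOTE
A minimal sketch of nearest-neighbor search in a shared embedding space, assuming images and spectra have already been projected into the same space and are compared by cosine similarity. The galaxy names, the 3D vectors, and the helper nearest are hypothetical.
```python
import math
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
# Hypothetical embeddings, already living in one shared space.
image_embeddings = {
    "galaxy_a": [0.9, 0.1, 0.0],
    "galaxy_b": [0.1, 0.9, 0.2],
    "galaxy_c": [0.0, 0.2, 0.9],
}
spectrum_embeddings = {
    "galaxy_a": [0.8, 0.2, 0.1],
    "galaxy_b": [0.2, 0.8, 0.1],
    "galaxy_c": [0.1, 0.1, 0.9],
}
def nearest(query, candidates):
    """Return the name of the candidate most similar to the query vector."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))
query = image_embeddings["galaxy_b"]
# Cross-modal search: image query, spectrum candidates.
matching_spectrum = nearest(query, spectrum_embeddings)
# Same-modality search: image query, other images as candidates.
other_images = {k: v for k, v in image_embeddings.items() if k != "galaxy_b"}
similar_image = nearest(query, other_images)
print(matching_spectrum, similar_image)
```
Because both modalities share one space, the same function serves image-to-image, spectrum-to-spectrum, and cross-modal queries.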

00:50:10.000 --> 00:50:17.000
Um, and you can also do downstream tasks using this embedding as a starting point. So the, uh, aligned embedding space here,

00:50:17.000 --> 00:50:32.000
gets better predictions of redshifts of these galaxies than an unaligned representation that's only been trained on images. So, point being that, you know, there seems to be some useful information that has been retained in this combined latent space that exceeds

00:50:32.000 --> 00:50:37.000
the performance of an unaligned latent space.

00:50:37.000 --> 00:50:46.000
Now, uh, that's great for two modalities, but we have a lot of modalities in science, and… Okay, I actually want to understand that example better. So, um, so…

00:50:46.000 --> 00:50:50.000
what's the X and the Y axis of the…

00:50:50.000 --> 00:50:52.000
of those 2D plots?

00:50:52.000 --> 00:50:57.000
Down here. It's redshift, true redshift, predicted redshift.

00:50:57.000 --> 00:51:03.000
And so a perfect model would have, like, a line on Y equals X for perfect predictions.

00:51:03.000 --> 00:51:10.000
And this is just showing the R-squared score of, like, how well it aligns, um, with a perfect prediction. I'm just trying to predict the redshift.
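
NOTE A minimal sketch of the R-squared score mentioned in this cue, assuming the standard coefficient-of-determination definition; the exact metric behind the plot is not stated in the talk.
The redshift values below are made up for illustration.
```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
# Hypothetical true and predicted redshifts.
z_true = [0.1, 0.2, 0.3, 0.4]
z_pred = [0.12, 0.18, 0.33, 0.37]
score = r2_score(z_true, z_pred)
```
Predictions that fall exactly on the y = x line give a score of 1; scatter around that line pulls the score down.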

00:51:10.000 --> 00:51:12.000
Given what? In this case?

00:51:12.000 --> 00:51:24.000
This one is starting from the shared embedding space that has knowledge of images and spectra. Yeah. Um, and then this is just going from images, which should not give a great representation of redshift.

00:51:24.000 --> 00:51:32.000
So the images, uh, are you getting the redshift from the shapes of the galaxies, or, like, they're photometric? It's like a photometric redshift, or what?

00:51:32.000 --> 00:51:40.000
Um, for this one, I think it's not clear. I mean, I think basically we would expect these results to not be so strong.

00:51:40.000 --> 00:51:42.000
Uh, because I… it's…

00:51:42.000 --> 00:51:46.000
2D image, um, across a couple different bands.

00:51:46.000 --> 00:52:00.000
Um, so there's not a lot of information on redshift that's encoded there. And then, in the case of AstroCLIP, are you feeding it both an image and a spectrum, or just still an image, but it's going to a better latent space?

00:52:00.000 --> 00:52:09.000
It's starting from the embedding space here. So, the embedding space is like a fixed dimensionality, um, but it has been trained to align

00:52:09.000 --> 00:52:13.000
semantic information from spectra and images.

00:52:13.000 --> 00:52:18.000
However, how would it… sorry, how does it start? Does the embedding space have a ZHP

00:52:18.000 --> 00:52:25.000
Didn't it? Or how does it start from the… Oh, well, it starts from this, like, I think it's 768-dimensional space.

00:52:25.000 --> 00:52:31.000
And then we train another, like, lightweight neural network on top of that to predict the redshift.
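
NOTE A rough sketch of training a small head on top of a frozen embedding, as described in this cue. Assumptions: a plain linear head fitted by gradient descent stands in for the "lightweight neural network",
the embeddings are random 4-d vectors rather than the roughly 768-d space mentioned, and the "redshift" labels are synthetic. None of this is the actual AstroCLIP code.
```python
import random
def train_linear_head(embeddings, targets, lr=0.1, epochs=200):
    """Fit y ~ w.x + b by full-batch gradient descent; the embeddings stay fixed."""
    dim = len(embeddings[0])
    n = len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i in range(dim):
                grad_w[i] += 2 * err * x[i] / n
            grad_b += 2 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
def predict(w, b, x):
    """Apply the trained head to one embedding vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
rng = random.Random(0)
# Stand-in for frozen embeddings produced by a pretrained encoder.
frozen = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(64)]
# Synthetic labels that depend linearly on the embedding.
labels = [0.5 + 0.3 * x[0] - 0.2 * x[2] for x in frozen]
w, b = train_linear_head(frozen, labels)
mse = sum((predict(w, b, x) - y) ** 2 for x, y in zip(frozen, labels)) / len(frozen)
```
Only the head's weights are updated, so the comparison between an aligned and an unaligned embedding comes down to which fixed representation makes this small regression easier.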

00:52:31.000 --> 00:52:35.000
Right, but where does the true redshift come from?

00:52:35.000 --> 00:52:41.000
Uh, well, we'll have labeled data, and so we evaluate it on some labeled test set.

00:52:41.000 --> 00:52:44.000
But the label is not accessible to the model.

00:52:44.000 --> 00:52:52.000
Right, yeah. So, but you must be feeding it either images or spectra or both, then, right, to get to the embedding space.

00:52:52.000 --> 00:52:56.000
Like, you're not starting from the embedding space, you must be starting from…

00:52:56.000 --> 00:53:00.000
an image or something. Otherwise, it's not a fair comparison with the…

00:53:00.000 --> 00:53:02.000
Oh, sure, um…

00:53:02.000 --> 00:53:04.000
So, uh…

00:53:04.000 --> 00:53:08.000
the embedding space is of a single galaxy.

00:53:08.000 --> 00:53:15.000
It has, uh, access to both image and spectral representations, and so in this case, uh,

00:53:15.000 --> 00:53:18.000
Right, yes. These plots are both sort of the image

00:53:18.000 --> 00:53:29.000
portion of the embedding. There are other plots that show the spectrum-only, um, predictions, and they're much sharper, as you would expect, because the spectrum encodes much more information on the redshift.

00:53:29.000 --> 00:53:39.000
So this is really kind of probing, like, the image portion of these embeddings. So you can improve the image-based, like, regression or prediction of a redshift?

00:53:39.000 --> 00:53:41.000
By having this shared embedding. That's right, yeah.

00:53:41.000 --> 00:53:48.000
And the… if I showed the redshift version, you… the argument is kind of the opposite, of, like, the shared embedding space

00:53:48.000 --> 00:53:54.000
is still able to give good predictions of the redshift, and it hasn't lost information.

00:53:54.000 --> 00:53:57.000
Yeah, good question.

00:53:57.000 --> 00:53:59.000
Um…

00:53:59.000 --> 00:54:02.000
Uh, I think I probably just have, like,

00:54:02.000 --> 00:54:12.000
2 minutes left or something? We ask a lot of questions. We started a little late, so, yeah. I'll say 5 minutes, yeah. They're really good questions, so I appreciate it.

00:54:12.000 --> 00:54:17.000
Um, okay, so what I just showed was for two modalities.

00:54:17.000 --> 00:54:23.000
But it's not obvious how to extend it to more modalities than just images and spectra, but in physics,

00:54:23.000 --> 00:54:29.000
Our detectors capture a lot of information across many different, um, parts of the machine all at once.

00:54:29.000 --> 00:54:33.000
So, arguably, you know, if you think about all the components that go into

00:54:33.000 --> 00:54:39.000
a particle physics detector, or even a telescope. There are a lot of different modalities that you might argue are present.

00:54:39.000 --> 00:54:42.000
Um, uh, s…

00:54:42.000 --> 00:54:46.000
similar illustration of the same thing, you know, maybe you have tracks, maybe you have…

00:54:46.000 --> 00:54:52.000
different layers of calorimeter images, point clouds, you might want to include all of this information into a single model.

00:54:52.000 --> 00:54:55.000
How do we do that? Um…

00:54:55.000 --> 00:54:58.000
Well, first of all, we've, um…

00:54:58.000 --> 00:55:01.000
just produced some datasets to try to help enable these types of studies.

00:55:01.000 --> 00:55:05.000
So this is a large-scale data set called the Multimodal Universe.

00:55:05.000 --> 00:55:13.000
That, um, takes a lot of publicly available, um, astrophysics data and sort of makes it, like, machine learning ready. It sort of renders it in a format that's

00:55:13.000 --> 00:55:18.000
easy to combine, has clear metadata, and is released on Hugging Face.

00:55:18.000 --> 00:55:21.000
Um, and using this dataset, um,

00:55:21.000 --> 00:55:25.000
Uh, we trained this foundation model called AION-1,

00:55:25.000 --> 00:55:32.000
Which is an astrophysics foundation model that's been trained on 39 different modalities across 5 different surveys.

00:55:32.000 --> 00:55:38.000
So, there's a lot of cross-matching between the different surveys that are indicated by these lines here.

00:55:38.000 --> 00:55:44.000
And the types of modalities are sort of illustrated in these little cartoons. Um, so we have things like spectra,

00:55:44.000 --> 00:55:52.000
Um, galaxy images, um, metadata in the form of, uh, tabular information, etc.

00:55:52.000 --> 00:55:54.000
And, um…

00:55:54.000 --> 00:56:01.000
So in order to handle 39 different inputs with different formats, um, we trained a bunch of different tokenizers that were meant

00:56:01.000 --> 00:56:08.000
to tokenize, uh, all of the different types of properties the model might encounter, whether that's images, spectra, or more.

00:56:08.000 --> 00:56:10.000
It's all fed into this kind of, like,

00:56:10.000 --> 00:56:16.000
masked modeling, uh, framework, so it's GPT-like.

00:56:16.000 --> 00:56:21.000
But instead of predicting the next token, um,

00:56:21.000 --> 00:56:25.000
It's masking random tokens, um, as part of the overall inputs.
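
NOTE A minimal sketch of the random-masking step described in these cues, assuming a simple fixed mask ratio. The token names and the "[MASK]" placeholder are hypothetical;
the talk does not specify the model's actual masking schedule or tokenizers.
```python
import random
MASK = "[MASK]"  # hypothetical placeholder token
def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens; return (masked input, targets keyed by position)."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return masked, targets
# Made-up tokens standing in for tokenized image, spectrum, and tabular inputs.
sequence = ["img_17", "img_03", "spec_88", "spec_12", "tab_z", "tab_mag"]
masked, targets = mask_tokens(sequence, mask_ratio=0.5, rng=random.Random(0))
```
A model trained this way is asked to reconstruct `targets` from `masked`, so hidden positions can fall in any modality rather than always at the end of the sequence as in next-token prediction.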

00:56:25.000 --> 00:56:36.000
And then this overall embedding is then used as a baseline for a number of different downstream tasks that were not part of the training process. So that includes regressing physical parameters, um,

00:56:36.000 --> 00:56:48.000
or doing instance segmentation or image segmentation, so maybe you want to take a galaxy image and then train a lightweight convolutional neural network on top of it, and then that network will sort of identify, like, important parts of the galaxy image.

00:56:48.000 --> 00:56:55.000
Um, or you can do retrieval, so maybe you start with one type of image, like a strong lens.

00:56:55.000 --> 00:57:04.000
And then you can do a cosine similarity, and you can easily find other images across all of these different inputs that, um, are similar to that image.

00:57:04.000 --> 00:57:12.000
So, um, this was a cool stab at sort of how to extend this to, like, on the order of tens of modalities at once.

00:57:12.000 --> 00:57:16.000
Um, it's not clear if this is the only framework that is able to

00:57:16.000 --> 00:57:19.000
take in, um, dozens of modalities.

00:57:19.000 --> 00:57:24.000
Um, but I think this is a quickly developing space.

00:57:24.000 --> 00:57:32.000
Um, and here… and another exciting consequence of this is not just, like, you know, here's what it can do on astrophysics data.

00:57:32.000 --> 00:57:38.000
But, um, we're starting to see some signs that, uh, some of these models can also seemingly transfer out of domain, um, to new settings.

00:57:38.000 --> 00:57:46.000
So this is an early example of this, um, fluid dynamics foundation model that was trained on incompressible Navier-Stokes.

00:57:46.000 --> 00:57:51.000
And then evaluated, like, out of domain on compressible Navier-Stokes simulations.

00:57:51.000 --> 00:57:58.000
But in two different settings, where, um, there's the "near" regime, where the data kind of looks more similar to the training

00:57:58.000 --> 00:58:02.000
domain, and the "far" regime, where the data looks less similar.

00:58:02.000 --> 00:58:10.000
And in each of these two settings, the pre-trained model on physics data outperformed training from scratch on these baselines.

00:58:10.000 --> 00:58:20.000
As well as outperformed video models that had been trained on similar amounts of non-physics video data. And so there's, like, some early indications that, you know, training on physics data

00:58:20.000 --> 00:58:25.000
is useful for, um, transferring out of domain on physics data.

00:58:25.000 --> 00:58:33.000
And then more recently, um, we've seen a number of different foundation models, for instance, in particle physics, there are a handful, um,

00:58:33.000 --> 00:58:36.000
And it seems like some of these

00:58:36.000 --> 00:58:44.000
transfer surprisingly well out of domain also. So these are two evaluations of one particle physics foundation model.

00:58:44.000 --> 00:58:46.000
on, um, cosmological data.

00:58:46.000 --> 00:58:56.000
As well as, uh, molecular data. And so, basically, they use the particle physics model as the baseline, and then they do just a little bit of fine-tuning, and then they're able to use this model on

00:58:56.000 --> 00:59:00.000
very different types of, uh, data sources.

00:59:00.000 --> 00:59:08.000
And, uh, in such a way where the predictions, um, outperform training from scratch on these types of methods.

00:59:08.000 --> 00:59:13.000
So, I think these types of questions, or sorry, these types of results

00:59:13.000 --> 00:59:19.000
challenge what we think, um, in terms of training data set curation.

00:59:19.000 --> 00:59:20.000
for large-scale models.

00:59:20.000 --> 00:59:24.000
It seems like maybe at some threshold of model capacity.

00:59:24.000 --> 00:59:34.000
Um, historical definitions of, like, where scientific domains begin and end might not be so important to a model, and what might be more important is maybe, like,

00:59:34.000 --> 00:59:40.000
Is the data structurally similar to one another? And maybe the semantic content is not so useful.

00:59:40.000 --> 00:59:45.000
But, you know, we're kind of just scratching the surface of, like, what makes training data useful.

00:59:45.000 --> 00:59:50.000
Um, and what makes it easy to transfer to a new dataset out of domain.

00:59:50.000 --> 00:59:51.000
Um, so to sort of help…

00:59:51.000 --> 00:59:58.000
explore this territory, um, I've thought a bit about, like, mapping scientific research with knowledge graphs.

00:59:58.000 --> 01:00:02.000
To kind of get a sense of, like, what concepts seem to be related in scientific papers.

01:00:02.000 --> 01:00:10.000
And maybe this can help us define what types of datasets might be considered appropriate for in-domain and out-of-domain applications.

01:00:10.000 --> 01:00:20.000
Um, and, uh, where I'm really interested right now is a question of, like, the information content of these datasets, and can we say anything about them?

01:00:20.000 --> 01:00:24.000
Um, from the lens of information theory. Um…

01:00:24.000 --> 01:00:35.000
So, uh, uh, in particular, thinking about, like, the mutual information between different inputs, and wondering, like, what is the right combination of data that is similar versus data that's different?

01:00:35.000 --> 01:00:42.000
In order to make, like, a good recipe for a strong foundation model that can transfer very broadly across fields.

01:00:42.000 --> 01:00:50.000
Um, so to help with these types of studies, um, a team of collaborators at Flatiron and Wisconsin

01:00:50.000 --> 01:00:51.000
produced this, um,

01:00:51.000 --> 01:00:58.000
series of datasets. Sorry, I'm gonna take a quick drink.

01:00:58.000 --> 01:01:02.000
Basically, we produced a series of datasets that, um,

01:01:02.000 --> 01:01:07.000
can resemble non-trivial physics datasets, but nevertheless have known mutual information.

01:01:07.000 --> 01:01:19.000
And this could be true for two modalities, or it could be for dozens of modalities. And this is a real improvement on the current state, because right now it's hard to… it's very hard to estimate mutual information in real data.

01:01:19.000 --> 01:01:23.000
And if you want to know the true mutual information in your datasets,

01:01:23.000 --> 01:01:30.000
More often, you just have to stick with very simple data sets, like correlated Gaussians, if you want it to be, like, fully tractable.
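
NOTE A short sketch of why correlated Gaussians are the tractable case mentioned in this cue: for a bivariate Gaussian with correlation rho, the mutual information has the closed form I = -1/2 ln(1 - rho^2).
The function name is illustrative and the result is in nats.
```python
import math
def gaussian_mi(rho):
    """Mutual information (in nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)
# Independent variables carry no shared information; stronger correlation carries more.
mi_independent = gaussian_mi(0.0)
mi_moderate = gaussian_mi(0.5)
mi_strong = gaussian_mi(0.9)
```
The value depends only on rho, which is what makes this family convenient as a ground truth and also why it is far simpler than realistic physics data.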

01:01:30.000 --> 01:01:36.000
Um, and, uh, datasets like ours allow for kind of non-trivial studies.

01:01:36.000 --> 01:01:40.000
using information theory to understand the relationship between different data sets.

01:01:40.000 --> 01:01:45.000
And it also allows us to benchmark mutual information estimators to kind of give you a sense of, like,

01:01:45.000 --> 01:01:54.000
how hard it is for current mutual information estimators to actually get the right answer, um, where all of these dots refer to different, um, popular mutual information estimators.

01:01:54.000 --> 01:02:00.000
the perfect estimation would be this red line. And there's a lot of variation, so, um…

01:02:00.000 --> 01:02:08.000
Uh, it's very hard to know the ground-truth mutual information in real data, but if we had access to that, it might really change how we think about representing our data in a model.

01:02:08.000 --> 01:02:12.000
So I'm optimistic that stuff like this will help sketch that out.
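
NOTE A rough sketch of the benchmarking idea from the preceding cues: draw data whose mutual information is known, run an estimator, and compare with the truth.
Assumptions: correlated Gaussians stand in for the physics-like datasets, and a simple histogram plug-in estimator stands in for the estimators shown on the slide; neither is the team's actual setup.
```python
import math
import random
def sample_correlated(rho, n, rng):
    """Draw n pairs (x, y) from a standard bivariate Gaussian with correlation rho."""
    pairs = []
    for _ in range(n):
        x, e = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((x, rho * x + math.sqrt(1 - rho ** 2) * e))
    return pairs
def binned_mi(pairs, bins=8, lo=-3.0, hi=3.0):
    """Plug-in MI estimate (nats) from a 2-d histogram; deliberately simple."""
    def index(v):
        clipped = min(max(v, lo), hi - 1e-9)
        return int((clipped - lo) / (hi - lo) * bins)
    joint = {}
    for x, y in pairs:
        key = (index(x), index(y))
        joint[key] = joint.get(key, 0) + 1
    n = len(pairs)
    px, py = {}, {}
    for (i, j), c in joint.items():
        px[i] = px.get(i, 0) + c
        py[j] = py.get(j, 0) + c
    return sum(c / n * math.log(c * n / (px[i] * py[j])) for (i, j), c in joint.items())
rho = 0.8
truth = -0.5 * math.log(1 - rho ** 2)  # closed form for a bivariate Gaussian
estimate = binned_mi(sample_correlated(rho, 20000, random.Random(0)))
gap = estimate - truth  # how far the estimator lands from the known answer
```
Even in this easy case the estimate depends on choices such as the number of bins, which hints at why estimators can disagree so much on harder data.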

01:02:12.000 --> 01:02:15.000
Um…

01:02:15.000 --> 01:02:18.000
So, in conclusion, basically, um,

01:02:18.000 --> 01:02:25.000
thinking about, uh, what we've seen in other industry foundation models,

01:02:25.000 --> 01:02:34.000
is that, um, a lot of their… a lot of the excitement around foundation models in industry is not just the fact that they are large or trained on diverse data.

01:02:34.000 --> 01:02:37.000
But it's that, at a certain threshold,

01:02:37.000 --> 01:02:40.000
They exhibited these emergent capacities.

01:02:40.000 --> 01:02:48.000
that were not necessarily expected. Um, so this is as a function of model scale on the x-axis, and then the, um…

01:02:48.000 --> 01:02:51.000
figure of merit of various, um,

01:02:51.000 --> 01:02:54.000
various tasks, um…

01:02:54.000 --> 01:03:07.000
And, uh, I think this really drove a lot of the excitement into investigating foundation models, is that it seemed like emergent properties were suddenly happening at certain scales, um, and also with a certain amount of

01:03:07.000 --> 01:03:12.000
diverse data, um, and a certain amount of, uh, size of the overall data set.

01:03:12.000 --> 01:03:19.000
Um, etc. So I think it's still an open question, like, will we see something like this in scientific foundation models? So…

01:03:19.000 --> 01:03:28.000
If we imagine training larger models on very diverse physics datasets, should we also expect some kind of emergent properties?

01:03:28.000 --> 01:03:33.000
Um, I don't know the answer to this, and part of my…

01:03:33.000 --> 01:03:35.000
messaging, I think, for this talk, is that

01:03:35.000 --> 01:03:43.000
I still think that this exercise of, um, creating representations of diverse physics data is still useful.

01:03:43.000 --> 01:03:46.000
Even if this is… if this never happens, um…

01:03:46.000 --> 01:03:56.000
I think just putting these different layers of the map in conversation will likely point out really interesting, um, patterns across scales in our data.

01:03:56.000 --> 01:04:00.000
And, uh, will also just enable us to be

01:04:00.000 --> 01:04:04.000
more playful and more curious with our… with our analyses.

01:04:04.000 --> 01:04:09.000
So, um…

01:04:09.000 --> 01:04:14.000
Normally, I like to talk about the title of my talk here, but I think, um…

01:04:14.000 --> 01:04:22.000
I think maybe just for time, um, I can pause here and, uh, we can take any questions, and if anyone wants to know about the title of my talk, you can stay after and ask me about that.

01:04:22.000 --> 01:04:29.000
Thanks for your time.

01:04:29.000 --> 01:04:33.000
Hey, uh, thank you for the very nice talk, and we have time for questions.